What the AWS Outage Exposed About Continuity, Dependency, and Real Resilience

The recent AWS outage in the US-East-1 region wasn’t just another cloud hiccup. It was a leadership test for IT organizations across every industry: private sector, government, healthcare, finance. Some businesses stayed online. Most waited to recover. And everyone was reminded of an uncomfortable reality:

We’ve built mission-critical operations on cloud infrastructure, but we haven’t built a way to operate when it fails.

This wasn’t an attack. It wasn’t ransomware or a foreign adversary. It was an internal DNS automation failure at AWS. But the business impact was the same: systems down, transactions halted, customers locked out, revenue on hold.

What Actually Happened

On October 20, 2025, around 3 a.m. ET, AWS experienced a failure inside its DNS automation and monitoring systems. A change removed critical DNS records for DynamoDB service endpoints, and automated recovery didn’t fix it. Applications could no longer resolve service addresses, which caused cascading failures.

By 6:01 p.m. ET, AWS declared services recovered. But the damage was already done.
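
To make the failure mode concrete, here is a minimal sketch (in Python, using the endpoint name cited in incident reports) of the first thing every client does before talking to a service. During the outage, this step failed: the service was running, but its name returned no answer.

```python
import socket

# The endpoint family that stopped resolving during the outage.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def resolves(hostname: str) -> bool:
    """Return True if the hostname has a DNS answer."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        # The service may be perfectly healthy, but clients cannot find it.
        return False

print(f"{ENDPOINT}: {'resolves' if resolves(ENDPOINT) else 'NO DNS ANSWER'}")
```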

Who Was Affected and How Bad Was It?

This wasn’t a small or regional outage.

  • According to the Guardian and Reuters, thousands of companies were affected globally. Some sources estimate well over 2,000 directly impacted providers and platforms.

  • Millions of end users reported failures, with more than 4 million outage reports recorded at peak.

  • Major platforms were affected, including Venmo, Ring, airlines, healthcare portals, banks, authentication platforms, retail sites, smart home platforms, and gaming services like Fortnite and Roblox.

  • Even government agencies and enterprises with segmentation, zoning, or isolated VPCs experienced interruptions, because the failure was at the provider layer, not inside customer environments.

How Much Did It Cost Businesses?

No one has a final number, but we do know this:

  • CyberCube estimates insured financial losses between $38 million and $581 million from this one event.

  • Broader economic impact, including lost revenue, productivity, and supply chain disruption, is expected to land in the hundreds of millions to billions.

  • Industry benchmarking tells us many enterprises lose between $300,000 and $5 million per hour of downtime. Some exceed $16,000 per minute.

  • A mid-size ecommerce business processing $100,000 a day in sales could have lost around $60,000 over the roughly 15-hour outage window, not including customer churn or recovery cost.

Multiply that across thousands of businesses over a 12- to 15-hour disruption, and the numbers add up quickly.
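
The arithmetic behind that ecommerce estimate is simple, assuming sales are spread evenly across the day (they rarely are, so treat this as a floor-level sketch):

```python
daily_revenue = 100_000              # the mid-size ecommerce example above
outage_hours = 15                    # approximate length of the disruption
hourly_revenue = daily_revenue / 24  # assumes sales are spread evenly
estimated_loss = hourly_revenue * outage_hours

print(f"~${estimated_loss:,.0f} in stalled sales")  # ~$62,500
```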

Why Some Organizations Stayed Online and Most Didn’t

This outage revealed a clear split.

  • Businesses with tested failover to Google Cloud, Azure, on-prem systems, or active-active infrastructure continued operating or saw minimal disruption.

  • Organizations fully dependent on AWS US-East-1, often because of SaaS platforms, APIs, or identity providers, paused operations until AWS recovered.

  • Enterprises and government agencies with zoning, segmentation, or VPC separation were still affected, because control plane and DNS failures sit above those protections.

  • Organizations relying on Azure as a fallback were reminded that Azure suffered a similar outage in 2023.

Azure Had the Same Problem with a Different Cause

In 2023, Azure experienced a major global disruption. A surge in network traffic and DDoS attempts made the Azure Portal and core services unavailable. Management functions, authentication, and critical services stalled.

Azure did not publish how many businesses were affected; AWS provided more detail about this recent event. Different clouds, different trigger points, same result: operations stopped, IT teams scrambled, and executives wanted answers.

This isn’t about which cloud is safer. It’s about whether your business can keep operating when any cloud fails.

The Real Issue: Over-Dependency Without a Backup Plan

Three big problems were exposed.

  1. Most organizations don’t have a way to function without their primary cloud provider. They have redundancy inside AWS or Azure, but nothing outside.

  2. We’ve mistaken segmentation for resilience. Zoning, VPCs, and multi-AZ architectures are valuable, but they don’t protect against failures in DNS, identity, or control plane systems at the provider.

  3. Business continuity hasn’t evolved at the same pace as cloud adoption. We moved applications, identity, and data to the cloud, but we never built an exit ramp for when the cloud is the problem.

What CIOs, CISOs, and IT Directors Should Be Doing Now

1. Map All Dependencies

Not just your servers. Look at identity, authentication, DNS, APIs, SaaS platforms, payments, analytics, HR systems, payroll tools. If it stops working when AWS or Azure is down, it’s a dependency you need to know about.
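
A minimal sketch of turning that inventory into something you can actually run, with illustrative hostnames standing in for your real identity, payments, and SaaS endpoints:

```python
import socket
import urllib.error
import urllib.request

# Illustrative inventory; replace with your real identity, DNS, API,
# SaaS, payments, analytics, HR, and payroll endpoints.
DEPENDENCIES = {
    "identity":  "login.example-idp.com",
    "payments":  "api.example-payments.com",
    "analytics": "collect.example-analytics.com",
}

def probe(hostname: str) -> str:
    try:
        socket.getaddrinfo(hostname, 443)            # does the name resolve?
    except socket.gaierror:
        return "DNS FAILURE"
    try:
        urllib.request.urlopen(f"https://{hostname}", timeout=5)
        return "reachable"
    except urllib.error.HTTPError:
        return "reachable (HTTP error)"              # the server still answered
    except Exception:
        return "unreachable"

for name, host in DEPENDENCIES.items():
    print(f"{name:10} {host}: {probe(host)}")
```

Run something like this on a schedule and you have a living dependency map instead of a stale spreadsheet.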

2. Build Continuity Outside the Cloud Provider

That could mean a secondary provider like Google Cloud or Azure, on-prem hardware, or private infrastructure for core workloads. It could include secondary DNS, offline authentication modes, or replicated data stores that aren’t tied to a single vendor.
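
As a sketch of what "outside the cloud provider" can look like at the application layer, here is a client-side failover pattern. The endpoints are placeholders for a primary service and a replica hosted with a second provider or on-prem:

```python
import urllib.request

# Placeholder replicas of the same service on independent infrastructure.
SERVICE_ENDPOINTS = [
    "https://api.example.com",     # primary, hosted on the main cloud provider
    "https://api-dr.example.net",  # replica on a second provider or on-prem
]

def fetch_with_failover(path: str) -> bytes:
    """Try each endpoint in order; fail only if every one is down."""
    last_error = None
    for base in SERVICE_ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=5) as resp:
                return resp.read()
        except Exception as exc:   # DNS failure, timeout, connection refused
            last_error = exc
    raise RuntimeError(f"all endpoints failed: {last_error}")
```

The client logic is the easy part. Keeping the replica's data and identity stores current is the real work, which is why continuity has to be designed rather than improvised.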

3. Test What Happens During Failure

Run real tabletop exercises. If AWS goes down at 10 a.m., what do you do in the first hour? Can you authenticate employees? Take payments? Access critical data? Communicate with customers and leadership?
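
Tabletop discussion is a start; injected failure is better. A minimal sketch, assuming a Python application, that simulates the provider's DNS disappearing by monkeypatching the resolver before running your continuity checks:

```python
import socket

_real_getaddrinfo = socket.getaddrinfo

def _broken_getaddrinfo(host, *args, **kwargs):
    # Simulate the provider-layer failure: names under amazonaws.com
    # stop resolving, the way the DynamoDB endpoints did.
    if isinstance(host, str) and host.endswith("amazonaws.com"):
        raise socket.gaierror("simulated provider DNS outage")
    return _real_getaddrinfo(host, *args, **kwargs)

def can_resolve(host: str) -> bool:
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False

socket.getaddrinfo = _broken_getaddrinfo
try:
    # The drill: with provider DNS "gone", which critical paths survive?
    assert not can_resolve("dynamodb.us-east-1.amazonaws.com")
    print("AWS DNS is 'down' -- exercise your fallback paths now")
finally:
    socket.getaddrinfo = _real_getaddrinfo  # always restore the real resolver
```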

4. Explain This Clearly to Executives

Don’t sell panic. Sell transparency and readiness.

Cloud is still the right strategy, but it isn’t immune to failure. Our goal isn’t to move away from AWS or Azure. It’s to make sure we can keep working when they don’t.

This isn’t only an IT problem. It’s business continuity.

Final Thought

Cloud is still the best place to build, scale, and secure modern systems. But uptime isn’t guaranteed. The AWS outage wasn’t a random glitch. It proved that cloud dependency without continuity is a business risk.

Some organizations stayed online. Others waited in the dark. The difference wasn’t budget or size. It was planning.

The new measure of IT leadership isn’t how well things run on a good day. It’s how well they run when the cloud isn’t there.
