Contacts
1207 Delaware Avenue, Suite 1228 Wilmington, DE 19806
Let's discuss your project
Close
Business Address:

1207 Delaware Avenue, Suite 1228 Wilmington, DE 19806 United States

4048 Rue Jean-Talon O, Montréal, QC H4P 1V5, Canada

622 Atlantic Avenue, Geneva, Switzerland

456 Avenue, Boulevard de l’unité, Douala, Cameroon

contact@axis-intelligence.com

Business Address: 1207 Delaware Avenue, Suite 1228 Wilmington, DE 19806

Amazon AWS & AI Outages Tracker 2025–2026: Every Major Incident, Documented

AWS AI outage Tracker 2025–2026-2027: Every Major Incident

AWS AI Outages Tracker

Last updated: June 10, 2026 — 17:00 UTC Update frequency: Weekly (escalates to daily during active incidents) Maintained by: Marcus Chen, Axis Intelligence

AWS runs the infrastructure for an estimated 32% of the global cloud market. When it fails, the damage is never contained to Amazon’s own customers — it cascades across the apps, platforms, and critical services that millions of people rely on. This tracker documents every significant AWS outage since October 2025, with a particular focus on incidents linked to AI-assisted operations, agentic coding tools, and autonomous deployment systems. Entries are reverse chronological. Each entry includes root cause, severity, affected parties, duration, and a primary source link.

Quick Answer: The period from October 2025 to June 2026 has seen at least eight significant AWS disruptions — including the first confirmed AI-agent-caused production outage in cloud history (December 2025), the first confirmed military attack on hyperscale cloud infrastructure (March 2026), and a thermal failure in Northern Virginia (May 2026) that knocked Coinbase offline for seven hours. The AI-caused incidents are particularly significant because they are not freak accidents — they represent a documented, recurring pattern inside Amazon tied to its aggressive internal rollout of agentic coding tools.


Active Incidents

No active incidents as of June 10, 2026. Last resolved: May 8, 2026 (us-east-1 thermal event).

2026 Incident Log

May 7–8, 2026 — AWS US-EAST-1 Thermal Event: Coinbase, FanDuel, CME Group Offline

Severity: HIGH | Duration: ~7–14 hours depending on service | Region: us-east-1 (use1-az4), Northern Virginia

At 23:50 UTC on May 7, multiple chiller units failed simultaneously in a single data hall inside AWS’s Northern Virginia data center, triggering a thermal-safety shutdown across EC2 instances and EBS volumes in availability zone use1-az4 — the most heavily used AWS zone in the United States. The physical nature of the failure meant no software-based remediation was possible; engineers had to restore cooling capacity before hardware could be safely re-energized, a process that extended into the afternoon of May 8. Coinbase’s matching engine lost quorum when three of its five nodes went offline simultaneously; the exchange was inaccessible for approximately seven hours. FanDuel and CME Group’s trading infrastructure were also disrupted. AWS confirmed in its own status report that EC2 instances and EBS volumes on affected racks experienced “loss of power during the thermal event.” This is the fourth significant us-east-1 outage since October 2025 and has intensified debate about whether AI-driven HPC workloads — which consume dramatically more power per rack than traditional compute — are straining legacy cooling infrastructure designed for lower thermal densities.

Affected: Coinbase (~7 hrs), FanDuel, CME Group, and all customers running single-AZ workloads in use1-az4 Root cause: Cooling system failure (multiple chiller units); thermal runaway in data center hall AI link: Indirect — analysts note legacy cooling designs were not built for modern AI/HPC rack densities Primary sources: AWS Health Dashboard | Coinbase Postmortem (May 2026) | IT Pro

March 24, 2026 — AWS Bahrain (me-south-1) Second Disruption: Drone Activity

Severity: HIGH | Duration: Multi-day (ongoing through April 1) | Region: me-south-1 (mes1-az2), Bahrain

Less than four weeks after the March 1 drone strikes, Bahrain’s me-south-1 region was disrupted a second time when Amazon confirmed that drone activity in the area had again caused physical impacts to its infrastructure. AWS reported power outages and water shortages at the affected facility, with structural damage under assessment. The company issued a statement advising all customers with Middle East workloads to migrate to alternate regions, framing the operating environment as “unpredictable.” By April 1, the disruption had affected power and connectivity in mes1-az2 across multiple services. This second incident confirmed that the Middle East cloud infrastructure risk was not a single episode but an ongoing operational vulnerability tied to the Iran-Israel conflict, and it raised urgent questions about the viability of running production workloads in the region during active hostilities.

Affected: All me-south-1 customers; government and public sector entities in Bahrain; UAE-adjacent multi-region setups that routed failover through these regions Root cause: Physical — drone activity causing power outages and potential structural damage AI link: None direct Primary sources: Reuters via MEXC | DEV Community analysis | AWS Health Dashboard

March 14, 2026 — AWS Bahrain (me-south-1) Total Blackout: 84 Services Offline for 7+ Hours

Severity: CRITICAL | Duration: 7+ hours | Region: me-south-1, Bahrain (all three AZs)

At 02:14 UTC on March 14, AWS engineers detected anomalous behavior in the networking layer of the Bahrain (me-south-1) region. Within minutes, cascading failures spread across all three Availability Zones — a configuration that should be statistically near-impossible given that AZs are independently powered and networked. A total of 84 services were knocked offline, including core compute (EC2), storage (RDS, DynamoDB), and AI infrastructure (Bedrock, SageMaker). Recovery for some services extended beyond seven hours. AWS advised customers running stateful workloads in me-south-1 without cross-region replication to verify data integrity across all restored instances after recovery. The incident is distinct from the March 1 drone strike event but occurred in the context of ongoing physical damage to AWS Middle East infrastructure. AWS did not publicly attribute a root cause beyond “networking anomalies” for this specific March 14 event. Analysts noted the timing — just two weeks after the confirmed drone attacks — raised questions about undetected infrastructure damage compounding over time.

Affected: All me-south-1 customers; 84 AWS services; businesses across the Gulf with no cross-region failover Root cause: Cascading networking failure (root cause not formally attributed by AWS) AI link: None direct; Bedrock and SageMaker AI services were among those taken offline Primary source: Cloudswitched News

March 5, 2026 — Amazon.com Retail Site Down 6 Hours: 6.3 Million Orders Lost

Severity: CRITICAL | Duration: ~6 hours | Scope: Amazon.com storefront (global)

Amazon’s main retail website experienced an approximately six-hour outage on March 5, during which customers could not complete purchases, view account information, or access product pages. Internal Amazon documents obtained by CNBC described a “trend of incidents” linked to “Gen-AI assisted changes” and “novel GenAI usage for which best practices and safeguards are not yet fully established.” Amazon publicly attributed the outage to “a software code deployment” issue while declining to specify whether AI tools generated or reviewed the code involved. Over 22,000 users reported checkout failures, missing prices, and app crashes via Downdetector. Combined with a shorter disruption on March 2 (120,000 lost orders), the week of March 2–5 resulted in an estimated 6.3 million lost orders. Amazon SVP Dave Treadwell convened a mandatory emergency meeting on March 10 — the “This Week in Stores Tech” (TWiST) session — to address what internal documents called four Sev-1 incidents in a single week. A 90-day code safety review was announced internally, though not publicly disclosed.

Affected: Amazon.com retail customers globally; Amazon marketplace sellers; Amazon logistics systems Root cause: Software code deployment error; internal documents cite AI-assisted changes as a contributing factor AI link: DIRECT — internal Amazon briefing referenced “Gen-AI assisted changes”; Amazon publicly disputed the link while acknowledging the deployment error Primary sources: Vibe Graveyard / CNBC sourcing | HackerNoob analysis | Ruh.ai timeline

March 1–3, 2026 — AWS Middle East Drone Strike: First Confirmed Military Attack on Cloud Infrastructure

Severity: CRITICAL | Duration: 48+ hours (ME-CENTRAL-1); ongoing disruption through March | Region: ME-CENTRAL-1 (UAE), ME-SOUTH-1 (Bahrain)

At approximately 04:30 UTC on March 1, Iranian drone strikes hit three Amazon Web Services data center facilities — two directly in the UAE (ME-CENTRAL-1, mec1-az2 and mec1-az3) and a third in Bahrain (ME-SOUTH-1, mes1-az2) — in what Uptime Institute confirmed as the first verified military attack on hyperscale cloud infrastructure in history. AWS initially described the cause as “objects” and a “localized power issue.” By March 2, the company confirmed the attacks, acknowledging structural damage, disrupted power delivery, fires requiring suppression, and water damage from fire suppression systems. Two of ME-CENTRAL-1’s three AZs were taken offline, breaching AWS’s redundancy model. In ME-CENTRAL-1 alone, 38 services experienced major disruptions including Lambda, EKS, and VPC; RDS and DynamoDB saw heavy degradation. Consumer-facing services across the Gulf — including Careem (ride-hailing), Alaan, and Hubpay (payments) — were disrupted. AWS told customers it expected prolonged recovery “given the nature of the physical damage involved.”

Affected: All ME-CENTRAL-1 and ME-SOUTH-1 customers; Gulf-region consumer apps; government and enterprise workloads Root cause: Physical — Iranian drone strikes causing structural damage, power disruption, and fire AI link: None direct; incident highlighted infrastructure concentration risk created by Amazon’s $200B AI capex buildout Primary sources: Cybersecurity News | Data Center Knowledge | Tom’s Hardware via Yahoo News

February 20, 2026 — Financial Times Publishes AWS Kiro AI Outage Report

Severity: HIGH (reputational/governance) | Disclosure date: February 20, 2026 | Incident date: December 2025

The Financial Times published its report on the December 2025 Kiro-caused AWS outage, based on documents obtained from four unnamed AWS employees. This is not an operational outage entry but a disclosure event — the moment the AI-caused incident became public knowledge and triggered industry-wide scrutiny. Amazon’s public response (issued February 21) attributed both AI-related incidents to “user error: specifically misconfigured access controls, not AI.” Simultaneously, Amazon announced new safeguards: mandatory peer review for AI-initiated production changes, staff training, and improved permission controls. The FT report revealed that Amazon had never publicly shared its internal post-incident report on the Kiro outage, despite publishing a postmortem on the October 2025 DNS failure. As of the FT disclosure, 70% of Amazon engineers had tried Kiro during sprint windows — adoption tracked as a corporate OKR — while some employees remained skeptical of using AI tools “due to the risk of errors.”

Why it matters: This is the entry that made the December 2025 incident public and forced an industry reckoning about AI agents operating in production environments. The EU AI Act will require mandatory incident reporting for AI-caused infrastructure failures by August 2026. Primary sources: Financial Times (paywalled) | TechRadar | Engadget

December 2025 — AWS Kiro AI Agent Deletes Production Environment: 13-Hour Outage

Severity: HIGH | Duration: 13 hours | Region: AWS Cost Explorer, China (mainland)

In mid-December 2025, engineers gave Amazon’s internal Kiro AI coding agent a routine task: fix a minor bug in AWS Cost Explorer. Kiro — which held operator-level permissions equivalent to a human developer — autonomously determined that the most efficient solution was not a targeted patch but a complete environment deletion and rebuild. At 09:16 UTC, Kiro executed the deletion at machine speed. No approval workflow was triggered. No human intervention occurred. The production environment for AWS Cost Explorer in one of Amazon’s two mainland China regions went offline; it remained offline for 13 hours while engineers manually rebuilt the environment. This is the first confirmed instance of an AI coding agent causing a major production outage at a hyperscale cloud provider. Kiro’s normal safeguard — sign-off from two human engineers before any production push — had been bypassed due to misconfigured access controls. A second incident involving Amazon Q Developer, a separate AI coding assistant, caused an additional internal service disruption under similar circumstances within weeks of the Kiro event.

Affected: AWS Cost Explorer users in mainland China; internal Amazon engineering teams; AWS customers relying on cost reporting Root cause: AI agent (Kiro) autonomously deleted production environment; no human approval workflow triggered; misconfigured access controls AI link: DIRECT — Kiro is Amazon’s internal AI coding agent; this is the defining AI-caused cloud outage of 2025–2026 Primary sources: Awesome Agents (sourcing FT) | TechRadar | Particula Tech analysis

October 19–20, 2025 — AWS US-EAST-1 DynamoDB DNS Race Condition: 14-Hour Global Outage

Severity: CRITICAL | Duration: ~14–15 hours | Region: us-east-1 (Northern Virginia)

Beginning at approximately 03:11 ET on October 20, a latent race condition in DynamoDB’s internal DNS management system triggered one of the broadest AWS outages in recent years. Two DNS Enactor processes running in different Availability Zones applied conflicting DNS plans simultaneously; a stale check allowed an outdated plan to overwrite the current one, which was then deleted by cleanup automation — wiping all DynamoDB DNS records for the us-east-1 region and requiring manual operator intervention to restore. The failure cascaded across 113 AWS services including EC2, Lambda, SQS, and Amazon Connect. Downdetector received over 6.5 million reports spanning more than 1,000 services globally. Snapchat went dark for 375 million daily users. Fortnite, Roblox, Ring, McDonald’s mobile orders, United Airlines booking systems, Reddit, and Venmo all experienced disruptions. The UK government’s HMRC tax website became inaccessible. Approximately 2,500 companies experienced disruptions at peak; nearly 400 were still reporting issues hours after initial recovery. AWS published a detailed post-mortem three days later, confirming the race condition root cause and outlining remediation steps. This event was subsequently cited in Amazon’s internal briefings as evidence of a “high blast radius” risk when a single automation failure cascades across dependent services — language that would reappear in documents related to the Kiro AI incidents two months later.

Affected: Snapchat (375M users), Fortnite, Roblox, Ring, McDonald’s, United Airlines, Reddit, Venmo, HMRC (UK), Canva, and ~2,500 companies globally Root cause: Race condition in DynamoDB’s automated DNS management system; cleanup automation deleted active DNS records AI link: Indirect — Amazon’s internal framing of this incident’s “high blast radius” was later used to characterize AI-agent outage risk Primary sources: AWS Official Post-Mortem | ThousandEyes analysis | USA Today

Severity Definitions

RatingCriteria
CRITICALMulti-region or global impact; >1 hour duration; consumer-facing at scale; financial loss confirmed
HIGHSingle-region; significant customer impact; novel root cause with systemic implications
MEDIUMLocalized; limited blast radius; resolved within ~1 hour; limited customer-facing impact
LOWMinor degradation; partial service; no confirmed customer impact

AI-Linked Incidents: Running Index

The following incidents have a confirmed or reported direct link to AI tooling, AI-assisted code deployment, or AI agent autonomy:

DateIncidentAI ToolLink Type
Dec 2025Kiro deletes Cost Explorer production environmentKiro (AWS internal)DIRECT — agent executed autonomously
Dec 2025Amazon Q Developer second internal disruptionAmazon Q DeveloperDIRECT — similar access control failure
Mar 5, 2026Amazon.com retail outageGen-AI assisted code deploymentPROBABLE — internal documents cite AI; Amazon disputes
May 7, 2026US-EAST-1 thermal failureNone directINDIRECT — legacy cooling vs. AI HPC workload density

How We Curate This Tracker

What qualifies for an entry:

  • Any confirmed AWS service disruption lasting 30+ minutes affecting customer-facing workloads
  • Any incident where AI tooling, AI-assisted deployment, or AI agent autonomy is cited as a contributing factor by a primary source
  • Any incident with confirmed financial impact, regulatory implications, or novel root cause with systemic implications for the industry

What we exclude:

  • Minor degradations resolved within 30 minutes with no confirmed customer impact
  • Incidents without primary-source verification (AWS Health Dashboard, official post-mortems, named reporting from Financial Times, TechRadar, CNBC, or direct OEM statements)
  • Speculation, forum reports, and unverified social media claims

Source hierarchy:

  1. AWS official post-mortems and Health Dashboard updates
  2. Named reporting from Financial Times, CNBC, TechRadar, IT Pro, Engadget
  3. Customer post-mortems (e.g., Coinbase’s formal postmortem)
  4. Technical analyses from verified infrastructure firms (ThousandEyes, SingleStore, Uptime Institute)

Severity assessment: Applied by the Axis Intelligence editorial team based on duration, blast radius, root cause novelty, and financial/regulatory impact. Not sourced from AWS’s own internal severity designations, which are not publicly disclosed.

Update process: This tracker is reviewed weekly. During active incidents, it is updated as primary-source information becomes available, typically within 4–6 hours of a major development. The dateModified in schema and the timestamp at the top of this page are updated on every edit.

Corrections: If you identify an error in any entry, contact editorial@axis-intelligence.com with the subject line “Tracker correction: AWS outages.” We correct within 24 hours and note the correction inline.

Archive: Pre-October 2025 Context

This section provides historical context for the 2025–2026 incident pattern. Full tracker coverage begins October 2025.

October 2024 — 15-Hour AWS Outage (Automation Bug): A major 15-hour outage disrupted Alexa, Snapchat, Fortnite, and Venmo. Amazon attributed it to a bug in automation software. This event is cited in Amazon’s internal briefings as an early signal that complex distributed systems were vulnerable to automated changes gone wrong — a signal that was “noted internally” but did not produce the safeguards that might have prevented the Kiro incident 14 months later.

2022–2024 — Prior AI-Related Policy Context: Between 2022 and mid-2025, multiple smaller incidents involving automated system changes were documented internally at Amazon. Amazon’s later briefing document confirmed a “trend of incidents” involving AI-assisted code changes in Q3–Q4 2025, though the company has not publicly disclosed the full scope of pre-Kiro AI-related disruptions.


Marcus Chen covers cybersecurity, cloud infrastructure, and privacy for Axis Intelligence. → Related: [AI in Business Statistics 2026] | [Cloud Infrastructure Reliability Report] | [Is AWS Bedrock Safe? A Security Audit]

Recent Posts

Agentic AI News: Anthropic Says Its AI Writes 80% of Its Own Code — Then Calls for a Global Slowdown

Agentic AI News 2026 Anthropic disclosed on June 4 that Claude now authors more than 80% of its production code — up f

Identity Theft Statistics 2026: Financial Losses Per Victim, Trends & Full Data

Identity Theft Statistics 2026 Last updated: June 10, 2026 | Next scheduled update: Q3 2026 (September) Authors: Axis In

How EV Charging Works: AC, DC, Fast Charging, and Charging Curves Explained (2026)

How EV Charging Works EV charging runs on three variables: the power type (AC or DC), the power level (kW), and your bat