LLM Production Incident Tracker 2026
A living database of real-world incidents involving frontier large language models in production deployments. Every incident is verified against multiple sources, attributed to a specific model where possible, scored for severity, and tagged with the defensive lessons it teaches. Updated weekly.
Inaugural release: April 2026 · Updated weekly · Last update: April 30, 2026 · Maintained by Axis Intelligence Research
Key Finding
We have documented and verified 187 production incidents involving frontier large language models between January 2024 and April 2026. 62% involved generative output failures (hallucination, fabrication, or harmful content) reaching real users. 23% involved data exposure or privacy violations. The remaining 15% span agent failures, autonomous system errors, and tool-misuse incidents.
The single most consequential pattern: in 71 of 187 incidents (38%), the deploying organization had no documented detection mechanism — the incident only became visible because an end user, journalist, or external researcher flagged it publicly. This is the highest-leverage gap we observed: organizations are deploying LLMs faster than they are deploying the means to monitor what those LLMs produce.
How This Tracker Differs From What Already Exists
The AI Incident Database (AIID) maintained by the Responsible AI Collaborative is the foundational public catalog of AI incidents. It is excellent at what it does, and we draw on it as one of our source streams. We are not trying to replace it.
Our tracker fills a different gap. AIID is a generalist catalog spanning every kind of AI system — autonomous vehicles, vision systems, recommender systems, robotics, and language models — across nearly two decades. By design, it is broad and inclusive.
The Production LLM Incident Tracker is narrow and specific:
- Scope: Frontier large language models (ChatGPT, Claude, Gemini, Llama, Mistral, Grok, and major derivatives) in production deployments. We exclude vision-only systems, autonomous vehicles, and pre-LLM machine learning incidents.
- Time window: January 2024 onward, the period in which production LLM deployment became widespread.
- Attribution: Where possible, each incident is attributed to the specific model and version. AIID typically catalogs incidents at the system level; we go to the model level when the evidence supports it.
- Severity scoring: Each incident receives a Production Incident Severity Score (PISS, defined below). AIID does not currently score severity in a standardized way.
- Defensive lens: Each entry is tagged with the detection gap, mitigation lesson, and recommendation it generates. The goal is operational — what should defenders do differently because this incident happened?
Where an incident in our database is also catalogued in AIID, we cross-reference the AIID ID. Researchers should treat this tracker as a focused supplement to AIID, not a competitor.
What Counts as a Production Incident
Not every output a model gets wrong is an “incident” in the sense used here. We use a strict definition:
A production incident is an event where:
- A frontier LLM was deployed in a real-world product or service used by people other than the developers.
- The model’s behavior caused, or had reasonable potential to cause, harm to a person, organization, or system.
- The event was confirmed by at least two independent sources (news reports, court filings, regulatory disclosures, vendor statements, peer-reviewed papers, or our own verification).
- The event is documentable — we can describe what happened in concrete terms.
Things we exclude from the tracker:
- Lab vulnerabilities not yet observed in production (those go in our AI Model Vulnerability Tracker).
- Unverified single-source claims, including unverified social media reports.
- Incidents predating January 2024 (covered by AIID).
- Hypothetical or red-team findings without production reproduction.
- “Bad outputs” without identifiable harm — a model getting a trivia question wrong is not an incident; a model providing wrong medical dosage information that a patient acted on is.
The threshold matters. Every entry in this database represents real verified harm or near-harm in production.
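For readers who consume the database programmatically, the definition above maps naturally onto a record shape. A minimal sketch in Python; the field names here are illustrative, not the published CSV columns:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class IncidentRecord:
    """Hypothetical shape of one tracker entry. Field names are
    illustrative; see the methodology document for the real schema."""
    incident_id: str                # e.g. "LPI-2026-0014"
    occurred: date                  # must fall on or after 2024-01-01
    model: str                      # specific model, or "undisclosed frontier model"
    category: str                   # one of the ten categories reported below
    sector: str
    harm_description: str           # what happened, in concrete terms
    sources: list[str] = field(default_factory=list)  # two or more independent sources
    aiid_id: str | None = None      # AIID cross-reference where applicable
```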
How We Find and Verify Incidents
Our intake pipeline runs continuously and draws from twelve source streams:
| Source Stream | Update Cadence | Notes |
|---|---|---|
| Tier-1 news monitoring | Daily | Reuters, AP, NYT, WSJ, FT, BBC, Bloomberg, Le Monde |
| Specialty tech press | Daily | Wired, MIT Technology Review, The Register, 404 Media, The Verge |
| Court filings (US, EU, UK) | Weekly | Federal court PACER searches, EU court records, UK judiciary |
| Regulatory filings | Weekly | FTC, FCC, EU AI Office, ICO, CNIL, FDA, FINRA |
| Academic preprints | Weekly | arXiv, SSRN |
| Vendor disclosures | Continuous | Official statements from model vendors and deployers |
| AI Incident Database (AIID) | Daily sync | Cross-reference for overlap |
| Bug bounty disclosures | Weekly | Public Bugcrowd, HackerOne disclosures |
| Security research blogs | Daily | Specialty AI security researcher publications |
| User-submitted reports | Continuous | Form on our research page; subject to verification |
| Reddit / Hacker News surfacing | Daily | Used for discovery only, never as a primary source |
| Our own AVI Tracker reproductions | Continuous | When a lab finding becomes a reported production incident |
For any candidate incident, we require two independent confirming sources before publication. If the only source is a single tweet, news article, or user report, the incident enters our review queue but does not become a published entry. Many candidates never survive verification; we estimate roughly 40% of intake is dropped at this stage.
We document the source list for each incident in the public database so readers can audit our verification.
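The two-source rule is mechanical enough to sketch. A minimal illustration of the gate, assuming (our assumption, not the tracker's exact test) that "independent" means distinct publisher domains:

```python
from urllib.parse import urlparse

def triage(source_urls: list[str]) -> str:
    """Route a candidate incident under the two-source rule. Counting
    distinct publisher domains as independent is an assumption about
    how independence is operationalized."""
    publishers = {urlparse(url).netloc for url in source_urls}
    return "publish" if len(publishers) >= 2 else "review_queue"

# Two articles from the same outlet do not clear the bar:
triage(["https://example-news.com/a", "https://example-news.com/b"])  # 'review_queue'
```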
The Production Incident Severity Score (PISS)
We score each incident on a 0.0–10.0 scale composed of four sub-scores. This approach mirrors the AVI score from our vulnerability tracker, with adaptations for production-incident context.
Affected Population (0.0–3.0)
How many people were affected by the incident.
| Score | Criterion |
|---|---|
| 3.0 | Confirmed impact on 1M+ people, or critical infrastructure |
| 2.5 | 100K–1M people affected |
| 2.0 | 10K–100K people, or single high-stakes population (e.g., children, patients) |
| 1.5 | 1K–10K people |
| 1.0 | 100–1K people |
| 0.5 | Fewer than 100, or one identifiable affected party |
| 0.0 | Near-miss with no confirmed affected parties |
Harm Severity (0.0–3.0)
The seriousness of the harm caused.
| Score | Criterion |
|---|---|
| 3.0 | Death, serious physical injury, or comparable irreversible harm |
| 2.5 | Major financial loss (individual >$10K or organizational >$1M); legal jeopardy for affected party; significant ongoing psychological harm |
| 2.0 | Confirmed privacy violation involving sensitive PII; substantive financial or legal harm |
| 1.5 | Reputational harm; minor financial loss; service degradation affecting livelihoods |
| 1.0 | Misinformation reaching large audiences without confirmed downstream harm |
| 0.5 | Embarrassing or policy-violating output without measurable downstream harm |
| 0.0 | No confirmed harm beyond the incident itself |
Detection Failure (0.0–3.0)
How the incident was detected, and how long detection took. Higher scores indicate worse detection.
| Score | Criterion |
|---|---|
| 3.0 | Detected only after media coverage or regulatory action; deployer was unaware |
| 2.5 | Detected by external researcher, customer complaint, or end user |
| 2.0 | Detected internally but only after substantial harm had occurred |
| 1.5 | Detected internally during routine review |
| 1.0 | Detected by automated monitoring within hours |
| 0.5 | Detected by monitoring before any user impact |
| 0.0 | Caught in pre-deployment testing (and therefore not actually a production incident) |
Vendor/Deployer Response (0.0–1.0)
The quality and timeliness of the response.
| Score | Criterion |
|---|---|
| 1.0 | No acknowledgment, denial, or hostile response to disclosure |
| 0.75 | Delayed acknowledgment (>30 days), partial mitigation |
| 0.5 | Acknowledged within 7–30 days, mitigation in progress |
| 0.25 | Rapid acknowledgment, mitigation shipped within 7 days |
| 0.0 | Detected and mitigated before disclosure became necessary |
Composite
PISS = Affected Population + Harm Severity + Detection Failure + Vendor Response
- Critical (8.0–10.0): Severe, broad, poorly detected, slow response. 11 incidents in current dataset.
- High (6.0–7.9): Serious incidents with significant deficiencies. 38 incidents.
- Medium (3.0–5.9): Confirmed harm with bounded impact. 84 incidents.
- Low (0.1–2.9): Documented incidents with limited impact or strong response. 54 incidents.
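As a worked example, here is the composite calculation and band mapping as a minimal Python sketch (sub-score values come from the rubric tables above):

```python
def piss(population: float, harm: float, detection: float, response: float) -> tuple[float, str]:
    """Sum the four sub-scores and map the composite to a severity band.
    Population, harm, and detection run 0.0-3.0; response runs 0.0-1.0."""
    total = population + harm + detection + response
    if total >= 8.0:
        band = "Critical"
    elif total >= 6.0:
        band = "High"
    elif total >= 3.0:
        band = "Medium"
    else:
        band = "Low"
    return round(total, 1), band

piss(2.0, 2.5, 2.5, 0.5)  # (7.5, 'High'); hypothetical sub-scores, not a real incident
```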
The PISS rubric is published under CC BY 4.0. We hope other researchers adopt it. A standardized score makes year-over-year and cross-deployer comparisons meaningful.
Aggregate Findings: 2024–April 2026
Distribution by category (n=187)
| Category | Count | % of Total | Avg. PISS |
|---|---|---|---|
| Hallucination causing user harm | 47 | 25.1% | 5.4 |
| Fabricated citations / sources | 31 | 16.6% | 4.8 |
| Sensitive data exposure | 29 | 15.5% | 7.1 |
| Discriminatory output | 18 | 9.6% | 6.2 |
| Agent / tool misuse | 17 | 9.1% | 6.8 |
| Policy-violating content reaching users | 16 | 8.6% | 4.5 |
| Misattribution of statements to real people | 12 | 6.4% | 5.7 |
| Unauthorized actions in agent contexts | 10 | 5.3% | 7.4 |
| Cross-user data leakage | 4 | 2.1% | 8.6 |
| Autonomous system override failures | 3 | 1.6% | 8.9 |
Distribution by sector
| Sector | Incidents | Notable patterns |
|---|---|---|
| Legal | 34 | Fabricated case citations dominate |
| Healthcare | 26 | Misdiagnosis assistance, dosage errors, triage misclassification |
| Financial services | 23 | KYC failures, fraudulent advice, loan-decision biases |
| Government / public sector | 21 | Benefits-determination errors, immigration document errors |
| Media / journalism | 19 | Fabricated quotes, attribution errors, image-text injection in articles |
| Education | 16 | Plagiarism detection false positives, automated grading errors |
| Customer service | 14 | Promised refunds/services the company did not authorize |
| Recruitment | 12 | Discriminatory ranking, fake-credential acceptance |
| Retail / e-commerce | 9 | Product description hallucinations, return-policy errors |
| Other | 13 | Mixed — mostly internal tooling failures |
Distribution by detection mechanism
This distribution is the most strategically important finding in the report.
- End user reported it publicly: 71 incidents (38%)
- External researcher / journalist: 38 incidents (20%)
- Regulator / court: 24 incidents (13%)
- Internal review (post-harm): 31 incidents (17%)
- Automated monitoring: 18 incidents (10%)
- Missed by pre-deployment testing, caught later in production: 5 incidents (3%)
The implication is unambiguous: 71% of production LLM incidents (133 of 187) were detected by people external to the organization that deployed the model: end users, researchers, journalists, regulators, and courts. Internal monitoring has not kept pace with deployment surface area. This is a structural gap, not a failure of any particular vendor.
Findings by Category
Hallucination Causing User Harm (47 incidents)
The largest single category. These are incidents where the model produced fluent, confident, factually wrong information that a user acted on. The harm pattern is consistent across sectors: a user trusts the output, takes action based on it, and discovers the error only after consequences have begun.
Notable subclasses:
- Fabricated citations in legal filings (19 incidents). Lawyers submitting briefs with non-existent case citations remains the single most consistent failure mode. Despite court sanctions and bar association warnings, new incidents continue at roughly 1 per month. The pattern: junior attorneys using LLMs without sufficient verification, in jurisdictions where the practice norms have not caught up.
- Medical dosage errors (8 incidents). Patients or family members receiving incorrect dosage guidance from consumer chatbots. None of the cases in our dataset involved a death, but several involved hospitalization.
- Wrong real-world facts acted on (12 incidents). Travel disruptions from fabricated visa requirements, contract errors from misstated regulations, financial decisions from invented historical data.
- Confidently wrong code (8 incidents). Production deployment of code containing fabricated APIs or hallucinated library functions, leading to security or availability incidents.
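The confidently-wrong-code subclass has a cheap partial defense: before generated code ships, verify that every imported package actually resolves in the target environment. A minimal sketch, assuming a CI step that runs with production dependencies installed (Python 3.10+):

```python
import ast
import importlib.util
import sys

def unresolvable_imports(source: str) -> set[str]:
    """Return top-level modules imported by generated code that do not
    resolve in the current environment: a cheap screen for hallucinated
    or typo-squatted dependencies before anything gets installed."""
    missing: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module.split(".")[0]]
        else:
            continue
        for name in names:
            if name not in sys.stdlib_module_names and importlib.util.find_spec(name) is None:
                missing.add(name)
    return missing
```

This catches fabricated package names, not fabricated functions on real packages; pair it with tests that exercise the generated call sites.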
Sensitive Data Exposure (29 incidents)
Cases where personal information was disclosed inappropriately. This category has the second-highest average PISS in the dataset.
Sub-patterns:
- Cross-session data bleed (4 confirmed cases, all patched). All vendors have invested heavily in session isolation; the incidents we documented were edge cases involving caching layers or research preview features.
- System prompt extraction revealing sensitive context (11 cases). Customer-facing chatbots whose system prompts contained internal information — pricing, customer data fragments, product roadmap — extracted by users.
- Training data regurgitation (8 cases). Long-tail prompts eliciting verbatim PII or proprietary content from training data. Most cases involve open-weight models in self-hosted deployments without output filters.
- PII in agent logs (6 cases). Customer support agent deployments where logged conversations contained PII not properly redacted, exposed via subsequent breaches or disclosure errors.
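The agent-log sub-pattern has a well-understood partial mitigation: scrub structured PII before transcripts are persisted, not after a breach. A minimal sketch; the patterns are illustrative and catch only well-formed identifiers, so production deployments typically layer an NER-based scrubber on top:

```python
import re

# Illustrative patterns: structured identifiers only, not names
# or addresses expressed in free text.
REDACTIONS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
    (re.compile(r"\+?\d[\d -]{8,14}\d"), "[PHONE]"),
]

def redact(text: str) -> str:
    """Apply redactions before the transcript reaches the log store."""
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text
```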
Discriminatory Output (18 incidents)
Patterns where output systematically disadvantaged a protected class. We catalog these conservatively — we require evidence of disparate outcomes, not just isolated examples.
- Ten incidents involve recruitment screening tools producing demographically skewed candidate rankings.
- Five involve loan or credit decision support showing race or gender correlations beyond what input variables explained.
- Three involve consumer-facing chatbots producing differentiated responses based on inferred user demographics.
In several cases, the deploying organizations had bias-testing processes that the deployed model passed before launch. Production behavior diverged from test behavior — a pattern we document repeatedly across sectors.
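That divergence argues for continuous disparate-impact monitoring on live traffic rather than one-time pre-launch audits. A minimal sketch of the classic four-fifths (80%) selection-rate screen over a rolling window of production decisions (a screen for investigation, not a legal determination):

```python
def four_fifths_check(outcomes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """outcomes maps group -> (selected, total) over a rolling window.
    Returns each group's selection-rate ratio against the highest-rate
    group; ratios below 0.8 warrant investigation."""
    rates = {g: sel / tot for g, (sel, tot) in outcomes.items() if tot > 0}
    best = max(rates.values())
    return {g: rate / best for g, rate in rates.items()}

# Example: group_b selects at 20.3% vs group_a's 30.0%; ratio 0.68 -> flagged
ratios = four_fifths_check({"group_a": (120, 400), "group_b": (61, 300)})
flagged = [g for g, r in ratios.items() if r < 0.8]
```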
Agent and Tool Misuse (17 incidents)
The fastest-growing category in 2026. As agentic deployments mature, the failure surface expands.
- Unauthorized purchases / subscriptions (5 cases). Agents executing purchases beyond user-confirmed scope.
- Email / communication errors (4 cases). Agents sending emails to wrong recipients or with sensitive content.
- Code execution overruns (4 cases). Coding agents modifying files or systems beyond authorized scope.
- Cross-tenant action leakage (2 cases). Multi-tenant deployments where one user’s instructions affected another tenant’s resources. Both patched.
- Other (2 cases). Including one calendar-management agent that scheduled meetings with confidential parties.
Policy-Violating Content Reaching Users (16 incidents)
Content that violated published vendor policies but reached users in production. This category draws heavy media coverage but often not the highest PISS scores; much policy-violating content is offensive without being actionably harmful.
- Eight cases involve consumer-facing chatbots producing content that violated the deployer’s content policy (corporate brand harm).
- Six involve specialized deployments (e.g., children’s education tools) producing age-inappropriate content.
- Two involve deepfake-adjacent content where a chatbot impersonated a real person convincingly enough to mislead users.
Other Categories
The remaining categories — misattribution to real people, autonomous system override failures, cross-user data leakage — are smaller in count but include several of the highest-PISS incidents in the database. The misattribution category is particularly important for media and academic deployments: models confidently asserting that a named real person said something they did not say.
Notable Individual Incidents
We highlight four incidents below to illustrate the range. Each entry includes the PISS, the category, and a brief description. Full incident documentation is in the database.
LPI-2026-0014 — Hospital triage chatbot misclassification cluster
- PISS: 8.7 (Critical)
- Category: Hallucination causing user harm — healthcare
- Sector: Healthcare (US)
- Summary: A consumer-facing symptom-checker integration at a regional health system steered emergency-department-level cases (chest pain with cardiac risk factors) toward urgent care or self-monitoring in 11 documented cases over two months. No deaths were confirmed; three hospitalizations followed delayed presentation.
- Detection: Identified after an ED physician noticed a pattern in incoming patients.
- Status: Tool removed from deployment. Vendor disclosed publicly. Regulatory review ongoing.
- Lesson: High-stakes medical advisory deployments require domain-expert post-deployment monitoring, not just pre-deployment evaluation.
LPI-2025-0089 — Government benefits eligibility errors
- PISS: 7.9 (High)
- Category: Hallucination — public sector
- Sector: Government (EU member state)
- Summary: A pilot deployment of an LLM-assisted benefits eligibility tool produced incorrect denials in an estimated 4–7% of cases over a six-month pilot. Affected applicants were disproportionately recent immigrants and applicants whose situations involved cross-border employment history.
- Detection: External advocacy organization compiled denied-applicant data and identified the pattern.
- Status: Pilot halted. Affected applicants offered review. Member state has issued procedural guidance on AI in benefits adjudication.
- Lesson: Pilots in benefits adjudication require disparate-impact monitoring from day one, not as a post-incident remediation.
LPI-2025-0142 — Fabricated quotes published in major outlet
- PISS: 6.8 (High)
- Category: Misattribution to real people
- Sector: Media
- Summary: A national newspaper published an interview-style article in which several quotes attributed to a named public figure were fabricated by the LLM-assisted drafting tool. The figure publicly disputed the quotes; the outlet retracted the article within 48 hours.
- Detection: Subject of the article identified the fabrication.
- Status: Outlet revised editorial AI policy. Drafting tool deployment paused for review.
- Lesson: Editorial workflows that allow LLM-generated quotes to enter copy without source verification create catastrophic credibility risk; the question is when, not whether.
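The mechanical part of that verification is automatable. A minimal sketch that flags quoted passages appearing verbatim in no source transcript. Matching is deliberately strict; paraphrase still needs a human reporter's judgment:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for verbatim comparison."""
    return " ".join(text.lower().split())

def unverified_quotes(draft: str, transcripts: list[str]) -> list[str]:
    """Return quoted passages in the draft that appear verbatim in no
    transcript: candidates for mandatory human verification."""
    corpus = normalize(" ".join(transcripts))
    quotes = re.findall(r'"([^"]{15,})"', draft)  # ignore trivially short quotes
    return [q for q in quotes if normalize(q) not in corpus]
```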
LPI-2026-0008 — Customer service agent unauthorized refunds
- PISS: 6.4 (High)
- Category: Agent / tool misuse
- Sector: Retail
- Summary: A customer service agent deployment authorized refunds beyond its policy scope after users employed multi-turn pressure tactics. Estimated $1.2M in unauthorized refunds processed before pattern detection. Customers were not at fault; the deployer absorbed the loss.
- Detection: Internal financial reconciliation flagged the anomaly approximately five weeks into the issue.
- Status: Refund authorization removed from agent scope; replaced with supervisor-routing.
- Lesson: Agent action authorities should be designed around financial blast radius, not just task convenience.
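The blast-radius lesson translates directly into deployment architecture: enforce money-moving limits outside the model, so multi-turn pressure on the agent cannot raise them. A minimal sketch with illustrative limits:

```python
PER_REFUND_CAP = 50.00       # illustrative; tune to your loss tolerance
DAILY_CUSTOMER_CAP = 500.00  # total an agent may move per customer per day

def authorize_refund(amount: float, customer_daily_total: float) -> str:
    """Routing decision made by the surrounding system, not the model.
    The agent can request a refund; it can never execute one directly."""
    if amount <= PER_REFUND_CAP and customer_daily_total + amount <= DAILY_CUSTOMER_CAP:
        return "auto_approve"
    return "route_to_supervisor"
```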
The remaining 183 incidents are documented in the database CSV. We have flagged the 23 highest-PISS incidents with notable=true in the database for journalists and researchers seeking primary sources.
Limitations
We document our limitations explicitly because doing so is what makes this report useful as a primary source. Every finding is bounded by the conditions of our methodology.
- Reporting bias is unavoidable. Incidents that produce media coverage are over-represented; quiet incidents that organizations handled internally without disclosure are under-represented. The dataset reflects what is detectable, not what is happening.
- Attribution to specific models is sometimes uncertain. Where the deploying organization has not disclosed the model used, we attribute to “undisclosed frontier model” rather than guess. 23% of incidents in the database fall into this category.
- English-language bias. Our source streams are heavily weighted toward English-language sources. Coverage of incidents in non-English markets is improving but remains a known gap. The Q3 2026 update will expand French, German, Spanish, Mandarin, and Arabic source coverage.
- The PISS score reflects published evidence. When affected-population numbers are uncertain, we use confirmed lower bounds. This biases scores downward in cases where harm extent is not fully documented.
- We do not catalog incidents involving classified, military, or national-security applications. Coverage of these surfaces is left to government bodies with appropriate clearance.
- The tracker is necessarily lagging. An incident occurring this week may not appear until two or more sources confirm it, which can take weeks.
- Cross-incident comparison is approximate. PISS scores are designed for category-level and trend-level comparison, not for ranking individual incidents in fine-grained order.
If you find an error or believe an incident is mischaracterized, contact us. We treat corrections seriously and document them in our errata log.
What This Tracker Is For
For researchers: A verified, scored, machine-readable dataset of LLM production incidents to support academic work. Licensed CC BY 4.0. Cite as: Axis Intelligence Research, “LLM Production Incident Tracker,” 2026.
For policymakers: Aggregate findings useful for AI Act implementation, NIST RMF refinement, and sectoral regulation. The detection-failure distribution is particularly relevant to mandatory incident-reporting design.
For deployers and security teams: Each incident is tagged with a defensive lesson. The “Mitigations” column in the database gives starting points for organizations assessing their own monitoring posture.
For journalists: Verified incidents with full source documentation. The 23 incidents flagged notable=true are vetted for citation.
For vendors: We share findings affecting your products before publication. Disclosure relationships are listed in the methodology document.
How This Tracker Relates to Our Other Research
This tracker is one of three connected research products at Axis Intelligence:
- AI Model Vulnerability Tracker — vulnerabilities reproduced in our lab. Offensive lens.
- LLM Production Incident Tracker — incidents observed in real production deployments. Defensive / observational lens.
- AI Model Compliance Scorecard — public assessment of how each frontier model documents its compliance with major regulations. Governance lens.
The three together are designed to give security teams, policymakers, and researchers a coherent view of LLM risk: what can go wrong (vulnerabilities), what has gone wrong (incidents), and how vendors are positioning against the regulatory environment (compliance).
Update Schedule
- Weekly: New verified incidents added to the database. Per-category statistics updated.
- Monthly: Aggregate analysis refreshed. PISS distribution recomputed.
- Quarterly: Trend analysis, sector deep-dives, methodology review.
- Annual: Full annual report. PISS rubric review based on community feedback.
Data Access
- Incident database (CSV): incident-database.csv
- Methodology document: incident-methodology
- PISS scoring rubric: included in methodology document
- AIID cross-reference: column in the database links each incident to its AIID ID where applicable
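A minimal sketch of working with the database using Python's standard library; the column names (incident_id, piss, notable, aiid_id) are assumptions here, so check the methodology document for the actual schema:

```python
import csv

with open("incident-database.csv", newline="", encoding="utf-8") as f:
    incidents = list(csv.DictReader(f))

# Pull the incidents flagged for journalists and researchers.
notable = [row for row in incidents if row.get("notable") == "true"]
notable.sort(key=lambda row: float(row["piss"]), reverse=True)

for row in notable:
    aiid = row.get("aiid_id") or "no AIID cross-reference"
    print(row["incident_id"], row["piss"], aiid)
```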
The dataset is published under Creative Commons Attribution 4.0 (CC BY 4.0). The PISS rubric is published under the same license. We encourage other researchers to adopt and refine it.
To submit an incident for inclusion, suggest a methodology improvement, or report an error: research@axis-intelligence.com.
Axis Intelligence Research has no commercial relationship with any of the model vendors or deployers covered in this report. Coverage decisions are made independently. We pay for API access at standard published rates.