LLM Production Incident Tracker Methodology

Companion document to the LLM Production Incident Tracker 2026. This document is intended to be detailed enough that a competent third-party researcher could replicate our work.

Version 1.0 · Published April 30, 2026 · Maintained by Axis Intelligence Research


1. Scope and Definitions

1.1 What this tracker catalogs

The LLM Production Incident Tracker catalogs verified real-world incidents involving frontier large language models in production deployments. Specifically, an entry must meet all of the following (a screening sketch follows the list):

  1. Frontier LLM involvement. The incident must involve a model from one of: OpenAI (GPT family), Anthropic (Claude family), Google (Gemini family), Meta (Llama family), Mistral, xAI (Grok family), or major derivatives of these. Cohere, AI21, and other commercial LLMs are eligible when production incidents are documentable.
  2. Production deployment. The model must have been deployed in a real-world product or service used by people other than its developers. Lab demonstrations, red-team exercises, and pre-production testing do not qualify.
  3. Verifiable harm or near-harm. The incident must involve confirmed or reasonably likely harm to a person, organization, or system. We exclude “the model said something dumb” without identifiable harm.
  4. Two independent sources. The incident must be corroborated by at least two independent sources; we do not catalog single-source incidents.
  5. Documentable. The event must be describable in concrete terms — what happened, who was affected, what the consequences were.
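
To make the screening concrete, here is a minimal sketch of the eligibility check. The field names and the family set are illustrative, not our production intake schema:

    from dataclasses import dataclass

    # Families named in section 1.1; "other" stands in for eligible
    # commercial LLMs such as Cohere and AI21.
    FRONTIER_FAMILIES = {"gpt", "claude", "gemini", "llama",
                         "mistral", "grok", "other"}

    @dataclass
    class CandidateIncident:
        model_family: str             # e.g. "gpt"
        in_production: bool           # used by people other than its developers
        harm_documented: bool         # confirmed or reasonably likely harm
        independent_sources: int      # count of independent sources
        concretely_describable: bool  # what happened, who, consequences

    def eligible(c: CandidateIncident) -> bool:
        """All five section 1.1 criteria must hold."""
        return (c.model_family in FRONTIER_FAMILIES
                and c.in_production
                and c.harm_documented
                and c.independent_sources >= 2
                and c.concretely_describable)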

1.2 What this tracker excludes

  • Pre-2024 incidents. The AI Incident Database (AIID) is the appropriate reference for earlier incidents, including pre-LLM machine learning failures.
  • Vision-only systems, autonomous vehicles, robotics. AIID covers these well.
  • Lab-only vulnerabilities. Our AVI Tracker covers these.
  • Hypothetical or theoretical incidents. Red-team findings without production reproduction do not qualify.
  • Classified, military, or national-security applications. Out of scope for this tracker.
  • Single-source claims. Including single-tweet reports, until corroborated.

1.3 Relationship to AIID

We treat the AI Incident Database as a foundational reference and a source stream. Where one of our incidents is also cataloged in AIID, we record the AIID ID in our aiid_cross_ref column. We do not catalog incidents that AIID handles well unless they involve frontier-LLM-specific information that adds value beyond the AIID entry.

We are grateful to the Responsible AI Collaborative for their long-running work on AIID. This tracker is intended as a focused supplement, not a competitor.


2. Source Streams

2.1 Primary source streams

We monitor twelve source streams continuously. Each stream has a defined owner on the research team and a documented review process.

Tier-1 news monitoring (daily). Reuters, AP, NYT, WSJ, FT, BBC, Bloomberg, Le Monde, Süddeutsche Zeitung. We use RSS feeds, paid news APIs where available, and manual review for behind-paywall content.

Specialty tech press (daily). Wired, MIT Technology Review, The Register, 404 Media, The Verge, Ars Technica, IEEE Spectrum.

Court filings (weekly). US federal courts via PACER, EU Court of Justice records, UK judiciary records. We use targeted search terms (“artificial intelligence,” “large language model,” “ChatGPT,” “Claude,” “fabricated citations”) and review hits.

Regulatory filings (weekly). FTC, FCC, EU AI Office, ICO, CNIL, FDA, FINRA, SEC, EEOC. We monitor public dockets and announcements.

Academic preprints (weekly). arXiv (cs.CL, cs.AI, cs.CY, cs.HC), SSRN. We have a curated set of search terms and review new submissions.

Vendor disclosures (continuous). Direct monitoring of vendor security advisories, post-mortems, and transparency reports.

AI Incident Database sync (daily). Automated check for new entries; manual review for relevance to our scope.

Bug bounty disclosures (weekly). Public Bugcrowd and HackerOne disclosures for AI-related findings.

Security research blogs (daily). A curated list of approximately 40 specialty AI security researcher publications.

User-submitted reports (continuous). Submissions through our research page form. All require verification before inclusion.

Reddit / Hacker News (daily). Used for discovery only, never as a primary source. Items surfaced here are routed through the other streams for verification.

Internal red-team work (continuous). Lab-discovered vulnerabilities that subsequently appear in production trigger evaluation for incident-tracker inclusion.
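
For replication purposes, the twelve streams reduce to a small monitoring configuration. The sketch below records only cadence; in practice each entry also carries its owner and review checklist:

    SOURCE_STREAMS = {
        "tier1_news":         "daily",
        "specialty_press":    "daily",
        "court_filings":      "weekly",
        "regulatory_filings": "weekly",
        "academic_preprints": "weekly",
        "vendor_disclosures": "continuous",
        "aiid_sync":          "daily",
        "bug_bounty":         "weekly",
        "security_blogs":     "daily",
        "user_reports":       "continuous",
        "reddit_hn":          "daily",       # discovery only, never primary
        "internal_redteam":   "continuous",
    }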

2.2 Source quality tiers

Not all sources are equivalent. We weight sources by tier:

  • Tier A: Primary sources — court records, regulatory filings, vendor official statements, peer-reviewed papers, organizational post-mortems. A single Tier A source can establish a fact (though we still require corroboration for the incident overall).
  • Tier B: Established journalism with named bylines, editorial review, and reputation for accuracy.
  • Tier C: Specialty press, established research blogs, named researchers with track records.
  • Tier D: Anonymous or unverified sources, social media, single-source aggregator coverage.

For each incident, we require either two Tier A/B sources or three sources total including at least one Tier A/B.
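
This rule is mechanical enough to express directly. A minimal sketch, assuming each source is tagged with its tier letter:

    def sufficiently_corroborated(tiers: list[str]) -> bool:
        """Two Tier A/B sources, or three sources total
        including at least one Tier A/B (section 2.2)."""
        strong = sum(t in ("A", "B") for t in tiers)
        return strong >= 2 or (len(tiers) >= 3 and strong >= 1)

For example, a court record plus a named-byline news story passes; three Tier C/D sources alone do not.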


3. Verification Process

3.1 The verification queue

When a candidate incident enters our intake from any source stream, it goes into the verification queue. The queue has four states:

  1. Pending verification. Initial intake, awaiting source corroboration.
  2. Under verification. A researcher is actively investigating.
  3. Verified — pending review. Sources confirmed; incident drafted; awaiting peer review by a second researcher.
  4. Published. Approved by reviewer; entered into database.

Approximately 40% of intake items never reach Published — they fall out at the source-corroboration step or fail peer review.
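
The queue is effectively a small state machine. A sketch of the allowed transitions, with the drop-out path made explicit:

    from enum import Enum, auto

    class QueueState(Enum):
        PENDING = auto()             # pending verification
        UNDER_VERIFICATION = auto()
        PENDING_REVIEW = auto()      # verified, awaiting second researcher
        PUBLISHED = auto()
        DROPPED = auto()             # failed corroboration or peer review (~40% of intake)

    ALLOWED_TRANSITIONS = {
        QueueState.PENDING: {QueueState.UNDER_VERIFICATION, QueueState.DROPPED},
        QueueState.UNDER_VERIFICATION: {QueueState.PENDING_REVIEW, QueueState.DROPPED},
        QueueState.PENDING_REVIEW: {QueueState.PUBLISHED, QueueState.DROPPED},
        QueueState.PUBLISHED: set(),  # corrections go through section 7, not the queue
    }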

3.2 The verification rubric

For each candidate, the assigned researcher answers the following (a structured sketch follows the list):

  • Did this happen? Multiple independent sources confirming the basic facts.
  • What model was involved? Direct attribution where possible; “undisclosed frontier model” where not.
  • Who was affected? Specific count where documented; estimated range where not.
  • What was the harm? Concrete description, not generalization.
  • How was it detected? Mechanism that surfaced the incident.
  • What was the response? Vendor and deployer actions, with timestamps.
  • What is the current status? Patched, mitigated, under review, disputed, etc.
  • What is the lesson? Defensive recommendation derived from the incident.
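
In structured form, an entry is draft-complete only when every rubric question has an answer. An illustrative record shape, not our production schema:

    from dataclasses import dataclass, fields

    @dataclass
    class RubricAnswers:
        happened: str | None = None   # basic facts, multiply sourced
        model: str | None = None      # or "undisclosed frontier model"
        affected: str | None = None   # documented count or estimated range
        harm: str | None = None       # concrete description
        detection: str | None = None  # mechanism that surfaced the incident
        response: str | None = None   # vendor/deployer actions, with timestamps
        status: str | None = None     # patched, mitigated, under review, disputed
        lesson: str | None = None     # defensive recommendation

    def draft_complete(r: RubricAnswers) -> bool:
        return all(getattr(r, f.name) is not None for f in fields(r))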

3.3 Peer review

Before publication, each entry is reviewed by a second researcher who did not perform the original verification. Review covers:

  • Source quality and adequacy
  • Accurate categorization
  • PISS scoring justification
  • Lesson alignment with evidence

Reviewer disagreements are escalated to a senior researcher for adjudication. Inter-rater agreement on PISS scoring for the inaugural cycle: κ = 0.84, indicating strong agreement.
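
For replicators: the agreement statistic is Cohen's kappa over paired ratings. A minimal sketch, assuming each rater's composite scores are first binned into the section 3.4 severity bands before comparison:

    from collections import Counter

    def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
        """kappa = (p_o - p_e) / (1 - p_e)."""
        n = len(rater_a)
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
        ca, cb = Counter(rater_a), Counter(rater_b)
        p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)           # chance agreement
        return (p_o - p_e) / (1 - p_e)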

3.4 Vendor and deployer notification

For incidents we believe affect specific vendors or deployers, we notify them before publication. The notification window varies with severity (see the sketch after this list):

  • Critical (PISS ≥ 8.0): 30-day pre-publication notification.
  • High (PISS 6.0–7.9): 14-day notification.
  • Medium and Low: 7-day notification or concurrent.
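
The window is a direct function of the composite score. A sketch:

    def notification_days(piss: float) -> int:
        """Pre-publication notice per section 3.4; for Medium and Low,
        concurrent notification is also acceptable."""
        if piss >= 8.0:
            return 30  # Critical
        if piss >= 6.0:
            return 14  # High
        return 7       # Medium / Low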

Where a vendor disputes a finding, we record the dispute in the database with a disputed status and document the vendor’s position. Where a vendor identifies a factual error, we correct it.


4. The Production Incident Severity Score (PISS)

The PISS is detailed in the main report. This section documents the operational definitions used by raters.

4.1 Affected Population (0.0–3.0)

The number of people who experienced or were materially exposed to harm from the incident.

  • “Materially exposed” includes people whose data was inappropriately accessible, who received substantively wrong information they could have acted on, or who were subject to a discriminatory decision.
  • Where exact numbers are not documented, we use the lower bound of confirmed evidence rather than estimating upward.
  • For incidents involving critical infrastructure (energy grid, transportation, healthcare systems), we score at 3.0 regardless of actual person-count, reflecting the systemic nature of the affected surface.
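
The band thresholds that map person-counts to scores are defined in the main report. Given those bands, the rule above reduces to a sketch like this:

    from typing import Callable

    def affected_population_score(confirmed_lower_bound: int,
                                  critical_infrastructure: bool,
                                  bands: Callable[[int], float]) -> float:
        """`bands` maps a person-count to a 0.0-3.0 score; it is
        supplied from the main report, not reproduced here."""
        if critical_infrastructure:
            return 3.0  # systemic surface overrides person-count
        # Conservative: score the confirmed lower bound, never an upward estimate.
        return bands(confirmed_lower_bound)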

4.2 Harm Severity (0.0–3.0)

The seriousness of the worst documented or reasonably attributable harm.

  • “Reasonably attributable” requires a defensible causal chain. We do not attribute distant downstream harms to LLM incidents without clear linkage.
  • Death and serious physical injury score 3.0; we have one incident in our dataset where physical harm was alleged but not confirmed, and we scored it conservatively pending further evidence.
  • Financial-harm thresholds are specified in nominal USD; we adjust for purchasing power parity in non-USD jurisdictions.

4.3 Detection Failure (0.0–3.0)

How the incident was detected. Higher scores reflect worse detection.

  • The intent of this dimension is to surface organizational gaps. An incident detected by automated monitoring before user impact is fundamentally a different incident from one that emerged via media coverage of harm.
  • For incidents detected by multiple mechanisms, we score the worst (highest) detection-failure path.

4.4 Vendor / Deployer Response (0.0–1.0)

The quality of the response to the incident, focused on speed and substance.

  • “Vendor” here refers to the LLM provider; “deployer” refers to the organization that deployed the LLM. Where these differ, we score the deployer’s response in the public-incident context but note vendor response separately when disclosure-relevant.
  • A 0.0 score (best) reflects incidents where the deployer caught the issue before user impact and disclosed proactively. We have only a handful of such cases — most pre-impact catches don’t become incidents at all.

4.5 Computing the composite

PISS = Affected Population + Harm Severity + Detection Failure + Vendor/Deployer Response (composite range 0.0–10.0)

Where there is uncertainty in any sub-score, we use the lower of the plausible values, which yields a conservative composite. We document the rationale for each sub-score in the database’s scoring_notes field (not shown in the public CSV summary; available in the extended dataset).
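
Putting the dimensions together with the lower-of-plausible-values rule (the numbers in the example are illustrative only):

    def conservative_piss(ranges: dict[str, tuple[float, float]]) -> float:
        """Composite from (low, high) plausible values per dimension;
        under uncertainty we take each dimension's lower value."""
        return sum(low for low, _high in ranges.values())

    example = conservative_piss({
        "affected_population": (2.0, 2.5),
        "harm_severity":       (1.5, 1.5),
        "detection_failure":   (3.0, 3.0),
        "vendor_response":     (0.5, 1.0),
    })  # 7.0 -> High band, 14-day notification under section 3.4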


5. Statistical Methodology

5.1 Aggregate counts

Headline counts in the report (e.g., “187 incidents” or “62% involved generative output failures”) are simple counts and proportions. We do not impose a confidence interval on counts; they reflect what we have cataloged, not an estimate of the underlying population.

5.2 Reporting bias acknowledgment

This is a reported-incident dataset, not a population estimate. Several biases apply:

  • English-language bias. Source streams skew English-speaking.
  • Media-prominence bias. Incidents that make headlines are over-represented.
  • Vendor-disclosure bias. Vendors with stronger disclosure practices appear more often in our data than those with weaker practices, which likely inverts the true underlying relationship.
  • Detection bias. Well-detected incidents are easier to catalog; quietly handled incidents are not.

We do not attempt to correct for these biases in headline statistics. We document them so readers can adjust their interpretation accordingly.

5.3 Trend analysis

When we report year-over-year trends (planned for the 2027 report), we will use the 2024 baseline cautiously, since early-2024 entries were retrospectively reconstructed under a less mature intake methodology. We expect the 2025 and 2026 figures to be more comparable.


6. Ethical Framework

6.1 Privacy and named individuals

Where an incident involved an identifiable person who was harmed, we describe the harm and category but do not name the person unless:

  • They have publicly discussed the incident themselves
  • They are a public figure whose role is central to the incident
  • A court filing has named them

We follow journalistic practice: we name perpetrators of misconduct and public officials acting in their public capacity; we do not name private individuals who were harmed.

6.2 Vendor and deployer naming

We name vendors when the model is identified. We name deployers based on the following logic:

  • If the deployer has publicly disclosed the incident, we name them.
  • If court records, regulatory filings, or major media have named them, we name them.
  • If naming the deployer is necessary to make the incident understandable, we name them.
  • Otherwise, we describe the deployer category (e.g., “regional hospital system,” “national newspaper”) without naming.

6.3 Disclosure timing

We give vendors and deployers reasonable notice before publication. We do not negotiate findings — we will correct factual errors but not soften characterizations. Pre-publication notification exists to allow for correction and response, not for editorial influence.

6.4 Conflict of interest

Same as our AVI Tracker methodology: no commercial relationships with any vendor; standard API access at standard rates; researchers do not work on incidents involving former employers without independent re-verification by a second researcher.


7. Errata and Corrections

Errors are inevitable in a continuously updated database. Our process:

  1. Error report received (from any source, including vendors and the public).
  2. Investigated by a researcher not involved in the original entry.
  3. If correction is warranted, the entry is updated with a Corrected: [date] flag and the change is logged.
  4. Corrections are summarized in the quarterly errata log.

We have not yet issued corrections in the inaugural release, but we expect to. We treat error correction as a sign of methodological health, not embarrassment.


8. How to Use This Database

8.1 For researchers

Cite as: Axis Intelligence Research. (2026). LLM Production Incident Tracker 2026. Q1 2026 release.

If you use the PISS scoring rubric in your own work, please reference this methodology document. We are interested in feedback that would improve the rubric.

8.2 For policymakers

The aggregate findings in section “Distribution by detection mechanism” of the main report are particularly relevant for incident-reporting regulation design. The detection-failure distribution is an empirical input to questions about mandatory reporting thresholds.

8.3 For deployers and security teams

Each incident’s mitigations field is a starting point. Filter by your sector and deployment pattern to find the most relevant precedents. We are happy to consult on specific deployment scenarios on a non-commercial basis (research@axis-intelligence.com).
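
A minimal example against the public CSV with pandas. The file name and the sector, deployment_pattern, and incident_id columns are assumptions for illustration (mitigations and notable are named in this document); check the release's data dictionary for the actual schema:

    import pandas as pd

    df = pd.read_csv("llm_incident_tracker_2026.csv")  # file name assumed
    relevant = df[(df["sector"] == "healthcare")                          # assumed column
                  & (df["deployment_pattern"] == "customer_facing_chat")  # assumed column
                  & (df["notable"].astype(str).str.lower() == "true")]
    for _, row in relevant.iterrows():
        print(row["incident_id"], "->", row["mitigations"])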

8.4 For journalists

The notable=true flag indicates incidents we have vetted as sufficiently documented for citation. We are available to verify findings on background.


9. Cross-Reference With Other Resources

This tracker complements rather than replaces existing resources; researchers should consult the AI Incident Database (section 1.3) for pre-2024 and non-LLM incidents and our AVI Tracker (section 1.2) for lab-discovered vulnerabilities.

Our work is downstream of these foundations. We have tried to add value through narrowness (frontier LLMs only), recency (2024 onward), attribution (model-level where possible), severity scoring (PISS), and a defensive lens (a lesson per incident). We have not tried to replace what already exists and works well.


10. Contact

  • Research inquiries: research@axis-intelligence.com
  • Incident submissions: submit@axis-intelligence.com (subject to verification)
  • Vendor and deployer disclosure: disclosure@axis-intelligence.com
  • Errata and corrections: errata@axis-intelligence.com
  • Press inquiries: press@axis-intelligence.com