GPT-5.4 Is Here with 1 Million Tokens — Here’s What Actually Changes

[Figure: GPT-5.4 context window size comparison with Claude Opus 4.6 and Gemini 3.1 Pro (March 2026)]

Published: March 8, 2026 | Author: Axis Intelligence Staff
Category: AI | Reading time: ~10 min

OpenAI just shipped GPT-5.4 on March 5, 2026 — and for once, the headline number undersells the real story. A 1-million-token context window sounds like an incremental spec bump. It isn’t. Combined with native computer use, a 33% reduction in hallucinations, and a new tool-search architecture that cuts token costs by 47%, GPT-5.4 is the first model that can plausibly replace a human knowledge worker for an entire multi-hour workflow — not just a single task.

The question isn’t whether GPT-5.4 is impressive. It is. The question is what it actually changes for the people using AI today — and what it means for the companies building on top of it.

Here’s the honest answer, based on everything published in the 72 hours since launch.


The Context War Is Over — OpenAI Finally Showed Up

For the past twelve months, the most embarrassing gap in OpenAI’s flagship lineup was context length. While Google’s Gemini 3 Pro shipped with a production-ready 1-million-token window and Anthropic’s Claude Opus 4.6 hit the same milestone in beta, GPT-5.2 was stuck at 400,000 tokens. That’s not a rounding error. It’s the difference between loading a full enterprise codebase and loading a fragment of one.

GPT-5.4 closes that gap — and then some. The API and Codex versions now support up to 1.05 million tokens of context (technically 922K input, 128K output). That matches Google in raw numbers and is roughly 5x Anthropic’s standard, non-beta context window.

But there’s a catch worth knowing. The 1-million-token window is currently experimental in Codex and the API. In ChatGPT itself — the version most non-developers use — the context window for GPT-5.4 Thinking remains unchanged from GPT-5.2 Thinking. If you’re expecting to paste 750 pages of documents into the chat interface and have a conversation about all of them, you’ll need to use the API or wait for broader rollout.

The pricing structure also reflects the scale. Prompts exceeding 272K input tokens are billed at 2x the standard input rate and 1.5x output for the full session. So while the capability is real, the economics of very long context aren’t free.
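To make the surcharge concrete, here is a minimal sketch of that billing rule, assuming the standard rates quoted later in this article ($2.50/$20 per million input/output tokens); the 272K threshold and the 2x/1.5x multipliers come from the pricing note above.

```python
# Sketch of the long-context billing rule described above.
# Rates are the standard GPT-5.4 API prices quoted later in this article;
# the 272K threshold and 2x/1.5x multipliers are from OpenAI's pricing note.

INPUT_RATE = 2.50 / 1_000_000    # USD per input token (standard)
OUTPUT_RATE = 20.0 / 1_000_000   # USD per output token (standard)
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate one session's cost under the long-context surcharge.

    If the prompt exceeds 272K input tokens, the full session is billed
    at 2x the input rate and 1.5x the output rate.
    """
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        in_mult, out_mult = 2.0, 1.5
    else:
        in_mult, out_mult = 1.0, 1.0
    return (input_tokens * INPUT_RATE * in_mult
            + output_tokens * OUTPUT_RATE * out_mult)

# A 900K-token prompt with a 50K-token answer:
# input 900_000 * 2.5e-6 * 2 = $4.50, output 50_000 * 20e-6 * 1.5 = $1.50
print(f"${session_cost(900_000, 50_000):.2f}")  # $6.00
print(f"${session_cost(200_000, 50_000):.2f}")  # $1.50
```

The same prompt-plus-answer pair costs four times as much above the threshold as below it, which is why long-context sessions are worth budgeting for explicitly.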

What a Million Tokens Actually Means in Practice

Most articles about context windows default to abstract comparisons — “equivalent to 750,000 words” or “the length of 10 novels.” Those numbers are accurate but don’t capture why this matters operationally.

Here’s what changes at 1 million tokens that wasn’t possible at 400,000:

Whole-codebase reasoning. A mid-sized production codebase — roughly 100,000 lines of code, which is on the order of 1 million tokens, across hundreds of files — now fits in a single context. Instead of chunking, summarizing, and losing inter-file relationships, an agent can reason across the entire system simultaneously. GitHub’s Chief Product Officer Mario Rodriguez said in OpenAI’s launch materials that GPT-5.4 “performs exceptionally well at logical reasoning and executing intricate, multi-step, tool-dependent workflows” — and a unified codebase context is exactly the environment where that matters most.

Full document stacks. Legal due diligence often involves hundreds of contracts. Medical research requires synthesizing dozens of clinical papers. Financial modeling involves cross-referencing quarterly reports across years. At 1 million tokens, these entire document sets can be passed as a single context, eliminating the retrieval-augmented generation workarounds that introduced errors and latency.

Long-horizon agent sessions. An agent running a multi-hour workflow — researching, drafting, editing, then filing a report — generates a long conversation history. At 400K tokens, that history would overflow, forcing session resets and losing state. At 1 million tokens, the agent can maintain continuity across a full workday of autonomous operation.

The practical benchmark that best captures this shift is MRCR v2, which tests 8-needle retrieval inside a 1-million-token context. Claude Opus 4.6 scored 76% on that benchmark in beta — meaning it can find and synthesize information scattered across an entire enormous document with reasonable reliability. GPT-5.4’s performance on this specific benchmark hasn’t been published yet, which is worth flagging. Context size and context quality are different things. Gemini 3.1 Pro, which has offered production-grade 1-million-token context the longest, has established the most real-world track record at this scale.

The Benchmark Convergence Nobody’s Talking About

Here’s what the launch coverage largely missed: the three frontier models — GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro — are now scoring within a few percentage points of each other on most major intelligence benchmarks. Artificial Analysis currently ranks GPT-5.4 (xhigh reasoning) and Gemini 3.1 Pro Preview tied at an Intelligence Index score of 57. Claude Opus 4.6 at maximum effort sits just behind at 53.

That convergence is the real story of early 2026. The frontier has effectively plateaued on raw capability metrics. What differentiates these models now isn’t intelligence — it’s architecture, cost, and what they’re purpose-built to do well.

GPT-5.4 leads on knowledge-work tasks (83% on OpenAI’s new GDPval benchmark across 44 professional occupations) and computer use (75% on OSWorld). Claude Opus 4.6 leads on coding precision (80.8% SWE-Bench) and web research (84% BrowseComp). Gemini 3.1 Pro leads on abstract reasoning (77.1% ARC-AGI-2) and cost-efficiency ($2/$12 per million tokens input/output versus GPT-5.4’s $2.50/$20).

If you were expecting GPT-5.4 to be a decisive, across-the-board winner, the honest answer is: it depends entirely on your use case. And that’s a more interesting reality than any clean “this model wins” headline.

The Bigger Deal Than the Context Window: Native Computer Use

If the 1-million-token context window is the headline, native computer use is the actual shift that changes what AI can do in the world.

GPT-5.4 is OpenAI’s first general-purpose model capable of operating a computer the way a human does — navigating file directories, clicking through browser interfaces, pulling real-time data from websites, executing workflows across disparate applications. It scored 75% on the OSWorld-Verified benchmark for computer use, currently a state-of-the-art result among publicly benchmarked models.

To be precise: this capability is available in the API and Codex, not in the consumer ChatGPT interface. OpenAI frames it as targeted at developers building autonomous agents, not at everyday users. But the downstream effect on how software gets built — and how much human labor gets automated — is immediate and significant.

What “Computer Use” Means for Real Workflows

The standard use case examples OpenAI and Microsoft have published include document drafting, spreadsheet modeling, and code generation. Those are real. But they’re also the safe, enterprise-friendly framing. The more revealing picture comes from what the benchmark scores actually measure.

OSWorld-Verified tests an agent’s ability to complete tasks on a real desktop operating system — opening files, navigating applications, completing multi-step sequences without human intervention. A 75% score means GPT-5.4 completes three-quarters of those tasks correctly without a human in the loop.

For context: twelve months ago, no publicly available model could complete those tasks reliably at all. The jump from “model that writes code” to “model that operates the computer to run that code, check the output, and update the document” isn’t a marginal improvement. It’s a category change.

Microsoft’s Azure Foundry documentation frames this directly: GPT-5.4 was designed for “longer, more complex workflows where consistency and follow-through become as important as raw intelligence.” The phrasing is telling. The competition isn’t other AI models anymore — it’s the human knowledge workers who currently provide that consistency and follow-through.

The Tool Search Architecture That Makes Agents Economically Viable

The context window and computer use get the attention, but there’s a third structural change in GPT-5.4 that matters as much for anyone building AI agents: Tool Search.

Previously, when an AI agent was given access to a set of tools — APIs, databases, external services — all of the tool definitions had to be included in every prompt upfront. For a typical enterprise deployment with dozens of integrations, this meant tens of thousands of tokens of overhead on every single request. At scale, that overhead became prohibitively expensive.

GPT-5.4 flips this model. Instead of loading all tool definitions upfront, the model receives a lightweight index of available tools. When it actually needs a tool, it looks up the full definition at that moment and appends it. OpenAI tested this against 250 tasks from Scale’s MCP Atlas benchmark with all 36 MCP servers enabled. The tool-search configuration reduced total token usage by 47% with no measurable accuracy loss.

A 47% reduction in token overhead isn’t a convenience feature — for enterprise deployments running millions of agent tasks per month, it’s the difference between a financially viable system and one that burns the budget before it generates value.
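The lookup-on-demand pattern above can be sketched in a few lines. Everything here is illustrative — the registry, function names, and schemas are hypothetical, not OpenAI’s actual API surface — but it shows where the savings come from: only names and one-line descriptions travel in every prompt, and a full schema is appended only when a tool is actually invoked.

```python
# Hypothetical registry mapping tool name -> full definition. In a real
# deployment these schemas are large; that bulk is what tool search keeps
# out of the prompt until the moment a tool is needed.
FULL_DEFINITIONS = {
    "crm_lookup": {
        "name": "crm_lookup",
        "description": "Fetch a customer record by ID.",
        "parameters": {"customer_id": "string"},
    },
    "send_invoice": {
        "name": "send_invoice",
        "description": "Create and email an invoice.",
        "parameters": {"customer_id": "string", "amount_usd": "number"},
    },
}

def lightweight_index() -> list[dict]:
    """What goes in every prompt: just names and one-line descriptions."""
    return [
        {"name": name, "description": d["description"]}
        for name, d in FULL_DEFINITIONS.items()
    ]

def resolve_tool(name: str) -> dict:
    """Called only when the model asks for a tool: load the full schema."""
    return FULL_DEFINITIONS[name]

index = lightweight_index()
print(index[0])  # only name + description are loaded upfront
```

With dozens of integrations, the index stays a few hundred tokens while the full definitions run to tens of thousands — the gap OpenAI’s 47% figure is measuring.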

The Hallucination Numbers (And Why 33% Still Isn’t Zero)

OpenAI reports GPT-5.4 is their most factual model to date: individual claims are 33% less likely to be false compared to GPT-5.2, and full responses are 18% less likely to contain errors. Those are meaningful improvements for professional use cases where accuracy matters.

But the framing deserves scrutiny. “33% less likely to be false” sounds significant. It means if GPT-5.2 had, say, a 9% rate of false claims, GPT-5.4 has a 6% rate. For high-stakes documents — legal briefs, financial models, medical summaries — a 6% error rate still requires mandatory human review. One independent developer testing the model in agentic workflows noted that GPT-5.4 occasionally executed tasks incorrectly without flagging the error. That pattern — confident wrong execution — is the failure mode that actually matters in autonomous workflows.

The 83% GDPval score is the most impressive benchmark in the launch package. OpenAI’s GDPval benchmark measures model performance on real knowledge-work tasks across 44 professional occupations — actual deliverables like spreadsheets, slide decks, and legal briefs created by experienced professionals. Scoring 83% means GPT-5.4 matches or exceeds professional-level output on the majority of those tasks. That’s a genuine capability milestone.

It also means 17% of those tasks produce outputs that fall below the professional threshold. In knowledge work, that’s not an edge case — that’s a structural workflow requirement for human oversight on roughly one in six outputs.

The right mental model for GPT-5.4 in enterprise contexts isn’t “this replaces the analyst.” It’s “this handles the first draft on 83% of tasks, and an analyst reviews everything.” That still represents a profound shift in leverage — but it’s not the fully autonomous future that some of the launch coverage implied.

The Three Industries GPT-5.4 Disrupts First

The combination of a 1-million-token context window, computer use, and 83% professional-task accuracy doesn’t affect all industries equally. Three sectors are in the direct line of fire.

Legal

Legal work is almost perfectly designed for 1-million-token context. A complex M&A due diligence process involves reviewing hundreds of contracts, financial filings, regulatory documents, and correspondence — often totaling well over 500,000 tokens. The old AI workflow involved chunking those documents, processing them in batches, and manually assembling the results. Every chunking step introduced the possibility of missing cross-document relationships.

At 1 million tokens, a single GPT-5.4 agent can ingest the full document stack, identify contradictions between contracts, flag risk clauses, and produce a structured due diligence report in a continuous workflow. The GDPval benchmark explicitly includes legal briefs as one of its 44 professional task categories — and GPT-5.4 matched or exceeded professional performance on the majority of them.

The firms that move fastest here gain a structural cost advantage. LexisNexis, one of the legal industry’s largest AI infrastructure providers, confirmed a major data breach in early March 2026 — an unrelated but telling sign that legal data management is becoming a high-value target as AI capability in this domain accelerates.

Finance

OpenAI’s timing on the launch wasn’t coincidental. Alongside GPT-5.4, they announced ChatGPT for Excel and Google Sheets in beta — an embedded version that can build, analyze, and update complex financial models directly inside spreadsheets. They also launched new integrations with FactSet, MSCI, Third Bridge, and Moody’s, designed to let teams pull market data, company data, and internal data into a single AI workflow.

For financial analysts, this is the workflow that matters: pulling data from multiple sources, reconciling it, running sensitivity analyses, and producing client-ready output. GPT-5.4 can now do all of that in a single session, with tool search ensuring that each data connector is invoked efficiently rather than loaded upfront.

The economic pressure on mid-level analyst roles is real. McKinsey, which employs 40,000 human consultants, revealed in early 2026 that it now runs 25,000 AI agents alongside its human workforce — a number that had been just 3,000 agents eighteen months earlier. That’s not a warning sign on the horizon. It’s already happening at scale.

Software Development

GPT-5.4 incorporates the coding engine from GPT-5.3-Codex, and the combined result is the most capable coding model OpenAI has shipped to date. On SWE-Bench Pro (which measures performance on real-world software engineering tasks), GPT-5.4 scores 57.7%. On Toolathlon, which tests agentic tool use, it leads the field at 54.6% versus Claude Sonnet 4.6’s 44.8%.

The /fast mode in Codex delivers up to 1.5x faster token velocity on GPT-5.4, keeping development iteration times competitive with specialized coding models.

The practical impact: a software development team using GPT-5.4 through Codex can now load an entire production codebase as context, have the agent identify the relevant files, make changes across multiple files simultaneously, run the code to verify output, and document the changes — all without a developer manually navigating the process. Block, the payments company led by Jack Dorsey, announced a 40% headcount reduction in early March 2026, with AI productivity cited as a primary driver. Block’s stock jumped over 20% on the announcement.

That market reaction tells you what investors think about the economics of AI-assisted engineering at scale.


Who Gets Hurt, Who Gets Leverage

The economic impact of GPT-5.4 isn’t uniformly distributed. It creates significant leverage for some and structural pressure for others.

Gains leverage: Senior professionals who can direct and validate AI output, and who understand where the 17% failure rate is most likely to occur. Domain experts who can use the 1-million-token context to run analyses that were previously impossible in real time. Small teams at startups who can now compete with larger teams on knowledge-work output without equivalent headcount.

Faces structural pressure: Mid-level knowledge workers whose primary value is executing well-defined processes — document review, financial modeling, code generation, research synthesis. These are exactly the task categories where GPT-5.4’s 83% GDPval score applies. A DeepL study published around GPT-5.4’s launch found that 69% of executives worldwide expect AI agents to significantly change their business processes in 2026. A separate Anthropic-commissioned survey found that 57% of companies are already using AI agents for multi-step workflows, with 81% planning to increase complexity by year-end.

The open question: Whether the productivity gains translate into hiring slowdowns, role compression, or outright displacement depends on each company’s choices — and on how fast regulatory frameworks catch up with what these systems can now do autonomously.


The Competitive Picture After GPT-5.4

GPT-5.4 is the best all-around frontier model OpenAI has ever shipped. It’s not the best model at everything. Here’s the honest state of the competition as of March 8, 2026:

GPT-5.4 leads on knowledge-work (83% GDPval), computer use (75% OSWorld), agentic tool use (54.6% Toolathlon), and professional document production. It’s the strongest choice for enterprise workflows that require a model to operate autonomously across multiple tools and applications.

Claude Opus 4.6 from Anthropic leads on coding precision (80.8% SWE-Bench) and web research (84% BrowseComp). It’s the better choice for teams who need nuanced debugging, complex architectural changes, or writing quality that reads like a human expert. Anthropic is also navigating a politically complicated period — the US government began pulling Anthropic contracts from multiple agencies in early March following a Pentagon dispute over weapons targeting use cases — but the technical capability isn’t in question.

Gemini 3.1 Pro from Google leads on abstract reasoning (77.1% ARC-AGI-2), science benchmarks (94.3% GPQA Diamond), and cost efficiency. At $2/$12 per million tokens input/output versus GPT-5.4’s $2.50/$20, Gemini is roughly 40% cheaper for equivalent intelligence scores. It’s the strongest choice for high-volume API workloads where cost-per-task matters more than maximum performance on any single task. Gemini also still leads on aggregate context handling, with a 2-million-token window that handles video and audio natively.

The convergence at the frontier is real. The Artificial Analysis Intelligence Index ranks GPT-5.4 (xhigh) and Gemini 3.1 Pro Preview tied at 57, with Claude Opus 4.6 at 53. For most enterprise workloads, the right answer in 2026 isn’t “which single model do we use” — it’s “how do we route tasks to the right model at the right cost.”

ChatGPT’s market position is also worth noting: between August 2025 and February 2026, its US daily active user share fell from 57% to 42%. Google Gemini doubled to 25% over the same period. Claude tripled to 4%. GPT-5.4 is OpenAI’s most important competitive response to that erosion — and whether it succeeds will show up in those usage-share numbers over the next 90 days more than in any benchmark.

What This Means for You Right Now

Whether you’re a developer, a business user, or someone paying $20/month for ChatGPT Plus, GPT-5.4 changes your situation differently. Here’s the honest breakdown.

If You Use ChatGPT on a Paid Plan

GPT-5.4 Thinking is available now to Plus, Team, and Pro subscribers. You’ll see it in the model picker, replacing GPT-5.2 Thinking as the default. The practical difference in daily use: better performance on complex reasoning, transparent thinking chains you can review and redirect mid-response, and improved deep web research on specific queries.

What you don’t get on the consumer tier: the 1-million-token context window. That remains API/Codex-only for now. So if you’re a ChatGPT Plus user who was hoping to paste an entire large document stack into the chat, you’ll need to wait.

GPT-5.2 Thinking remains available under Legacy Models for three months, retiring June 5, 2026. If you’ve built workflows around specific GPT-5.2 Thinking behaviors, you have a 90-day transition window.

If You’re Building on the API

GPT-5.4 is live at $2.50 per million input tokens and $20 per million output tokens (standard). Cached input is $0.625 per million. The 1-million-token context is available with experimental support in Codex; configure it via model_context_window and model_auto_compact_token_limit parameters.
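The two configuration keys above can be pictured as a Codex-style config fragment. The key names come from the text; the file location and the specific values are assumptions for illustration, and the tiny parser exists only to sanity-check the fragment.

```python
# Hypothetical Codex config fragment using the two knobs named above.
# The key names are from the article; the file path and values are
# illustrative assumptions, not documented defaults.
config_fragment = """\
# ~/.codex/config.toml (location assumed)
model_context_window = 1050000           # opt in to the ~1M experimental window
model_auto_compact_token_limit = 900000  # compact history before hitting the cap
"""

def parse_fragment(text: str) -> dict[str, int]:
    """Minimal parser for the two integer keys, to sanity-check the fragment."""
    out = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            out[key] = int(value)
    return out

print(parse_fragment(config_fragment))
```

Setting the auto-compact limit below the window gives the agent headroom to summarize its own history before a long session overflows.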

The most important architectural shift: implement Tool Search. OpenAI’s benchmark showed a 47% reduction in token usage with no accuracy loss when tools are placed behind the tool search layer rather than loaded upfront. For any production deployment with more than a handful of integrations, this alone can meaningfully reduce your monthly cost.

The computer use capability is available through the updated computer tool in the API. OpenAI’s documentation recommends high image detail settings for improved click accuracy on desktop interfaces — particularly relevant if you’re building agents that need to interact with visual UIs.

For teams deciding between GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro: the right move is staged evaluation, not immediate migration. Treat GPT-5.4 as a candidate to test against your specific workloads, not an automatic replacement. Route a controlled share of traffic to the new model, run evals against your domain benchmarks, and promote it only after it clears your quality gates. The models are close enough in capability that cost structure and latency profile may be the deciding factor for your use case.
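The staged-rollout advice above can be sketched as a simple router plus quality gate. The model names, traffic share, and gate threshold are placeholders for your own stack and evals, not a prescribed setup.

```python
# Sketch of staged evaluation: route a fixed slice of traffic to the
# candidate model, score both arms on your own evals, and promote only
# once the candidate clears a quality gate. All names are placeholders.
import random

CANDIDATE_SHARE = 0.10  # fraction of traffic sent to the new model
QUALITY_GATE = 0.95     # candidate must reach 95% of the incumbent's score

def route(task_id: str) -> str:
    """Deterministically route a stable slice of traffic to the candidate."""
    rng = random.Random(task_id)  # seeded so the same task always routes the same way
    return "gpt-5.4" if rng.random() < CANDIDATE_SHARE else "incumbent-model"

def promote(candidate_score: float, incumbent_score: float) -> bool:
    """Promote only if the candidate clears the gate on your domain evals."""
    return candidate_score >= incumbent_score * QUALITY_GATE

assignments = [route(f"task-{i}") for i in range(1000)]
share = assignments.count("gpt-5.4") / len(assignments)
print(f"candidate share: {share:.1%}")
print(promote(0.81, 0.83))  # True: 0.81 >= 0.83 * 0.95
```

Seeding the router on the task ID keeps routing stable across retries, so the two arms stay comparable while you accumulate eval results.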

If You’re a Knowledge Worker in a Sector GPT-5.4 Targets

The 83% GDPval score deserves a direct conversation rather than euphemism. For professionals in legal, financial analysis, software development, and research synthesis — the categories explicitly covered by that benchmark — GPT-5.4 can now produce first-draft output at professional quality on the majority of tasks you’re currently being paid to do.

That doesn’t mean your role disappears next week. It means the leverage calculus is shifting. The professionals who adapt fastest — who learn to direct, evaluate, and validate AI output efficiently — will compress the work of two or three roles into one. The ones who don’t adapt will face structural pressure on compensation and headcount as organizations discover that the AI handles 83% of what they were paying full-time professionals to do.

The actionable step right now: identify which 17% of your work GPT-5.4 demonstrably cannot do, and invest in getting better at it. That’s where your competitive advantage will live for the next 24 to 36 months.

For a thorough comparison of the top AI tools available to professionals and developers today, see our guide to the best AI tools in 2026 — including which models to use for each category of work. And if you’re evaluating which AI subscription is worth the cost, our analysis of the best AI apps for iPhone and Android in 2026 covers the consumer-facing options in detail.


The Verdict

GPT-5.4 is the most capable model OpenAI has shipped, and the 1-million-token context window closes the gap with Google that had become an embarrassing competitive liability. The native computer use capability is a genuine architectural shift — not a gimmick — and the tool search design solves a real economic problem for production agent deployments.

But the launch also reveals something the hype cycle tends to obscure: the frontier has converged. GPT-5.4 and Gemini 3.1 Pro are tied on the Artificial Analysis Intelligence Index. Claude Opus 4.6 sits 4 points behind. These are different tools for different jobs, not different leagues of capability.

The real transformation isn’t which model wins. It’s that three companies have now shipped models that can independently handle professional-quality knowledge work at scale. The question of who benefits and who faces pressure from that shift doesn’t wait for the next model release.

It’s already playing out.