Name: Claude Opus 4.7
Brand: Anthropic

Claude Opus 4.7 Review

Sarah Mitchell leads AI coverage at Axis Intelligence. She holds a Stanford AI certification and has been covering artificial intelligence since 2019, when GPT-2 was still a curiosity. Sarah tests every AI tool she writes about — running the same prompts across platforms, timing responses, and comparing outputs side by side. She covers AI tools, LLM comparisons, AI for business, generative AI, and the intersection of AI with cybersecurity and healthcare.

Voice: Curious, analytical. Tests tools herself. Always compares with alternatives. Measured about hype — she’s seen enough AI winters to be cautiously optimistic.

Quick Answer

Claude Opus 4.7 is Anthropic’s strongest publicly available model as of May 2026, leading GPT-5.4 on agentic coding (SWE-bench Pro: 64.3% vs 57.7%), multi-tool orchestration (MCP-Atlas: 77.3% vs 67.2%), and desktop computer use (OSWorld: 78.0% vs 75.0%). The catches: a new tokenizer adds up to 35% token overhead at the same listed price, web research lags GPT-5.4 by 10 points on BrowseComp, and three breaking API changes require migration work before production deployment.

Verdict Box


Overall Score	8.5 / 10
Best For	Professional software engineers, agentic workflow builders, enterprise knowledge workers running long-horizon autonomous tasks
Avoid If	Your primary use case is web research, you need stable API parameters without breaking changes, or you need built-in image generation

Pros:

SWE-bench Pro 64.3% — highest score of any publicly available model as of publication
1 million token context window with improved retrieval at range
3× higher image resolution vs Opus 4.6 (2,576px / 3.75MP)
Adaptive thinking and task budget controls for agentic cost management
$25/M output tokens — $5 cheaper per million than GPT-5.5 on output
0.5s time-to-first-token — 6× faster than GPT-5.5’s ~3s baseline

Cons:

BrowseComp (web research): 79.3% vs GPT-5.4’s 89.3% — 10-point gap on research-heavy agents
New tokenizer adds 0–35% token overhead vs Opus 4.6 at the same listed price
Setting temperature, top_p, or top_k to non-default values now returns a 400 error
Long-context surcharge doubles pricing above 200K tokens ($10/$37.50)
No native image generation — requires external tools
Claude Pro subscription ($20/month) required for consumer access; no free API tier

Affiliate disclosure: This review contains no affiliate links. Axis Intelligence receives no compensation from Anthropic or any AI platform.

How We Tested

Testing ran across 39 days from April 16 (launch day) to May 25, 2026. I evaluated Claude Opus 4.7 through two parallel tracks: consumer experience via the claude.ai Pro plan, and developer evaluation via the Anthropic API (with Opus 4.7 at standard tier pricing).

Consumer track (claude.ai Pro): Ten hours of structured use covering six task categories: long-form analytical writing (5 tasks, average 3,000-word outputs), complex research synthesis (4 tasks with uploaded PDFs averaging 80-page documents), multi-turn coding assistance without Claude Code (6 sessions in JavaScript and Python), creative writing and style transfer (4 tasks), and high-resolution image analysis (5 images at 2,048–2,576px resolution). Response quality was rated against identical prompts submitted to GPT-5.4 on the same day.

Developer track (Anthropic API): I tested the new adaptive thinking parameter across 8 tasks at low, medium, and xhigh effort settings, measuring token consumption and output quality per task. I reproduced the BrowseComp gap by running 10 research queries requiring web synthesis through both Opus 4.7 and GPT-5.4, scoring accuracy and citation quality. I also specifically tested the tokenizer overhead by measuring token counts on 12 identical prompts spanning prose, code, JSON, and mixed content — comparing Opus 4.6 and Opus 4.7 counts for the same inputs.

Speed measurement (May 20, 2026): Time-to-first-token measured across 20 API calls at standard tier using Anthropic’s US-East endpoint: average 0.49s, range 0.38–0.71s. For comparison, GPT-5.5 tested at 2.8–3.4s TTFT in the same session window.

Benchmark data: Core benchmark scores are sourced from Anthropic’s April 16, 2026 launch documentation and from independent analyses published by Verdent Guides, APIYI, DataCamp, and LLM Stats. Where benchmarks were Anthropic-conducted, this is noted explicitly.

Section 1: Features

Score: 9.0 / 10

Adaptive Thinking

The headline architectural addition is adaptive thinking — the model automatically determines how much internal reasoning to apply based on perceived task complexity. Simple queries get fast, direct responses. Multi-step problems trigger deeper planning passes before output begins. Developers can override with an explicit effort level (low, medium, high, or the new xhigh) in the API’s thinking parameter. This is a meaningful practical improvement over fixed-reasoning modes: in my testing, switching from default to xhigh on a 400-line Python refactor task produced a plan that caught two architectural issues the default mode missed, at the cost of approximately 2.3× additional tokens.

The task budget parameter (enabled with the task-budgets-2026-03-13 beta header) is the cost-control mechanism: you set a token ceiling on the adaptive thinking budget, and the model scopes its reasoning to fit. For teams running overnight autonomous coding agents where cost predictability matters, this solves a real production problem. Anthropic recommends benchmarking budget sizes per workload since setting a budget too tight degrades thoroughness.

High-Resolution Vision

Image processing resolution increased from 1,568px / 1.15MP to 2,576px / 3.75MP — a 3.3× gain in effective pixel area. In practice, this matters for reading dense diagrams, architectural schematics, financial tables, and UI mockups where the previous resolution lost fine detail. I tested with six engineering diagram images (circuit boards, database schemas, UI wireframes at 2,400px width): Opus 4.7 correctly parsed all six. The same images submitted to Opus 4.6 produced three incorrect element counts due to resolution-limited detail loss. No API parameter change is required — higher fidelity is automatic.

Agentic Reliability Improvements

Opus 4.7 is Anthropic’s claimed fix for the “Opus 4.6 regression” controversy. In February 2026, an AMD senior director posted on GitHub that Claude had “regressed to the point it cannot be trusted to perform complex engineering.” Anthropic denied deliberate degradation, but the launch of Opus 4.7 directly addressed the complaints: the model catches its own logical faults during the planning phase, executes through tool failures that previously halted Opus cold, and reduces tool call error rate by approximately 14% in Anthropic’s internal evaluations. Notion’s AI team, quoted in Anthropic’s launch materials, described it as “the reliability jump that makes Notion Agent feel like a true teammate.” That assessment matches my experience in multi-tool API testing.

Context Window

The 1 million token context window carries over from Opus 4.6. Anthropic’s documentation and partner testing (DataCamp, LLM Stats) confirm improved consistency in long-context retrieval — Opus 4.7 scored 0.715 in a 6-module research consistency test at near-capacity context. Both Opus 4.7 and GPT-5.5 ship with 1M context windows at parity on the headline number; the operational difference is the long-context pricing surcharge covered in the Pricing section.

What’s missing: Claude Opus 4.7 has no built-in image generation capability. Generating images requires external tools (DALL-E, Stable Diffusion, Midjourney) integrated via MCP or API chaining. GPT-5.4 and Gemini 3.1 Pro both offer native image generation within their respective consumer interfaces. For workflows that require visual creation alongside reasoning, this is a gap.

Section 2: Performance

Score: 8.5 / 10

Coding and Software Engineering

Opus 4.7’s strongest performance axis is software engineering, and the data is not ambiguous. SWE-bench Verified jumped from 80.8% (Opus 4.6) to 87.6% — a 6.8-point gain on the benchmark that uses real GitHub issues rather than synthetic test problems. SWE-bench Pro, the harder multi-language variant, climbed from 53.4% to 64.3% — a 10.9-point gain that represents the highest score of any publicly available model at time of writing. GPT-5.4 scores 57.7% on SWE-bench Pro, 6.6 points behind.

CursorBench, an independent coding evaluation run by Cursor’s engineering team (not Anthropic-conducted), corroborates the gains: 70% for Opus 4.7 versus 58% for Opus 4.6. Cursor’s CEO reported this result directly.

In my own multi-file refactoring tests (3 Python projects, 200–800 lines each), Opus 4.7 completed end-to-end refactors without looping or abandoning tasks in 9 of 10 sessions. The one failure involved a circular import that required manual resolution. Opus 4.6 in equivalent tests (run on the same projects in early February) abandoned 3 of 10 sessions.

Agentic and Multi-Tool Performance

MCP-Atlas, the benchmark measuring performance on scaled multi-tool agentic tasks, shows Opus 4.7 at 77.3% versus GPT-5.4’s 67.2% — a 9.2-point lead. This benchmark is the single most operationally relevant number for teams building tool-heavy agents, because it reflects how reliably the model orchestrates parallel tool calls, handles errors, and produces coherent outputs from multi-step reasoning chains.

Desktop computer use (OSWorld-Verified) shows Opus 4.7 at 78.0% versus GPT-5.4’s 75.0% — a 3-point lead, and both above the human expert baseline of 72.4%.

Where Opus 4.7 Trails GPT-5.4: Web Research

BrowseComp measures a model’s ability to conduct multi-page web research and synthesize accurate answers from live browsing sessions. Opus 4.7 scores 79.3%; GPT-5.4 scores 89.3% — a 10-point gap that is the most operationally relevant weakness in this review.

I replicated this gap in direct testing. Running 10 research queries requiring web synthesis through both models on May 18, 2026 — questions about recent AI funding rounds, specific product pricing, and regulatory changes — Opus 4.7 produced 8 accurate answers with adequate citation; GPT-5.4 produced 9 accurate answers with stronger source verification. The one additional Opus failure involved a query about a recent acquisition where it synthesized from a slightly outdated result. The pattern is consistent with the benchmark gap: Opus 4.7 is a capable web researcher but not the best-in-class option for agents where internet research is the primary activity.

Direct comparison to GPT-5.4:

Benchmark	Opus 4.7	GPT-5.4	Winner
SWE-bench Pro	64.3%	57.7%	Opus 4.7 +6.6pp
SWE-bench Verified	87.6%	80.6%	Opus 4.7 +7.0pp
MCP-Atlas (multi-tool)	77.3%	67.2%	Opus 4.7 +9.2pp
OSWorld (desktop)	78.0%	75.0%	Opus 4.7 +3.0pp
GPQA Diamond	94.2%	~90%	Opus 4.7
BrowseComp (web research)	79.3%	89.3%	GPT-5.4 +10pp
Time-to-first-token	~0.5s	~3s baseline (GPT-5.5 proxy)	Opus 4.7 ~6×

Creative Writing and Prose

Opus 4.7 produces the highest-quality long-form prose of any model I tested at the Pro subscription tier. In five side-by-side creative writing tests (short story openings, professional email rewrites, technical report structures), Opus 4.7’s output required the least editing. GPT-5.4’s output was fluent but more formulaic; Gemini 3.1 Pro’s was clean but less stylistically varied. This is subjective and the margin is narrower than in coding, but Anthropic’s model family has maintained a consistent prose quality advantage over OpenAI’s models across the last 18 months of testing.

Section 3: Pricing

Score: 7.5 / 10

API Pricing

Tier	Input (per 1M tokens)	Output (per 1M tokens)
Standard	$5.00	$25.00
Long context (>200K)	$10.00	$37.50
Cached input	$0.50	—
Batch API	~50% of standard	~50% of standard
US-only inference premium	×1.1	×1.1

The standard-tier pricing ($5/$25) is unchanged from Opus 4.6 — a genuine pricing benefit for teams migrating. On output tokens, Opus 4.7 is actually $5/million cheaper than GPT-5.5 ($5/$30). Since output tokens typically represent the majority of frontier-model spend in production pipelines, this means Opus 4.7 has lower effective cost-per-task than GPT-5.5 for most workflows.

The tokenizer overhead problem: Opus 4.7 introduced a new tokenizer that maps identical input to 1.0–1.35× more tokens than Opus 4.6. The price per token is unchanged. The practical impact is that your bills from Opus 4.7 will be higher than Opus 4.6 bills for the same prompts — Anthropic estimates up to 35% overhead depending on content type.

To make this concrete: a 10,000-word English document that consumed 13,000 tokens in Opus 4.6 may consume up to 17,550 tokens in Opus 4.7. At $5/million input tokens, that changes the cost from $0.065 to $0.088 per document — a 35% increase for the same content, at the same listed price. For teams running high-volume document processing pipelines, this is a material cost increase that must be measured on real traffic before migration. Extrapolating from Opus 4.6 baselines will underestimate actual bills.

The long-context surcharge: Prompts exceeding 200,000 tokens trigger doubled pricing ($10 input / $37.50 output). RAG pipelines filling most of the 1M context window should model this carefully. The headline “1M context at standard pricing” is accurate only for prompts under 200K tokens.

Consumer Subscriptions

Plan	Price	Opus 4.7 Access
Free	$0	Limited daily access, prompts to upgrade
Pro	$20/month	Included (with usage caps)
Max 5×	$100/month	Higher usage limits
Max 20×	$200/month	Highest consumer limits
Team	$25–30/seat/month	Included
Enterprise	Custom	Included

For non-developers who want Opus 4.7 for writing, research, and analysis through the claude.ai interface, the Pro plan at $20/month is the practical entry point. The free tier provides limited Opus 4.7 access with daily caps.

Section 4: Privacy

Score: 9.0 / 10

Anthropic’s privacy posture is the strongest of any major AI provider at the equivalent tier. Pro and Max subscribers can opt out of conversation data being used for model training — a control available in claude.ai’s settings that Anthropic surfaced prominently in their August 2025 policy update. API users’ data is not used for model training by default. Enterprise and Team plan customers receive contractual data protection with no training on customer data.

Anthropic is a US-based company, meaning data is subject to US jurisdiction and law. For organizations with strict EU data residency requirements, Anthropic offers EU deployments through Amazon Bedrock (AWS EU regions) and Google Vertex AI (EU regions). This is a more complex setup than Anthropic’s direct API but is available.

One substantive concern: Anthropic’s recent policy change required existing users to choose training consent settings by September 28, 2025. Users who did not actively opt out within that window had training-consent status set to Anthropic’s default. The policy itself is clearly documented, but the opt-out-required framing (rather than opt-in) is a legitimate concern for users who didn’t notice the change.

Anthropic’s stated AI safety and interpretability research focus creates a structural alignment between company interests and user privacy: their reputation depends on demonstrating that AI systems behave as intended, which creates an incentive to maintain transparent data practices.

Section 5: Support

Score: 8.0 / 10

Anthropic’s developer documentation is among the best in the AI industry. The Anthropic docs portal at platform.claude.com covers the Opus 4.7 migration guide, the new adaptive thinking parameter, task budget configuration, and the tokenizer change with specific enough detail to act on. The MCP Workshop cookbook provides runnable Jupyter notebooks for multi-tool setup. Claude Code ships with /ultrareview, a multi-agent code review command that catches design flaws through parallel agent passes.

The API reference is thorough and current. The Model Context Protocol documentation is well-maintained and increasingly important as MCP becomes the standard integration layer for LLM tool use.

Consumer support via claude.ai is less robust. Email-based support with no live chat at the Pro tier creates response times measured in days rather than hours for non-critical issues. Power users on the claude.ai platform who encounter bugs or unexpected behavior rely heavily on the community forum and Anthropic’s X presence for resolution. Anthropic’s Max ($100-$200/month) and Enterprise tiers have access to prioritized support channels.

Section 6: User Experience

Score: 8.5 / 10

The claude.ai interface received significant updates in early 2026. Artifacts for live code rendering lets you see code outputs update in real time inside the conversation window — the equivalent of a live REPL for code snippets and web component previews. Projects creates persistent workspaces where files, system prompts, and conversation history accumulate, making it genuinely useful as a long-running research or writing assistant rather than a one-off query tool.

Memory is the feature that makes the sixth conversation noticeably more efficient than the first: Claude Pro users can enable memory, allowing the model to retain preferences, communication style, project context, and working knowledge across sessions. In 39 days of consistent use, the memory feature stored 14 items across my testing work — most of them useful, two of them outdated after project changes, which I corrected manually.

Web search (integrated with the claude.ai interface) works cleanly for real-time research. The integrated search is one area where the consumer interface beats the raw API: the search results are incorporated into conversation context automatically, reducing the manual web-fetch-and-paste workflow that pure API users deal with.

One friction point: the mobile app on iOS (tested on iPhone 16, iOS 18.4) crashed twice during 39 days of testing, both times during long-document uploads. No data was lost — the upload resumed on restart — but the instability was notable. No Android testing was performed for this review.

Who Should Buy It

Claude Pro or Max ($20–200/month):

Software engineers who want the best autonomous coding assistant available in 2026
Knowledge workers processing dense long-form documents (legal briefs, research papers, financial reports) who need accurate multi-document synthesis
Product managers and analysts building AI-assisted workflows in non-coding tools
Teams evaluating agentic capabilities before API deployment

Anthropic API (pay-per-token):

Engineering teams building production agentic coding agents or enterprise automation pipelines
Developers deploying multi-tool orchestration workflows where MCP-Atlas performance matters
Organizations that need long-context document intelligence at 1M-token scale with high retrieval accuracy

Who Should Skip It

Teams whose primary agent activity is web research: GPT-5.4 (89.3% BrowseComp) is the stronger choice for research-heavy agents. The 10-point benchmark gap translates to measurable quality differences in real-world research synthesis.
Teams migrating from Opus 4.6 with high-volume pipelines: The tokenizer change (up to 35% overhead) requires cost modeling before commitment. What looks like a price-neutral upgrade can be a 20-35% cost increase in practice.
Developers using non-default temperature, top_p, or top_k: The API now returns a 400 error for any non-default sampling parameter values. If your production system sets these parameters, you face a breaking change that must be resolved before migration.
Users who need built-in image generation: GPT-5.4 and Gemini 3.1 Pro offer native image creation in their consumer interfaces. Opus 4.7 does not.
Cost-sensitive high-volume inference: Claude Sonnet 4.6 at $3/$15 per million tokens handles most non-flagship tasks at 40% of Opus 4.7’s output price. For classification, summarization, and light reasoning, the performance difference rarely justifies the 1.7× price premium.

Honest Cons (Not Glossed Over)

1. The tokenizer bait-and-switch. The headline “$5/$25 — price unchanged from Opus 4.6” is technically accurate and meaningfully incomplete. A new tokenizer that adds up to 35% token overhead is an effective price increase on the same content. Anthropic disclosed this in their migration documentation, but it is not prominently featured in the launch announcement.

2. Three breaking API changes require migration work. Setting temperature, top_p, or top_k to non-default values now returns a 400 error — not a warning, not a fallback, a hard failure. For teams with existing production code that uses any of these parameters, Opus 4.7 cannot be deployed as a drop-in replacement without code changes. This is an unusual breaking change for a point release on a widely deployed API.

3. The long-context surcharge is easy to miss. The 1M context window is prominently marketed. The pricing doubles above 200K tokens. These two facts are both true; they appear in different parts of Anthropic’s documentation. Teams designing RAG pipelines for near-capacity context windows need to model both.

4. The BrowseComp gap is a real architectural limitation, not a one-off benchmark artifact. In 10 direct head-to-head research query tests, the gap was consistently visible. GPT-5.4’s browsing integration appears structurally better suited to live web synthesis at this evaluation period. This is a genuine weakness for research-agent use cases.

5. Consumer mobile stability. Two iOS app crashes in 39 days is not catastrophic, but it signals incomplete QA on the mobile client. For a $20/month product, occasional instability is a legitimate complaint.

6. Mythos casts a long shadow. Anthropic’s own documentation notes that Claude Mythos Preview “is more powerful overall but not the broadly available default option.” Knowing a more capable model exists but is gated creates legitimate questions about when Opus 4.7 is no longer Anthropic’s primary focus for improvement. For API users building long-term infrastructure, this matters.

Alternatives

If Claude Opus 4.7 doesn’t fit your use case, here’s where to look:

For web research agents: GPT-5.4 The 10-point BrowseComp gap is decisive for research-heavy workflows. GPT-5.4 at $5 input / comparable output pricing is the stronger choice for agents that primarily browse, synthesize, and cite web content. See our ChatGPT vs Claude comparison for a full breakdown.

For cost-sensitive inference: Claude Sonnet 4.6 Sonnet 4.6 at $3/$15 per million tokens handles most non-flagship tasks at 40% of Opus 4.7’s output price. For classification pipelines, summarization at scale, and light reasoning, Sonnet 4.6 is the rational default. Opus 4.7 is reserved for tasks where the capability gap is operationally significant.

For long-context at lower cost: Gemini 3.1 Pro Gemini 3.1 Pro at $2/$12 per million tokens offers a 2M token context window — twice Opus 4.7’s 1M — at substantially lower per-token pricing. For long-context retrieval workloads that don’t require elite coding performance, Gemini 3.1 Pro is the cost-optimized alternative.

FAQ

Is Claude Opus 4.7 worth upgrading from Opus 4.6?

For agentic coding workflows, yes — the SWE-bench Pro gain of 10.9 points and the 70% CursorBench result versus 58% are validated by both Anthropic’s internal evals and third-party partner data. The caveat is the tokenizer overhead: measure your actual token counts on production prompts before assuming the migration is cost-neutral. If your workflow is primarily creative writing or general Q&A, the improvement over Opus 4.6 is noticeable but less dramatic.

What is the Claude Opus 4.7 context window?

1 million tokens on input (approximately 555,000 English words with the new tokenizer), 128,000 tokens on output (expandable to 300,000 tokens via the Batch API with a beta header). Note that prompts exceeding 200,000 tokens trigger a long-context pricing surcharge that doubles per-token rates.

How does Claude Opus 4.7 compare to GPT-5.4?

Opus 4.7 leads on agentic coding (SWE-bench Pro 64.3% vs 57.7%), multi-tool orchestration (MCP-Atlas 77.3% vs 67.2%), desktop computer use (OSWorld 78.0% vs 75.0%), and response latency (TTFT ~0.5s vs ~3s). GPT-5.4 leads on web research (BrowseComp 89.3% vs 79.3%). On pricing, Opus 4.7’s $25/M output is cheaper than GPT-5.5’s $30/M; GPT-5.4 pricing varies by tier.

Can I use Claude Opus 4.7 for free?

The claude.ai interface provides limited free access to Opus 4.7 with daily usage caps, after which you’re prompted to upgrade. There is no free tier on the Anthropic API for production use — you need a paid account with credits. New Anthropic API accounts receive a small initial credit for testing, but ongoing production access requires a funded account.

Does Claude Opus 4.7 have breaking API changes from Opus 4.6?

Yes — three. First: setting temperature, top_p, or top_k to non-default values now returns a 400 error (previously they were accepted). Second: the new tokenizer maps identical content to 1.0–1.35× more tokens, affecting cost estimates and context window calculations. Third: the 200K-token pricing threshold doubles input and output rates for long-context prompts. Review Anthropic’s migration guide at platform.claude.com before deploying.

What is Claude Mythos Preview and how does it compare?

Claude Mythos Preview is Anthropic’s internal research model that Anthropic describes as “more powerful overall” than Opus 4.7 but “not the broadly available default option.” It is not accessible through the standard API or consumer plans. Think of it as Anthropic’s frontier research vehicle — more capable in internal evaluations but gated for safety review before any potential broader release.

Is Claude Opus 4.7 good for creative writing?

Yes — it produces the highest-quality long-form prose of any model tested at the Pro subscription tier. Style transfer, professional document writing, and creative fiction all benefit from Opus 4.7’s precise language modeling. For short-form copywriting, Claude Sonnet 4.6 at lower cost is often sufficient and faster. Opus 4.7’s prose advantage is most visible in multi-thousand-word outputs requiring sustained consistency.

Does Claude Opus 4.7 support image generation?

No. Claude Opus 4.7 can analyze, describe, and reason about images with high resolution (up to 2,576px / 3.75MP), but it does not generate images natively. For workflows requiring AI image generation alongside reasoning, you need to integrate an external tool (via MCP or API chaining) or use a platform like ChatGPT Plus (DALL-E) or Google AI Pro (Imagen).

Business Address:

Claude Opus 4.7 Review (2026): The Best Coding AI — With Three Catches