AI | May 28, 2026

Claude Opus 4.8 vs Opus 4.7: What Actually Changed (2026)

Anthropic shipped Claude Opus 4.8 on May 28, 2026, roughly a month after Opus 4.7. If you were expecting a dramatic leap across the board, this is not quite that release, but the coding gains are larger than the "incremental" label suggests. The standard list price is identical to 4.7 ($5 per million input tokens, $25 per million output tokens), the 1M token context window and 128K max output both carry over, and Opus 4.8 wins every benchmark in Anthropic's published table. Two things stand out beyond the scores: a new fast mode that runs the same model about 2.5 times faster as a paid premium tier, and the fact that 4.8 tends to resolve the same tasks while spending fewer reasoning tokens, which in our testing lowers the effective cost per resolved task on the standard tier.

That reframes the decision. For a frontier model on a stable price, the question is rarely "is the new one better." It almost always is. The question is whether the improvement is large enough to justify re-qualifying your prompts and re-running your evals, because any model swap in production carries that cost. Below we walk through exactly what changed, where the token savings come from, what to re-check before you migrate, and who should move now versus wait.

Comparison Table: Opus 4.8 vs Opus 4.7

| Dimension | Claude Opus 4.8 | Claude Opus 4.7 |
| --- | --- | --- |
| Released | May 28, 2026 | April 2026 |
| Context window | 1M tokens | 1M tokens |
| Max output | 128K tokens | 128K tokens |
| Standard input cost | $5 / 1M tokens | $5 / 1M tokens |
| Standard output cost | $25 / 1M tokens | $25 / 1M tokens |
| Fast mode pricing | $10 / 1M input, $50 / 1M output | not available |
| Fast mode speed | about 2.5x faster, same model | not available |
| Agentic coding (SWE-Bench Pro) | 69.2% | 64.3% |
| Agentic terminal coding (Terminal-Bench 2.1) | 74.6% | 66.1% |
| Multidisciplinary reasoning (Humanity's Last Exam, no tools) | 49.8% | 46.9% |
| Multidisciplinary reasoning (Humanity's Last Exam, with tools) | 57.9% | 54.7% |
| Agentic computer use (OSWorld-Verified) | 83.4% | 82.8% |
| Knowledge work (GDPval-AA, ELO) | 1890 | 1753 |
| Agentic financial analysis (Finance Agent v2) | 53.9% | 51.5% |
| Best at | agentic coding, terminal tasks, knowledge work | agentic coding, general reasoning |

What Actually Changed

The benchmark deltas are consistent and, in a couple of places, larger than the "incremental" framing suggests. Opus 4.8 wins every benchmark in Anthropic's published table. The two biggest gains are in coding: Terminal-Bench 2.1 jumps from 66.1% to 74.6% (a gain of 8.5 points, or about 13% in relative terms, the largest relative move in the table), and SWE-Bench Pro climbs from 64.3% to 69.2% (a gain of 4.9 points). Knowledge work also moves noticeably, with the GDPval-AA ELO rising from 1753 to 1890, a gain of 137 rating points. Reasoning improves on Humanity's Last Exam in both settings (49.8% versus 46.9% with no tools, 57.9% versus 54.7% with tools), Finance Agent v2 ticks up from 51.5% to 53.9%, and agentic computer use on OSWorld-Verified edges from 82.8% to 83.4%. None of these is a generational leap on its own, but the coding deltas in particular are large enough to show up in daily work.
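To make the deltas easy to check, here is a minimal Python sketch that recomputes the absolute and relative gain for each row of the comparison table. The scores are copied from the table; the function name `deltas` and the row labels are ours. The relative figure for the ELO row is included only for side by side comparison, since a percentage change on a rating scale is not strictly meaningful.

```python
# Scores copied from the comparison table as (Opus 4.8, Opus 4.7).
SCORES = {
    "SWE-Bench Pro": (69.2, 64.3),
    "Terminal-Bench 2.1": (74.6, 66.1),
    "HLE (no tools)": (49.8, 46.9),
    "HLE (with tools)": (57.9, 54.7),
    "OSWorld-Verified": (83.4, 82.8),
    "Finance Agent v2": (53.9, 51.5),
    "GDPval-AA (ELO)": (1890, 1753),
}


def deltas(scores):
    """Return {benchmark: (absolute_gain, relative_gain_percent)}, rounded to 1 decimal."""
    result = {}
    for name, (new, old) in scores.items():
        absolute = round(new - old, 1)
        relative = round((new - old) / old * 100, 1)
        result[name] = (absolute, relative)
    return result


if __name__ == "__main__":
    # Print rows sorted by relative gain, largest first.
    rows = sorted(deltas(SCORES).items(), key=lambda item: item[1][1], reverse=True)
    for name, (absolute, relative) in rows:
        print(f"{name:20s} +{absolute} ({relative}% relative)")
```

Sorted this way, Terminal-Bench 2.1 comes out on top at roughly 12.9% relative, with GDPval-AA and SWE-Bench Pro both under 8%.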

Both models share the same large output ceiling: 128K tokens. If you generate large diffs, long structured documents, or multi file edits in a single response, that headroom carries over from 4.7 to 4.8, so you do not need to revisit chunking logic on the upgrade. Both also share the same 1M token context window.

The token efficiency change is the quiet headliner. Opus 4.7 had a habit of over explaining in its scratch reasoning, narrating intermediate steps it did not need to surface, which inflated output token counts on every task. In our testing, Opus 4.8 produces materially less of that filler. It reaches the same answer with fewer tokens spent thinking out loud. Because output tokens are the expensive side of the bill at $25 per million, trimming scratch verbosity lowers real spend without touching the standard price sheet.

Opus 4.8 also introduces fast mode, but it is worth being precise about what it is. Fast mode is a separate premium serving tier, priced at $10 per million input and $50 per million output, that runs the exact same Opus 4.8 model about 2.5 times faster than standard. There is no quality downgrade and no distilled substitute: you stay on the full model and simply pay more to get tokens sooner. Anthropic notes that this fast tier is roughly three times cheaper than fast serving was on previous models. For latency sensitive workloads, this is the real headline of the release. Beyond serving, tool use and multi step planning are more reliable, instruction following on structured output is tighter, and dynamic workflows now support hundreds of parallel subagents in a single session.

The Cost Story Is About Tokens, Not Price

Here is the part teams misread. The standard list price did not change. Both models are $5 per million input and $25 per million output on the standard tier. If you compare the two purely on the rate card, they look identical, and you would conclude there is no cost reason to upgrade. That conclusion is incomplete, because what you actually pay is rate times tokens, and in our testing 4.8 uses fewer tokens to finish the same job.

Two effects stack. First, less scratch verbosity means fewer output tokens per task on average. Second, the higher resolution rate (69.2% versus 64.3% on SWE-Bench Pro) means fewer expensive retries and fewer half finished tasks you have to pay for and then redo. Based on our own runs we estimate this pulls the average cost per resolved task down by roughly 10 to 15% at the same advertised standard rate. That figure is our editorial estimate from internal testing, not an official Anthropic number; your own evals are the only reliable source for your traffic.
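One way to see how the two effects stack is a simple retry model, sketched below in Python. It rests on loud assumptions: every failed task is retried from scratch until it succeeds, attempts are independent, and the SWE-Bench Pro scores stand in for your real success rate. The 10% output token reduction is a hypothetical placeholder, not a measured value. None of this is Anthropic's methodology, and the function names are ours.

```python
INPUT_RATE = 5.00 / 1_000_000    # dollars per input token, standard tier
OUTPUT_RATE = 25.00 / 1_000_000  # dollars per output token, standard tier


def cost_per_attempt(input_tokens, output_tokens):
    """Standard tier cost of a single attempt, in dollars."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE


def cost_per_resolved_task(input_tokens, output_tokens, success_rate):
    """Expected spend until one success.

    Assumes independent retries, so the expected number of attempts
    is 1 / success_rate (a geometric model).
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt(input_tokens, output_tokens) / success_rate


if __name__ == "__main__":
    # Benchmark scores used as stand-ins for success rates (assumption).
    baseline = cost_per_resolved_task(50_000, 5_000, 0.643)
    same_tokens = cost_per_resolved_task(50_000, 5_000, 0.692)
    # Hypothetical: 10% fewer output tokens on the newer model.
    fewer_tokens = cost_per_resolved_task(50_000, 4_500, 0.692)

    print(f"saving from success rate alone: {1 - same_tokens / baseline:.1%}")
    print(f"saving with 10% fewer output tokens: {1 - fewer_tokens / baseline:.1%}")
```

Under these assumptions the success rate difference alone is worth about 7% per resolved task, and a hypothetical 10% cut in output tokens brings the total to about 10%. Treat that as an illustration of the mechanism rather than a confirmation of the estimate above; swap in token counts and pass rates from your own evals.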

Consider a concrete workload: requests averaging 50K input tokens and 5K output tokens, at 10,000 requests per day. On either model the nominal standard list cost per request is the same, because the rates are the same: 50K input at $5 per million is $0.25, and 5K output at $25 per million is $0.125, for $0.375 per request, or about $3,750 per day across 10,000 requests. That number is identical on both models if you assume identical token counts.

The catch is that the token counts are not identical. In our testing, 4.7 emits more scratch output per task and fails slightly more often, so its true output token volume and its retry volume both run higher than the clean estimate. Opus 4.8 lands closer to the nominal figure and resolves more on the first pass, so your actual monthly invoice tends to come in lower even though the standard per token rate never moved.

One important caveat: if you opt into fast mode, the rate doubles to $10 per million input and $50 per million output, so that tier is a deliberate latency for cost trade, not a free speedup. The standard list price is a floor, not the spend; the spend is governed by how many tokens each model burns to get to done and which serving tier you choose.
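The nominal arithmetic for this workload can be reproduced with a few lines of Python. This is a sketch using the standard tier rates quoted in the article; the function name `workload_cost` and its parameters are ours, and `Decimal` is used only to keep the dollar figures exact.

```python
from decimal import Decimal

STANDARD_INPUT_PER_M = Decimal("5")    # dollars per 1M input tokens
STANDARD_OUTPUT_PER_M = Decimal("25")  # dollars per 1M output tokens
TOKENS_PER_M = Decimal(1_000_000)


def workload_cost(input_tokens, output_tokens, requests_per_day,
                  input_per_m=STANDARD_INPUT_PER_M,
                  output_per_m=STANDARD_OUTPUT_PER_M):
    """Return (cost per request, cost per day) in dollars at the given rates."""
    per_request = (input_tokens * input_per_m
                   + output_tokens * output_per_m) / TOKENS_PER_M
    return per_request, per_request * requests_per_day


if __name__ == "__main__":
    per_request, per_day = workload_cost(50_000, 5_000, 10_000)
    print(f"per request: ${per_request}")
    print(f"per day:     ${per_day:,.2f}")
```

Because the rates are parameters, you can rerun the same function with your measured average token counts for each model to see where the real invoices diverge from the nominal $0.375 per request.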

Migration: What to Re-Check

Prompts are largely drop in compatible between 4.7 and 4.8, so this is not a rewrite. It is a re-qualification. Treat it as you would any model bump in a production path.

Re-run your full eval suite before you flip traffic. A benchmark win on Anthropic's table does not guarantee improvement on your specific task distribution, and you want to see the deltas on your own evals, not the published ones. Re-validate any prompt that was tuned around 4.7 verbosity: if you added instructions to suppress over explaining, or if downstream parsing depended on the shape of 4.7 scratch output, those assumptions may no longer hold now that 4.8 is naturally terser.

Output ceilings are unchanged at 128K on both models and the context window stays at 1M, so chunking logic built for 4.7 should carry over without edits. Decide explicitly whether any latency sensitive paths should use fast mode, and remember that doing so doubles the rate to $10 input and $50 output per million, so model that cost before you flip it on.

Confirm caching behavior in staging rather than assuming it, and verify your cache hit rates hold. Watch structured output paths closely, since tighter instruction following can change exact formatting in ways your strict parsers notice.
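As a rough illustration of what a re-qualification report might look like, the Python sketch below compares two eval runs over the same cases. It assumes you already have per case results from your own harness; the `EvalResult` shape and the `compare_runs` function are hypothetical and not part of any Anthropic tooling. It surfaces the three things the checklist above cares about: pass rate change, cases that regressed, and the shift in output tokens.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalResult:
    """One eval case outcome from your own harness (hypothetical shape)."""
    case_id: str
    passed: bool
    output_tokens: int


def compare_runs(baseline, candidate):
    """Compare two runs on the cases they share and report the differences."""
    base = {r.case_id: r for r in baseline}
    cand = {r.case_id: r for r in candidate}
    shared = sorted(base.keys() & cand.keys())
    if not shared:
        raise ValueError("the two runs share no eval cases")

    def pass_rate(run):
        return mean(1.0 if run[cid].passed else 0.0 for cid in shared)

    def mean_tokens(run):
        return mean(run[cid].output_tokens for cid in shared)

    return {
        "pass_rate_delta": pass_rate(cand) - pass_rate(base),
        "output_token_delta": mean_tokens(cand) - mean_tokens(base),
        # Cases that passed on the baseline but fail on the candidate.
        "regressions": [c for c in shared if base[c].passed and not cand[c].passed],
        # Cases that failed on the baseline but pass on the candidate.
        "fixes": [c for c in shared if not base[c].passed and cand[c].passed],
    }


if __name__ == "__main__":
    # Made-up results purely to exercise the function.
    run_47 = [EvalResult("a", True, 1000), EvalResult("b", False, 1200),
              EvalResult("c", True, 800)]
    run_48 = [EvalResult("a", True, 900), EvalResult("b", True, 1000),
              EvalResult("c", False, 800)]
    print(compare_runs(run_47, run_48))
```

Listing regressions by case id matters more than the headline pass rate: a run can hold the same overall rate while breaking a case your strict parsers depend on.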

Should You Upgrade?

For agentic coding and structured output workloads, yes. The combination of higher resolution rates (SWE-Bench Pro up to 69.2% and Terminal-Bench 2.1 up to 74.6%) and, in our testing, lower token cost per task makes 4.8 the better default for anything that writes code, edits repositories, or emits constrained formats. The token efficiency alone usually pays back the migration effort, because the savings compound across every request and the eval re-run is a one time cost.

If you are latency sensitive, fast mode is the headline reason to move, with the caveat that it is a paid tier. Running the same full Opus 4.8 model about 2.5 times faster, with no quality downgrade and no smaller model substitution, is a meaningful improvement for interactive tools and anything with a human waiting on output. You pay for it: fast mode is $10 per million input and $50 per million output, double the standard rate, so treat it as a deliberate latency for cost trade rather than a free upgrade.
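To put a number on that trade, the Python sketch below estimates what fast mode costs per second of waiting it removes. It assumes the "about 2.5 times faster" figure applies to the whole response time and that fast mode exactly doubles the request cost, as the quoted rates imply. The 40 second standard latency in the example is a made-up placeholder, and the function name is ours.

```python
def fast_mode_tradeoff(standard_cost_usd, standard_latency_s,
                       speedup=2.5, price_multiplier=2.0):
    """Estimate latency saved and extra spend for one request in fast mode.

    Assumes the speedup applies to end to end latency, which is a
    simplification; measure your own requests before relying on it.
    """
    if speedup <= 1 or standard_latency_s <= 0:
        raise ValueError("speedup must exceed 1 and latency must be positive")
    fast_latency_s = standard_latency_s / speedup
    seconds_saved = standard_latency_s - fast_latency_s
    extra_cost_usd = standard_cost_usd * (price_multiplier - 1.0)
    return {
        "fast_latency_s": fast_latency_s,
        "seconds_saved": seconds_saved,
        "extra_cost_usd": extra_cost_usd,
        "usd_per_second_saved": extra_cost_usd / seconds_saved,
    }


if __name__ == "__main__":
    # $0.375 is the nominal standard cost of a 50K input, 5K output request.
    print(fast_mode_tradeoff(standard_cost_usd=0.375, standard_latency_s=40.0))
```

Framing the premium as dollars per second saved makes it easier to decide path by path: a human waiting in an interactive tool may justify it, while a batch job usually will not.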

If you are still on 4.6 or earlier, the case is even stronger, because you are stacking several releases of improvement at once and the cumulative benchmark and efficiency gains are well past the threshold where re-qualification is obviously worth it. The only group with a reason to wait is teams midway through a delicate eval cycle on 4.7 who cannot spare the re-qualification window right now; for them, schedule the move rather than skip it.

When This Applies to Your Stack

Most of the real work in adopting a new frontier model is not the model swap. It is the plumbing around it. At Contra Collective we build the AI integration layers that make these upgrades boring: gateways that route between models, caching layers that protect your token budget, agent harnesses that keep tool use reliable, and eval pipelines that tell you whether a new model actually helps on your traffic before you ship it.

When that infrastructure exists, moving from Opus 4.7 to 4.8 is a config change and an eval run, not a project. When it does not, every model release turns into a scramble. If you are running models in production and the upgrade path feels riskier than it should, the fix is almost always the harness, not the model. We are happy to help you build or harden that layer so the next release after 4.8 is a one line change.

FAQ

Is Opus 4.8 worth upgrading to from Opus 4.7?

For most agentic coding and structured output workloads, yes. SWE-Bench Pro rises from 64.3% to 69.2%, Terminal-Bench 2.1 from 66.1% to 74.6%, and the GDPval-AA knowledge work ELO from 1753 to 1890. In our testing the token efficiency gain typically covers the cost of re-running your evals. The main reason to wait is if you are mid cycle on a 4.7 eval you cannot pause.

Did the price change?

The standard rate did not. Both Opus 4.7 and Opus 4.8 are $5 per million input tokens and $25 per million output tokens on the standard tier. What is new is fast mode, a separate premium tier on 4.8 at $10 per million input and $50 per million output. On the standard tier, the savings we see on 4.8 come from spending fewer tokens per task, not from a lower rate; the list price is a floor, and what you actually pay depends on how many tokens the model burns to finish.

Is Opus 4.8 a drop in replacement for 4.7?

Largely, yes. Prompts are mostly compatible and the API surface is the same. Treat it as a re-qualification rather than a rewrite: re-run your evals and re-validate any prompts tuned around 4.7 verbosity. The 128K max output and the 1M context window are unchanged from 4.7, so chunking and context assumptions carry over. Verify caching behavior in staging rather than assuming it.

What is fast mode?

Fast mode is a separate premium serving tier on Opus 4.8 that runs the exact same model about 2.5 times faster than standard, with no quality downgrade and no distilled substitute. It is not free: it is priced at $10 per million input and $50 per million output, double the standard rate. Anthropic notes it is roughly three times cheaper than fast serving was on previous models. It is the headline feature if you are latency sensitive and willing to pay for speed.

How much better is Opus 4.8 at coding?

On SWE-Bench Pro, 4.8 scores 69.2% against 4.7's 64.3%, a gain of 4.9 points, and on Terminal-Bench 2.1 it jumps from 66.1% to 74.6%, a gain of 8.5 points. Beyond the raw scores, more reliable tool use, multi step planning, and dynamic workflows with hundreds of parallel subagents make it the better choice for large diffs and multi file edits. For a deeper cross model view, see our SWE-Bench Verified frontier models leaderboard.
