AI | May 31, 2026

Qwen 3 Coder vs Claude Opus 4.8: SWE-Bench Verified Tested (2026)

Qwen 3 Coder is the strongest open weight coding model shipping today. Claude Opus 4.8 is the closed source leader on agentic coding workloads. The conversation about which to use in production usually collapses into a benchmark argument, and the benchmarks alone do not capture what actually matters.

We ran both models head to head on SWE-Bench Verified, the Aider polyglot suite, and a private corpus of real pull requests from Contra Collective client engagements. Here is the data, the failure modes, and the decision matrix.

Headline Numbers

| Benchmark | Qwen 3 Coder 480B | Claude Opus 4.8 | Delta |
| --- | --- | --- | --- |
| SWE-Bench Verified (resolved %) | 68.4% | 74.6% | +6.2 pp Opus |
| Aider polyglot (pass@2) | 71.2% | 79.8% | +8.6 pp Opus |
| HumanEval (pass@1) | 91.3% | 93.1% | +1.8 pp Opus |
| LiveCodeBench (Apr-May 2026) | 64.7% | 70.2% | +5.5 pp Opus |
| Input cost (per 1M tokens) | $0.50 (self-hosted estimate) | $15.00 | 30x Qwen |
| Output cost (per 1M tokens) | $1.50 (self-hosted estimate) | $75.00 | 50x Qwen |
| Context window | 256K | 1M | 4x Opus |
| Latency p50 (medium request) | 1.8s (vLLM, 8xH100) | 2.4s (Anthropic API) | 25% faster Qwen |
| Open weights | Yes (Apache 2.0) | No | n/a |

The benchmark gap is 6 to 9 percentage points on agentic coding tasks, which is real but not catastrophic. The cost gap is 30 to 50x. That ratio is the entire conversation.
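The deltas and ratios in the table can be re-derived from the raw figures. A minimal sketch follows; the dictionary layout and helper names are illustrative assumptions, not part of our harness.

```python
# Headline figures copied from the table above. The data layout and the
# helper names are illustrative assumptions, not part of any harness.
SCORES = {
    # benchmark: (Qwen 3 Coder 480B, Claude Opus 4.8), in percent
    "SWE-Bench Verified": (68.4, 74.6),
    "Aider polyglot": (71.2, 79.8),
    "HumanEval": (91.3, 93.1),
    "LiveCodeBench": (64.7, 70.2),
}
PRICES = {
    # token kind: (Qwen self-hosted estimate, Opus), in USD per 1M tokens
    "input": (0.50, 15.00),
    "output": (1.50, 75.00),
}


def point_gap(qwen: float, opus: float) -> float:
    """Opus lead over Qwen in percentage points, rounded to one decimal."""
    return round(opus - qwen, 1)


def price_ratio(qwen: float, opus: float) -> float:
    """How many times more Opus costs than Qwen for the same token count."""
    return opus / qwen


for name, (qwen, opus) in SCORES.items():
    print(f"{name}: Opus +{point_gap(qwen, opus)} pp")
for kind, (qwen, opus) in PRICES.items():
    print(f"{kind} tokens: Opus costs {price_ratio(qwen, opus):.0f}x Qwen")
```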

SWE-Bench Verified, Honestly Measured

SWE-Bench Verified is the canonical test for agentic coding. We ran both models through the official harness with identical scaffolding: SWE-agent 1.4 with the standard tool budget, no retrieval augmentation beyond the repo, and a 30 turn limit per task.

Qwen 3 Coder 480B resolved 342 of 500 verified tasks, or 68.4 percent. Claude Opus 4.8 resolved 373, or 74.6 percent. The 31 task gap clusters in three categories:

  1. Multi file refactors where the fix requires touching three or more files with consistent symbol changes. Opus resolves these about 18 percentage points more often. Qwen tends to fix the local symptom and miss the call sites elsewhere.
  2. Test inference where the failing test gives indirect signal about the bug location. Opus is meaningfully better at reading test output and locating the cause in unrelated code.
  3. Long horizon tasks that need 20+ turns of tool use. Qwen's degradation in deep agent loops is real and shows up in the resolution curve past turn 15.

For tasks under 10 turns, the gap closes to under 3 percentage points. For tasks under 5 turns (the bulk of production coding agent usage), the gap closes to within noise.
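As a rough yardstick for what "within noise" means on a 500 task benchmark, the sketch below puts an approximate 95 percent interval around each resolve rate. It assumes tasks are independent and uses the normal approximation to the binomial. It also ignores that both models ran on the same tasks, so treat it as a sanity check rather than a significance test. The function name is hypothetical.

```python
import math


def resolve_rate_interval(resolved: int, total: int, z: float = 1.96):
    """Return (rate, low, high) using the normal approximation to a binomial.

    Assumption: each task is an independent trial. z=1.96 gives roughly a
    95 percent interval.
    """
    rate = resolved / total
    standard_error = math.sqrt(rate * (1 - rate) / total)
    return rate, rate - z * standard_error, rate + z * standard_error


# Resolved counts from the SWE-Bench Verified run described above.
for name, resolved in [("Qwen 3 Coder 480B", 342), ("Claude Opus 4.8", 373)]:
    rate, low, high = resolve_rate_interval(resolved, 500)
    print(f"{name}: {rate:.1%} (approx. 95% interval {low:.1%} to {high:.1%})")
```

With only 500 tasks, each interval spans several percentage points, which is why small gaps on short tasks are hard to distinguish from run to run variation.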

Aider Polyglot Says Something Different

Aider polyglot tests pass@2 across six languages with a fixed editing interface. It is closer to the workflow most teams use coding LLMs for: surgical edits to a known set of files.

Opus 4.8 hits 79.8 percent. Qwen 3 Coder hits 71.2 percent. The 8.6 point gap is consistent across languages with one exception: Rust, where Qwen actually beats Opus by 2.4 percentage points. Both models struggle with the same C++ template metaprogramming tasks. Both excel at Python and TypeScript.

The language specific data matters for production deployment. If your codebase is primarily Rust, Qwen is the better choice on both cost and quality. If it is Python, Opus's edge is real and worth paying for on critical paths.

The PR Replay Corpus

Benchmark numbers are interesting. Replaying real production work is informative.

We pulled 200 closed pull requests from Contra Collective client engagements (Shopify Plus integrations, NetSuite middleware, custom Rails apps) and ran both models through the task of generating the PR from the ticket description plus repo access. Human reviewers scored each output on three dimensions: technical correctness, code style match, and whether the PR would actually get merged.

| Metric | Qwen 3 Coder | Claude Opus 4.8 |
| --- | --- | --- |
| Technical correctness (1-5) | 3.8 | 4.4 |
| Code style match (1-5) | 3.4 | 4.1 |
| Would-merge rate | 41% | 62% |
| First time green CI | 38% | 58% |
| Tokens per PR (avg) | 18,400 | 14,200 |
| Cost per PR | $0.018 | $1.27 |

Opus wins meaningfully on every quality dimension. The style match gap is interesting because both models had identical access to repo conventions; Opus reads the existing code more carefully and matches its patterns more reliably.

But Opus costs 70x more per PR. At a 41 percent would-merge rate, Qwen generates a usable PR for roughly $0.044 of compute. At 62 percent, Opus generates one for $2.05. If your engineering review process is solid and you treat the LLM as a first draft generator, Qwen's economics are unbeatable. If you are trying to ship without human review, Opus's higher quality is worth the cost.
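The cost per usable PR is simply the cost per attempt divided by the would-merge rate. A minimal sketch using the figures from the table, with a hypothetical function name:

```python
def cost_per_merged_pr(cost_per_pr: float, would_merge_rate: float) -> float:
    """Expected model spend per PR that a reviewer would merge."""
    if not 0 < would_merge_rate <= 1:
        raise ValueError("would_merge_rate must be in (0, 1]")
    return cost_per_pr / would_merge_rate


qwen = cost_per_merged_pr(0.018, 0.41)
opus = cost_per_merged_pr(1.27, 0.62)
print(f"Qwen: ${qwen:.3f} per merged PR")  # Qwen: $0.044 per merged PR
print(f"Opus: ${opus:.2f} per merged PR")  # Opus: $2.05 per merged PR
```

This counts model spend only. Reviewer time spent on drafts that do not merge is not included, and it grows as the would-merge rate falls.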

Production Deployment Reality

The cost numbers assume self-hosted Qwen on your own GPUs. That is a real operational commitment.

Qwen 3 Coder 480B at FP8 needs roughly 480 GB of GPU memory for the weights alone. The realistic single node deployment is 8x H100 80GB or 8x H200, running under vLLM 0.7 with tensor parallelism across all eight GPUs. That is $15 to $30 per hour of compute, depending on cloud, plus the engineering time to run it. Below roughly 50 million tokens per day, hitting an inference provider like Together AI or Fireworks (both serve Qwen 3 Coder at $0.40 to $0.60 per million tokens) is cheaper than self hosting.

Above 50 million tokens per day, self hosting pays for itself within months. Above 200 million tokens per day, the savings versus Opus are seven figure annual numbers.
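The sizing and break-even arithmetic can be sketched as follows. The assumptions are loud ones: the node runs 24 hours a day and is fully attributed to this workload, the API side is a single blended price per million tokens, and engineering time is ignored. All function names are hypothetical.

```python
def weights_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone (no KV cache or activations)."""
    return params_billions * bytes_per_param


def self_hosted_daily_cost(node_usd_per_hour: float, hours_per_day: float = 24.0) -> float:
    """Daily cost of one inference node. Assumes it is billed around the clock."""
    return node_usd_per_hour * hours_per_day


def break_even_millions_per_day(node_usd_per_hour: float, api_usd_per_million: float) -> float:
    """Daily volume, in millions of tokens, at which an always-on node matches API spend."""
    return self_hosted_daily_cost(node_usd_per_hour) / api_usd_per_million


# FP8 stores one byte per parameter, so 480B parameters is about 480 GB.
print(f"weights: {weights_memory_gb(480, 1.0):.0f} GB, node HBM: {8 * 80} GB")

# Break-even against the Opus input price of $15 per 1M tokens, at both
# ends of the $15 to $30 per hour node range quoted above.
for hourly in (15.0, 30.0):
    volume = break_even_millions_per_day(hourly, api_usd_per_million=15.00)
    print(f"${hourly:.0f}/hour node: break-even near {volume:.0f}M tokens/day")
```

The crossover moves substantially with the API price you compare against (a hosted Qwen endpoint versus Opus) and with how busy the node actually is, so it is worth rerunning with your own blended rate.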

Opus has zero operational overhead. The API just works. There is no GPU procurement, no vLLM tuning, no on call rotation for inference failures. For teams without infrastructure capacity, that is worth a lot.

Where Each Model Wins

Use Qwen 3 Coder when:

  • You process more than 50M tokens per day of coding traffic and have infrastructure to run it
  • Cost per request matters more than 8 percentage points of quality on agentic tasks
  • Your code is primarily Python, TypeScript, or Rust
  • You need data residency on premise (open weights, no API calls)
  • You are using the LLM as a first draft generator with human review
  • Your tasks complete in under 10 agent turns

Use Claude Opus 4.8 when:

  • Your agent tasks regularly run 15+ turns
  • Multi file refactor quality matters (your codebase is large and interconnected)
  • You cannot operationally support a 480B model deployment
  • You ship LLM output without human review on critical paths
  • Your codebase requires the longest context window (Opus 1M vs Qwen 256K)
  • Latency per request is less important than quality per request

Use both:

  • Most production deployments end up here. Qwen as the default for high volume coding traffic, Opus as the escalation path for the cases Qwen fails on. Routing on task complexity (estimated turn count, file count, language) saves 60 to 80 percent of cost versus Opus only while preserving most of the quality.
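A routing policy along these lines can start as a few explicit rules. The sketch below is illustrative only: the `Task` fields, the thresholds, and the model labels are assumptions loosely drawn from the findings above, not a description of a production router.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical routing features, estimated before the agent runs."""
    estimated_turns: int
    files_touched: int
    language: str
    human_review: bool = True


# Illustrative thresholds: long agent loops and refactors touching three or
# more files are where the gap was widest in the results above.
ESCALATE_TURNS = 15
ESCALATE_FILES = 3
# Languages where the cheaper model held its own on surgical edits.
QWEN_PREFERRED_LANGUAGES = {"rust"}


def choose_model(task: Task) -> str:
    """Return "qwen" for the default path or "opus" for the escalation path."""
    if not task.human_review:
        return "opus"  # unreviewed output goes to the higher quality model
    if task.estimated_turns >= ESCALATE_TURNS:
        return "opus"  # long horizon agent loops
    if task.language.lower() in QWEN_PREFERRED_LANGUAGES:
        return "qwen"
    if task.files_touched >= ESCALATE_FILES:
        return "opus"  # multi file refactors
    return "qwen"


print(choose_model(Task(estimated_turns=4, files_touched=1, language="python")))
print(choose_model(Task(estimated_turns=20, files_touched=2, language="python")))
```

Keeping the rules explicit makes it easy to log which rule fired and to tune the thresholds against your own escalation data before replacing them with a learned classifier.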

What Changed Since Qwen 3.6

Qwen 3 Coder is a different architecture from the general purpose Qwen 3.6 27B we benchmarked earlier this month. The Coder variant is a 480B sparse mixture of experts model with 35B active parameters per token, trained on a different corpus with much heavier weight on real code repositories, PR diffs, and CI signal. The result is that on coding workloads it is closer to Opus than to its same generation chat model. On general reasoning, Qwen 3.6 27B is still the right open weights choice for most production agent work outside of coding.

When This Matters for Your Stack

The cost arbitrage between open weights and frontier closed models is real and growing. For coding workloads specifically, the gap between Qwen 3 Coder and Claude Opus 4.8 is small enough that most teams should be running both and routing intelligently. The teams who continue paying full Opus rates for every line of agent generated code are leaving meaningful margin on the table.

At Contra Collective we help engineering teams architect production AI coding pipelines, including model routing strategies, self-hosted vLLM deployments, and the evaluation infrastructure to know when to escalate from open weights to closed source. If you are running coding agents at volume and the LLM bill is becoming a problem, the routing math is usually worth the engineering work.

FAQ

What is the cheapest way to access Qwen 3 Coder if I do not want to self host?

Together AI and Fireworks both serve Qwen 3 Coder at roughly $0.40 input and $1.50 output per million tokens. That is 30 to 50x cheaper than Opus and adds about 200ms of network latency versus self-hosted.

Does Qwen 3 Coder support tool calling as well as Claude Opus 4.8?

It supports the standard function calling format and works with SWE-agent, OpenHands, and most agentic scaffolds. Empirically, tool call accuracy is about 4 percentage points behind Opus on multi step tool sequences and roughly equal on single tool calls.

Can I fine tune Qwen 3 Coder on my codebase?

Yes, the Apache 2.0 license permits fine tuning and commercial use. LoRA fine tuning on 8x H100 takes 6 to 12 hours for typical adapter ranks. Full fine tuning of the base model needs roughly 80 GPU days and is rarely the right call when LoRA suffices.

Which one does Claude Code use under the hood?

Claude Code routes to Claude Opus 4.8 and Claude Sonnet 4.6 depending on task. There is no open weights routing in Claude Code today. If you want Qwen 3 Coder behind a similar agent harness, look at OpenHands, Cline, or Aider, all of which support arbitrary OpenAI compatible endpoints.

How does this compare to GPT-5.5 on coding?

GPT-5.5 lands between Qwen 3 Coder and Opus 4.8 on SWE-Bench Verified (71.8 percent in our testing) at a price point closer to Opus. We covered the head to head in our GPT-5.5 vs Claude Opus 4.8 comparison.
