Updated 2026-04-15

AI Coding Model Leaderboard

Every coding model that matters, ranked by real benchmarks — not lab press releases. SWE-bench Verified is the primary signal; Aider Polyglot is the secondary. Pricing, context, and speed are straight from each provider.
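
The ranking rule is mechanical: sort by SWE-bench Verified first, Aider Polyglot second. A minimal sketch of that rule (scores transcribed from the cards below, list truncated to three rows for brevity):

```python
# Sort by SWE-bench Verified (primary), then Aider Polyglot (secondary).
# Rows are (name, swe_bench_pct, aider_pct), taken from the leaderboard below.
rows = [
    ("GPT-5.2", 80.0, 87.5),
    ("Claude Opus 4.6", 80.8, 77.5),
    ("Claude Sonnet 4.6", 79.6, 76.8),
]
ranked = sorted(rows, key=lambda r: (r[1], r[2]), reverse=True)
print([name for name, *_ in ranked])
# ['Claude Opus 4.6', 'GPT-5.2', 'Claude Sonnet 4.6']
```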

Best overall · Top SWE-bench Verified
Claude Opus 4.6 (Anthropic)
80.8% SWE-bench · 77.5% Aider

Best value · 70%+ SWE-bench at the lowest output price
Gemini 3 Flash (high reasoning) (Google)
75.8% SWE-bench · 71.0% Aider

Top on Aider · Polyglot edit leader
GPT-5 (high reasoning) (OpenAI)
74.9% SWE-bench · 88.0% Aider
#1 · Anthropic · Claude Opus 4.6 (Reasoning)
SWE-bench 80.8% · Aider 77.5% · Context 1M · Speed 32 tok/s · $15 in / $75 out per 1M tokens

Highest SWE-bench Verified score (80.8%). Expensive, but it shines on multi-file refactors.

Tools: Claude Code, Cursor, Windsurf, Replit, Zed

#2 · OpenAI · GPT-5.2 (Reasoning)
SWE-bench 80.0% · Aider 87.5% · Context 400K · Speed 55 tok/s · $2.50 in / $10 out per 1M tokens

OpenAI's current flagship: 0.8 pt behind Opus 4.6 on SWE-bench, well ahead of it on Aider.

Tools: Cursor, Windsurf, GitHub Copilot, Bolt.new, v0

#3 · Anthropic · Claude Sonnet 4.6 (Reasoning)
SWE-bench 79.6% · Aider 76.8% · Context 1M · Speed 75 tok/s · $3 in / $15 out per 1M tokens

Best price-to-performance in the top tier: 1.2 pt behind Opus at one-fifth the cost.

Tools: Claude Code, Cursor, Windsurf, Cline, Zed

#4 · Google · Gemini 3 Pro (high thinking) (Reasoning)
SWE-bench 78.4% · Aider 82.5% · Context 2M · Speed 70 tok/s · $1.50 in / $10 out per 1M tokens

Just-released flagship: 78.4% on SWE-bench at less than Sonnet's price. Worth trying.

Tools: Cursor, Windsurf, Cline, Aider

#5 · Anthropic · Claude Opus 4.5 (Reasoning)
SWE-bench 76.8% · Aider 72.0% · Context 200K · Speed 30 tok/s · $15 in / $75 out per 1M tokens

Superseded by 4.6. Pick it only if a tool does not yet support 4.6.

Tools: Claude Code, Cursor

#6 · Google · Gemini 3 Flash (high reasoning) (Reasoning)
SWE-bench 75.8% · Aider 71.0% · Context 2M · Speed 180 tok/s · $0.30 in / $2.50 out per 1M tokens

Best speed-to-quality ratio, and the 2M context lets it hold a whole repo in a single request.

Tools: Cursor, Windsurf, Cline, Aider

#7 · OpenAI · GPT-5 (high reasoning) (Reasoning)
SWE-bench 74.9% · Aider 88.0% · Context 400K · Speed 48 tok/s · $2.50 in / $10 out per 1M tokens

Superseded by GPT-5.2 in Feb 2026, but still the Aider Polyglot leader at 88.0%.

Tools: Cursor, Windsurf, GitHub Copilot, Bolt.new, v0

#8 · OpenAI · O3-Pro (high) (Reasoning)
SWE-bench 73.5% · Aider 84.9% · Context 200K · Speed 18 tok/s · $20 in / $80 out per 1M tokens

Strong on Aider Polyglot, but roughly $146 per benchmark run. Its niche is hard reasoning tasks.

Tools: Cursor, Claude Code, Aider

#9 · OpenAI · GPT-5 (medium reasoning) (Reasoning)
SWE-bench 72.1% · Aider 86.7% · Context 400K · Speed 62 tok/s · $2.50 in / $10 out per 1M tokens

The daily-driver setting for most people, and the default in Cursor Plus.

Tools: Cursor, Windsurf, GitHub Copilot, Bolt.new, v0

#10 · Google · Gemini 2.5 Pro (32k thinking) (Reasoning)
SWE-bench 71.0% · Aider 83.1% · Context 2M · Speed 90 tok/s · $1.30 in / $10 out per 1M tokens

Older, but still near the top on Aider and cheap enough for continuous IDE use.

Tools: Cursor, Windsurf, Cline, Aider, Bolt.new

#11 · OpenAI · O3 (high) (Reasoning)
SWE-bench 69.0% · Aider 81.3% · Context 200K · Speed 28 tok/s · $2 in / $8 out per 1M tokens

Big value since OpenAI cut its price by 80% in Sept 2025.

Tools: Cursor, Aider, Cline

#12 · xAI · Grok-4 (high) (Reasoning)
SWE-bench 68.5% · Aider 79.6% · Context 256K · Speed 52 tok/s · $3 in / $15 out per 1M tokens

Better at coding than its reputation suggests, though the provider has occasional outages.

Tools: Cursor, Windsurf

#13 · DeepSeek · DeepSeek V3.2 (Reasoner) (Open, Reasoning)
SWE-bench 65.4% · Aider 74.2% · Context 128K · Speed 45 tok/s · $0.14 in / $0.55 out per 1M tokens

Open weights and roughly $1.30 per Aider benchmark run: the cheapest path to 74% polyglot.

Tools: Cline, Aider, Cursor, Windsurf

#14 · Anthropic · Claude Haiku 4.5
SWE-bench 63.2% · Aider 61.5% · Context 200K · Speed 110 tok/s · $1 in / $5 out per 1M tokens

The fastest Claude. Use it for autocomplete and small edits, not hard refactors.

Tools: Claude Code, Cursor, Zed

#15 · Alibaba · Qwen 3 Coder 480B (Open)
SWE-bench 59.8% · Aider 58.3% · Context 262K · Speed 38 tok/s · $0.40 in / $1.60 out per 1M tokens

The best open-weights coding model right now, and solid for self-hosting.

Tools: Cline, Aider
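
The "Best value" pick at the top follows mechanically from this list: the cheapest output price among models clearing 70% on SWE-bench. A minimal sketch, with names and numbers transcribed from the cards above:

```python
# (name, swe_bench_pct, usd_per_1m_output_tokens), from the cards above.
models = [
    ("Claude Opus 4.6", 80.8, 75.00),
    ("GPT-5.2", 80.0, 10.00),
    ("Claude Sonnet 4.6", 79.6, 15.00),
    ("Gemini 3 Pro (high thinking)", 78.4, 10.00),
    ("Claude Opus 4.5", 76.8, 75.00),
    ("Gemini 3 Flash (high reasoning)", 75.8, 2.50),
    ("GPT-5 (high reasoning)", 74.9, 10.00),
    ("O3-Pro (high)", 73.5, 80.00),
    ("GPT-5 (medium reasoning)", 72.1, 10.00),
    ("Gemini 2.5 Pro (32k thinking)", 71.0, 10.00),
]

# "Best value" = lowest output price among models at 70%+ SWE-bench.
viable = [m for m in models if m[1] >= 70.0]
best_value = min(viable, key=lambda m: m[2])
print(best_value[0])  # Gemini 3 Flash (high reasoning)
```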

What the columns mean

SWE-bench Verified
500 real GitHub issues the model has to solve end-to-end. 70%+ is production-viable.
Aider Polyglot
225 Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust. It measures edit accuracy, not pattern-matching.
Context
Max tokens the model reads per request. Bigger = can hold more of your codebase at once.
Speed (TPS)
Output tokens per second. Reasoning models look slower because they spend output tokens on internal thinking.
Tools
Products that actually expose this model. Changes often — missing names probably means the tool hasn't shipped it yet.
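
The pricing and speed columns combine into a rough per-request estimate. A sketch, where the token counts in the example are illustrative assumptions, not measured values:

```python
def estimate(prompt_tokens, output_tokens, usd_in_per_1m, usd_out_per_1m, tps):
    """Rough cost (USD) and generation time (s) for one request."""
    cost = (prompt_tokens * usd_in_per_1m + output_tokens * usd_out_per_1m) / 1e6
    seconds = output_tokens / tps  # ignores queueing and time-to-first-token
    return cost, seconds

# Example: a 30K-token prompt and a 4K-token reply on Claude Opus 4.6
# ($15 in / $75 out per 1M tokens, ~32 tok/s from the card above).
cost, secs = estimate(30_000, 4_000, 15, 75, 32)
print(f"${cost:.2f}, {secs:.0f}s")  # $0.75, 125s
```

The same arithmetic explains why a slow, expensive model like O3-Pro gets pricey over a full benchmark run: hundreds of requests, each billed at $20/$80 per 1M tokens.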

How we update this

Every Monday we re-read the public SWE-bench and Aider leaderboards and sync any new rows. When a lab ships a flagship model, we bump it the same week. If a score is "—", that benchmark hasn't tested the model yet; we'd rather say that than fill the cell with a guess.

Picked a model? Now pick the tool.

Most of these models are available across several tools. The tool shapes your workflow as much as the model shapes your output.
