AI Coding Model Leaderboard
Every coding model that matters, ranked by real benchmarks — not lab press releases. SWE-bench Verified is the primary signal; Aider Polyglot is the secondary. Pricing, context, and speed are straight from each provider.
- Best overall: Claude Opus 4.6 from Anthropic, with the top SWE-bench Verified score
- Best value: Gemini 3 Flash (high reasoning), 70%+ at the lowest output price
- Top on Aider: GPT-5 (high reasoning) from OpenAI, the Polyglot edit leader
Notes on each ranked model:

- Highest SWE-bench Verified score (80.8%). Expensive, but shines on multi-file refactors.
- OpenAI's current flagship: 1.2 points behind Opus 4.6 on SWE-bench, ahead on Aider.
- Best price-to-performance in the top tier: 1.2 points behind Opus at 1/5 the cost (rough math in the sketch below).
- Just-released flagship: 78% on SWE-bench at Sonnet-tier pricing. Worth trying.
- Superseded by 4.6. Only pick it if a tool does not yet support 4.6.
- Best speed-to-quality ratio. A 2M-token context lets it hold a whole repo in memory.
- Superseded by 5.2 in Feb 2026, but still the Aider Polyglot leader at 88%.
- Strong on Aider Polyglot, but $146 per benchmark run. Niche pick for hard reasoning tasks.
- Daily-driver setting for most people. Default in Cursor Plus.
- Older, but still top-4 on Aider. Cheap enough for continuous IDE use.
- Big value after OpenAI dropped the price by 80% in September 2025.
- Better at coding than its reputation suggests. Occasional provider outages.
- Open weights, $1.30 per Aider benchmark run. Cheapest path to a 74% Polyglot score.
- Fastest Claude. Use it for autocomplete and small edits, not hard refactors.
- Best open-weights coding model right now. Solid for self-hosting.
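The value calls above reduce to simple arithmetic: output price per token, times tokens spent per task, divided by the fraction of tasks the model actually solves. Here is a minimal sketch of that comparison; every price, score, and token count in it is a placeholder for illustration, not a figure from the leaderboard.

```python
# Rough value comparison: dollars of output spend per solved SWE-bench task.
# Every price, score, and token count here is a PLACEHOLDER for illustration,
# not a figure from the leaderboard.

models = {
    # name: (output $ per 1M tokens, SWE-bench Verified %, avg output tokens per task)
    "flagship": (60.0, 81.0, 30_000),
    "value-tier": (12.0, 79.5, 25_000),
    "budget": (1.5, 70.0, 20_000),
}

def cost_per_solved_task(price_per_mtok: float, score_pct: float, tokens_per_task: int) -> float:
    """Expected output spend divided by the fraction of tasks the model actually solves."""
    cost_per_attempt = price_per_mtok * tokens_per_task / 1_000_000
    return cost_per_attempt / (score_pct / 100)

for name, (price, score, tokens) in models.items():
    print(f"{name:>10}: ${cost_per_solved_task(price, score, tokens):.2f} per solved task")
```

By this measure, a mid-tier model that solves nearly as many tasks at a fraction of the output price wins on cost per solved task even when it loses on raw score.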
What the columns mean
- SWE-bench Verified: 500 real GitHub issues the model has to solve end-to-end. 70%+ is production-viable.
- Aider Polyglot: 225 Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust. Measures edit accuracy, not pattern-matching.
- Context: maximum tokens the model reads per request. Bigger means it can hold more of your codebase at once.
- TPS: output tokens per second. Reasoning models look slower because they burn tokens on internal thinking (see the latency sketch after this list).
- Tools: products that actually expose this model. This changes often; a missing name probably means the tool hasn't shipped it yet.
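One way to see why a reasoning model "looks slower" at a similar TPS: hidden reasoning tokens stream at the same rate as the visible answer, so they add directly to wall-clock time. A minimal sketch with placeholder numbers:

```python
# Wall-clock time for one response: (visible answer tokens + hidden reasoning tokens) / TPS.
# The token counts and TPS value below are placeholders for illustration.

def response_seconds(answer_tokens: int, reasoning_tokens: int, tps: float) -> float:
    """Total generation time; reasoning tokens stream at the same rate as the answer."""
    return (answer_tokens + reasoning_tokens) / tps

# Same answer length, same TPS column -- only the hidden reasoning budget differs.
print(response_seconds(answer_tokens=800, reasoning_tokens=0, tps=120))      # ~6.7 s
print(response_seconds(answer_tokens=800, reasoning_tokens=4_000, tps=120))  # 40.0 s
```

With these placeholder figures, the same 800-token answer takes about 7 seconds without reasoning and 40 seconds with a few thousand thinking tokens, even though the TPS column reads identically.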
How we update this
Every Monday we re-read the public SWE-bench and Aider leaderboards and sync any new rows. When a lab ships a flagship model, we add it the same week. If a score is "—", that benchmark hasn't tested the model yet; we'd rather say that than fill the cell with a guess.
Picked a model? Now pick the tool.
Most of these models are available across several tools. The tool shapes your workflow as much as the model shapes your output.