SWE-bench & software engineering
Compare SWE-bench Verified, SWE-Pro, SWE-Atlas, and Terminal-Bench scores for coding-focused model selection.
Compare GPT-5.5, Claude Opus, Gemini, Composer, DeepSeek, and other frontier LLMs by SWE-bench, GPQA, MATH-500, HLE, Terminal-Bench, coding agent index, token pricing, and AskClash RWT scores. One benchmark table covers GPT vs Claude vs Gemini, open-weight models, and the newest frontier releases.
These ranked rows are cached from the live AskClash leaderboard. Open any model for SWE-bench breakdowns, benchmark cells, pricing, and comparison links.
| # | Model | Creator | Overall | SWE | Price in/out |
|---|---|---|---|---|---|
| 1 | Claude Mythos/Fable 5 | Anthropic | 96.7 | 95.5 | $10.0 / $50.0 |
| 2 | Claude Opus 4.8 (Adaptive) | Anthropic | 90.7 | 88.6 | $5.00 / $25.0 |
| 3 | GPT-5.5 xHigh | OpenAI | 87.3 | — | $5.00 / $30.0 |
| 4 | Claude Opus 4.7 (Adaptive) | Anthropic | 82.4 | 87.6 | $5.00 / $25.0 |
| 5 | GPT-5.4 xHigh | OpenAI | 73.8 | — | $2.50 / $15.0 |
| 6 | Gemini 3.5 Flash High | Google | 69.7 | — | $1.50 / $9.00 |
| 7 | Kimi K2.7 Code | Moonshot AI | 69.0 | 80.2 | $0.95 / $4.00 |
| 8 | Gemini 3.1 Pro | Google | 66.9 | 80.6 | $2.00 / $12.0 |
| 9 | Claude Opus 4.6 (Adaptive) | Anthropic | 66.8 | 80.8 | $5.00 / $25.0 |
| 10 | Kimi K2.6 Thinking | Moonshot AI | 63.2 | 80.2 | $0.95 / $4.00 |
| 11 | Qwen3.7 Max | Alibaba | 58.4 | 80.4 | $2.50 / $7.50 |
| 12 | Qwen3.7 Plus | Alibaba | 58.4 | 77.7 | $0.40 / $1.60 |
| 13 | Composer 2.5 | Cursor | 57.9 | — | $0.50 / $2.50 |
| 14 | MiniMax-M3 | MiniMax | 52.4 | 80.5 | $0.30 / $1.20 |
| 15 | DeepSeek V4 Pro (Max) | DeepSeek | 45.6 | 80.6 | $1.74 / $3.48 |
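To make the "Price in/out" column easier to act on, the sketch below estimates the cost of a single workload for a few rows of the table. It assumes the prices are US dollars per one million input and output tokens, which this page does not state explicitly; the `estimate_cost` helper and the token counts are hypothetical and only illustrate the arithmetic.

```python
# Prices copied from the table above as (input, output).
# ASSUMPTION: the unit is USD per 1,000,000 tokens; the page does not say so.
PRICES = {
    "Claude Opus 4.8 (Adaptive)": (5.00, 25.0),
    "Kimi K2.7 Code": (0.95, 4.00),
    "MiniMax-M3": (0.30, 1.20),
}

TOKENS_PER_PRICE_UNIT = 1_000_000


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one workload for the given model."""
    price_in, price_out = PRICES[model]
    total = input_tokens * price_in + output_tokens * price_out
    return total / TOKENS_PER_PRICE_UNIT


if __name__ == "__main__":
    # Hypothetical coding-agent run: 200k tokens read, 20k tokens written.
    for name in PRICES:
        cost = estimate_cost(name, 200_000, 20_000)
        print(f"{name}: ${cost:.4f}")
```

Because output tokens are priced several times higher than input tokens in every row, workloads that generate long patches may shift the cost ranking more than the input price alone suggests.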
This page targets common LLM comparison queries: best coding LLM, SWE-bench leaderboard, GPT-5.5 vs Claude, Gemini benchmark scores, LLM API pricing comparison, and frontier model rankings.
Track GPQA, MATH-500, HLE, ARC-AGI-2, Tau2, and multimodal benchmarks like MMMU-Pro when providers publish them.
Real World Testing adds a hands-on quality signal on top of public benchmark tables so rankings reflect practical use, not only vendor cards.
The leaderboard covers frontier proprietary models, adaptive variants, and major open-weight releases from OpenAI, Anthropic, Google, Meta, DeepSeek, Alibaba, Moonshot, Z.ai, and Cursor when benchmark data is available.
AskClash combines multiple public benchmark sources into one weighted table with pricing, context, access path, and AskClash RWT instead of showing a single vendor index alone.
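One way to picture a weighted combination of benchmark sources is a weighted mean that skips benchmarks a provider has not published. The sketch below is only an illustration of that general idea: the weights, the benchmark keys, the example scores, and the renormalization over available benchmarks are all assumptions, not AskClash's actual formula.

```python
from typing import Optional


def weighted_score(
    scores: dict[str, Optional[float]],
    weights: dict[str, float],
) -> Optional[float]:
    """Weighted mean over benchmarks that have a score.

    A value of None marks an unpublished benchmark (shown as a dash in a
    table). Weights are renormalized over the benchmarks that are present,
    which is an assumed policy, not a documented one.
    """
    available = {
        name: value
        for name, value in scores.items()
        if value is not None and name in weights
    }
    total_weight = sum(weights[name] for name in available)
    if total_weight == 0:
        return None  # nothing to aggregate
    weighted_sum = sum(weights[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Hypothetical weights and scores, for illustration only.
WEIGHTS = {"swe_bench": 0.5, "gpqa": 0.3, "rwt": 0.2}
EXAMPLE = {"swe_bench": 80.0, "gpqa": 90.0, "rwt": None}

if __name__ == "__main__":
    print(weighted_score(EXAMPLE, WEIGHTS))
```

Renormalizing keeps a model with a missing benchmark on the same scale as the others, but it also means such a model is judged on fewer signals, so a real leaderboard might instead penalize or flag missing cells.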
Benchmark snapshots refresh from cached collector data. Reload the live leaderboard for the newest model rows and scores.
The interactive leaderboard loads in the browser. This crawlable page gives search engines stable copy for LLM leaderboard, benchmark comparison, and model ranking intent.