LLM Leaderboard

LLM leaderboard for SWE-bench, coding agents, benchmarks, and API pricing.

Compare GPT-5.5, Claude Opus, Gemini, Composer, DeepSeek, and other frontier LLMs by SWE-bench, GPQA, MATH-500, HLE, Terminal-Bench, coding agent index, token pricing, and AskClash RWT scores. GPT, Claude, Gemini, open-weight models, and the newest frontier releases all appear in one benchmark table.

#1 Claude Mythos/Fable 5: Current overall leader with a score of 96.7 across public benchmark cells and AskClash RWT.
SWE-bench & coding agents: Track SWE-bench, SWE-Pro, SWE-Atlas, Terminal-Bench, and Artificial Analysis coding agent index scores side by side.
LLM pricing & context: See input/output token pricing, context window, and billing access path for each model row.

Top LLM benchmark rankings

These ranked rows are cached from the live AskClash leaderboard. Open any model for SWE-bench breakdowns, benchmark cells, pricing, and comparison links.

| # | Model | Creator | Overall | SWE | Price in/out |
|---|-------|---------|---------|-----|--------------|
| 1 | Claude Mythos/Fable 5 | Anthropic | 96.7 | 95.5 | $10.0 / $50.0 |
| 2 | Claude Opus 4.8 (Adaptive) | Anthropic | 90.7 | 88.6 | $5.00 / $25.0 |
| 3 | GPT-5.5 xHigh | OpenAI | 87.3 | | $5.00 / $30.0 |
| 4 | Claude Opus 4.7 (Adaptive) | Anthropic | 82.4 | 87.6 | $5.00 / $25.0 |
| 5 | GPT-5.4 xHigh | OpenAI | 73.8 | | $2.50 / $15.0 |
| 6 | Gemini 3.5 Flash High | Google | 69.7 | | $1.50 / $9.00 |
| 7 | Kimi K2.7 Code | Moonshot AI | 69.0 | 80.2 | $0.95 / $4.00 |
| 8 | Gemini 3.1 Pro | Google | 66.9 | 80.6 | $2.00 / $12.0 |
| 9 | Claude Opus 4.6 (Adaptive) | Anthropic | 66.8 | 80.8 | $5.00 / $25.0 |
| 10 | Kimi K2.6 Thinking | Moonshot AI | 63.2 | 80.2 | $0.95 / $4.00 |
| 11 | Qwen3.7 Max | Alibaba | 58.4 | 80.4 | $2.50 / $7.50 |
| 12 | Qwen3.7 Plus | Alibaba | 58.4 | 77.7 | $0.40 / $1.60 |
| 13 | Composer 2.5 | Cursor | 57.9 | | $0.50 / $2.50 |
| 14 | MiniMax-M3 | MiniMax | 52.4 | 80.5 | $0.30 / $1.20 |
| 15 | DeepSeek V4 Pro (Max) | DeepSeek | 45.6 | 80.6 | $1.74 / $3.48 |
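
To make the Price in/out column concrete, the following minimal Python sketch estimates what a single request might cost. It assumes the listed prices are US dollars per one million tokens, which the table itself does not state. The function name `request_cost` and the token counts are illustrative, not part of any AskClash API.

```python
# Assumption: prices are USD per 1,000,000 tokens (the table does not state the unit).
TOKENS_PER_PRICE_UNIT = 1_000_000

# (input price, output price) copied from three rows of the rankings table.
PRICES = {
    "Claude Opus 4.8 (Adaptive)": (5.00, 25.0),
    "Kimi K2.7 Code": (0.95, 4.00),
    "MiniMax-M3": (0.30, 1.20),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request for a model listed in PRICES."""
    price_in, price_out = PRICES[model]
    total = input_tokens * price_in + output_tokens * price_out
    return total / TOKENS_PER_PRICE_UNIT


if __name__ == "__main__":
    # Hypothetical workload: 20,000 prompt tokens and 2,000 completion tokens.
    for name in PRICES:
        print(f"{name}: ${request_cost(name, 20_000, 2_000):.4f}")
```

Because output tokens are priced several times higher than input tokens in every row, workloads that generate long completions shift the comparison more than the input price alone suggests.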

Benchmarks and search topics covered

This page targets common LLM comparison queries: best coding LLM, SWE-bench leaderboard, GPT-5.5 vs Claude, Gemini benchmark scores, LLM API pricing comparison, and frontier model rankings.

SWE-bench & software engineering

Compare SWE-bench Verified, SWE-Pro, SWE-Atlas, and Terminal-Bench scores for coding-focused model selection.
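
One way to use these scores for coding-focused selection is to shortlist models under a price ceiling and rank them by SWE score. The sketch below does this with a few rows copied from the rankings table on this page. The `Row` type, the `shortlist` helper, and the price ceiling are hypothetical, and rows without a SWE score are skipped rather than guessed.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    model: str
    swe: Optional[float]  # None where the table shows no SWE score
    price_out: float      # output-token price as listed in the table


# A few rows copied from the rankings table on this page.
ROWS = [
    Row("Claude Mythos/Fable 5", 95.5, 50.0),
    Row("GPT-5.5 xHigh", None, 30.0),
    Row("Kimi K2.7 Code", 80.2, 4.00),
    Row("Qwen3.7 Plus", 77.7, 1.60),
    Row("MiniMax-M3", 80.5, 1.20),
    Row("DeepSeek V4 Pro (Max)", 80.6, 3.48),
]


def shortlist(rows: List[Row], max_price_out: float) -> List[Row]:
    """Keep rows that have a SWE score and fit the budget, best SWE first."""
    eligible = [
        r for r in rows
        if r.swe is not None and r.price_out <= max_price_out
    ]
    return sorted(eligible, key=lambda r: r.swe, reverse=True)


if __name__ == "__main__":
    for row in shortlist(ROWS, max_price_out=5.0):
        print(f"{row.model}: SWE {row.swe}, output price ${row.price_out:.2f}")
```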

Reasoning & knowledge

Track GPQA, MATH-500, HLE, ARC-AGI-2, Tau2, and multimodal benchmarks like MMMU-Pro when providers publish them.

AskClash RWT

Real World Testing adds a hands-on quality signal on top of public benchmark tables so rankings reflect practical use, not only vendor cards.

Frequently asked questions

Which models are on the leaderboard?

The leaderboard lists frontier proprietary models, adaptive variants, and major open-weight releases from OpenAI, Anthropic, Google, Meta, DeepSeek, Alibaba, Moonshot, Z.ai, and Cursor, provided benchmark data is available.

Is this the same as Artificial Analysis or LMSYS?

No. AskClash combines multiple public benchmark sources into one weighted table with pricing, context, access path, and AskClash RWT, rather than showing a single vendor index alone.
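
The page does not publish the weighting formula, so the following Python sketch is only an illustration of how a weighted table could fold several benchmark cells into one score. The weights, benchmark keys, and example scores are hypothetical, and renormalizing over missing cells is an assumed design choice, not AskClash's documented method.

```python
from typing import Dict, Optional


def weighted_score(cells: Dict[str, Optional[float]],
                   weights: Dict[str, float]) -> float:
    """Weighted mean over the benchmarks that actually have a score.

    Cells that are None, or that have no weight, are skipped, and the
    remaining weights are renormalized so they still sum to 1.
    """
    present = {k: v for k, v in cells.items() if v is not None and k in weights}
    total_weight = sum(weights[k] for k in present)
    if total_weight == 0:
        raise ValueError("no weighted benchmark cells available")
    return sum(weights[k] * v for k, v in present.items()) / total_weight


# Hypothetical weights and scores, not AskClash's actual values.
WEIGHTS = {"swe_bench": 0.4, "gpqa": 0.2, "hle": 0.2, "rwt": 0.2}
EXAMPLE = {"swe_bench": 80.0, "gpqa": 90.0, "hle": None, "rwt": 70.0}

if __name__ == "__main__":
    print(round(weighted_score(EXAMPLE, WEIGHTS), 2))
```

Renormalizing keeps a model from being penalized only because a provider has not published one benchmark, at the cost of making scores with different missing cells less directly comparable.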

How often do rankings update?

Benchmark snapshots refresh from cached collector data. Reload the live leaderboard for the newest model rows and scores.

The interactive leaderboard loads in the browser. This crawlable page gives search engines stable copy for queries about LLM leaderboards, benchmark comparisons, and model rankings.