PiazBench

Reasoning

How well can each AI model solve complex logical and mathematical problems? Models are ranked by their reasoning Arena Elo score.

Rank  Provider   Model               Model ID                                           Score
1     Google     Gemini 3.5 Flash    google/gemini-3.5-flash                            1527.0
2     Anthropic  Claude Opus 4.6     anthropic/claude-opus-4-6-thinking                 1513.0
3     OpenAI     GPT-5.4             openai/gpt-5.4-high                                1509.0
4     Alibaba    Qwen 3.7 Max        alibaba/qwen3.7-max-preview                        1498.0
5     Google     Gemini 3.1 Pro      google/gemini-3.1-pro-preview                      1497.0
6     Anthropic  Claude Opus 4.7     anthropic/claude-opus-4-7-thinking                 1494.0
7     Xiaomi     MiMo V2.5 Pro       xiaomi/mimo-v2.5-pro                               1486.0
8     Baidu      Ernie 5.1           baidu/ernie-5.1                                    1481.0
9     OpenAI     GPT-5.5             openai/gpt-5.5                                     1481.0
10    Alibaba    Qwen 3.6 Max        alibaba/qwen3.6-max-preview                        1479.0
11    Z.ai       GLM 5.1             zai/glm-5.1                                        1477.0
12    Google     Gemini 3 Pro        google/gemini-3-pro                                1476.0
13    Alibaba    Qwen 3.5 Max        alibaba/qwen3.5-max-preview                        1476.0
14    Google     Gemini 3 Flash      google/gemini-3-flash                              1474.0
15    Moonshot   Kimi K2.6           moonshot/kimi-k2.6                                 1472.0
16    Moonshot   Kimi K2.5           moonshot/kimi-k2.5-thinking                        1471.0
17    Google     Gemma 4 26B A4B     google/gemma-4-26b-a4b                             1468.0
18    DeepSeek   DeepSeek V4 Pro     deepseek/deepseek-v4-pro-thinking                  1467.0
19    Google     Gemma 4 31B         google/gemma-4-31b                                 1464.0
20    xAI        Grok 4.20           xai/grok-4.20-beta-0309-reasoning                  1461.0
21    Anthropic  Claude Opus 4.5     anthropic/claude-opus-4-5-20251101                 1461.0
22    Anthropic  Claude Sonnet 4.6   anthropic/claude-sonnet-4-6                        1455.0
23    Meta       Muse Spark          meta-llama/muse-spark                              1451.0
24    Alibaba    Qwen 3.6 Plus       alibaba/qwen3.6-plus                               1451.0
25    Google     Gemini 2.5 Pro      google/gemini-2.5-pro                              1450.0
26    Google     Gemini 3 Flash      google/gemini-3-flash (thinking-minimal)           1449.0
27    Alibaba    Qwen 3 Max          alibaba/qwen3-max-preview                          1449.0
28    Alibaba    Qwen 3.5 397B A17B  alibaba/qwen3.5-397b-a17b                          1448.0
29    Xiaomi     MiMo V2 Pro         xiaomi/mimo-v2-pro                                 1448.0
30    Anthropic  Claude Sonnet 4.5   anthropic/claude-sonnet-4-5-20250929-thinking-32k  1447.0

312 models tested
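
To give a sense of what the score gaps in this ranking mean, here is a minimal sketch that converts an Elo difference into an expected head-to-head win rate. It assumes the conventional logistic Elo model with a 400-point scale; the page does not state how its Arena Elo is computed, so the formula, the `scale` parameter, and the function name `expected_score` are illustrative assumptions, not the site's method. The ratings are copied from the leaderboard.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win rate of A against B under the standard logistic Elo model.

    Assumption: a 400-point scale, as in classic Elo. The leaderboard does not
    document its own formula, so treat the result as a rough guide only.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings taken from the leaderboard (ranks 1, 2 and 30).
gemini_35_flash = 1527.0
claude_opus_46 = 1513.0
claude_sonnet_45 = 1447.0

# A 14-point gap between ranks 1 and 2.
print(f"{expected_score(gemini_35_flash, claude_opus_46):.3f}")    # 0.520

# An 80-point gap between ranks 1 and 30.
print(f"{expected_score(gemini_35_flash, claude_sonnet_45):.3f}")  # 0.613
```

Under these assumptions, the top two models are close to a coin flip against each other, and even the 80-point spread across the listed top 30 corresponds to roughly a 61% expected win rate for the leader.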