Loading…
Loading…
Current flagship reasoning/coding model with a usable 1M-token context, MIT open-weight (HF zai-org/GLM-5.2, ~753B params). docs.z.ai/guides/llm/glm-5.2 model id verbatim: "model": "glm-5.2" and overview: 'GLM-5.2 is a flagship model built for the era of long-horizon tasks. With truly usable 1M-toke
Every value carries a primary source and a verification date.
Sourced evaluation scores, each verified against its primary source.
GPQA Diamond
| GPQA-Diamond | 91.2 | 86.2 | 90.0 | 93.0 | 90.1 | 93.6 | 93.6 | 94.3 |
AIME 2026
| AIME 2026 | 99.2 | 95.3 | 97.0 | - | 94.6 | 95.7 | 98.3 | 98.2 |
HMMT Nov. 2025
| HMMT Nov. 2025 | 94.4 | 94.0 | 95.0 | 84.4 | 94.4 | 96.5 | 96.5 | 94.8 |
HMMT Feb. 2026
| HMMT Feb. 2026 | 92.5 | 82.6 | 97.1 | 84.4 | 95.2 | 96.7 | 96.7 | 87.3 |
HLE (Humanity's Last Exam)
| HLE | 40.5 | 31.0 | 41.4 | 37.0 | 37.7 | 49.8* | 41.4* | 45.0 |
HLE (w/ Tools)
| HLE w/ Tools | 54.7 | 52.3 | 53.5 | - | 48.2 | 57.9* | 52.2* | 51.4* |
CritPt
| CritPt | 20.9 | 4.6 | 13.4 | 3.7 | 12.9 | 20.9 | 27.1 | 17.7 |
IMOAnswerBench
| IMOAnswerBench | 91.0 | 83.8 | 90.0 | - | 89.8 | 83.5 | - | 81.0 |
SWE-bench Pro
| SWE-bench Pro | 62.1 | 58.4 | 60.6 | 59.0 | 55.4 | 69.2 | 58.6 | 54.2 |
Terminal-Bench 2.1 (Terminus-2)
| Terminal Bench 2.1 Terminus-2 | 81.0 | 63.5 | 75.0 | 65.0 | 64.0 | 85.0 | 84.0 | 74.0 |
FrontierSWE (Dominance as of 26/6/16)
| FrontierSWE Dominance as of 26/6/16 | 74.4 | 30.5 | - | - | 29.0 | 75.1 | 72.6 | 39.6 |
NL2Repo
| NL2Repo | 48.9 | 42.7 | 47.2 | 42.1 | 35.5 | 69.7 | 50.7 | 33.4 |
DeepSWE
| DeepSWE | 46.2 | 18.0 | 18.0 | 20.0 | 8.0 | 58.0 | 70.0 | 10.0 |
ProgramBench
| ProgramBench | 63.7 | 50.9 | - | - | 47.8 | 71.9 | 70.8 | 39.5 |
MCP-Atlas (Public Set)
| MCP-Atlas Public Set | 76.8 | 71.8 | 76.4 | 74.2 | 73.6 | 77.8 | 75.3 | 69.2 |