Prior-generation GLM-5 flagship, still served on the API and listed on z.ai/model-api. Source: docs.z.ai/guides/llm/glm-5.1, which gives the model id verbatim as "model": "glm-5.1" and opens its overview with: 'GLM-5.1 is Z.AI's latest flagship model, designed for long-horizon tasks. It can work continuously and autonomously on a single tas
Every value carries a primary source and a verification date.
Sourced evaluation scores, each verified against its primary source.
SWE-Bench Pro
| SWE-Bench Pro | 58.4 | 55.1 | 56.6 | 56.2 | - | 53.8 | 57.3 | 54.2 | 57.7 |
GPQA Diamond
| GPQA-Diamond | 86.2 | 86.0 | 90.4 | 87.0 | 82.4 | 87.6 | 91.3 | 94.3 | 92.0 |
AIME 2026
| AIME 2026 | 95.3 | 95.4 | 95.1 | 89.8 | 95.1 | 94.5 | 95.6 | 98.2 | 98.7 |
HMMT Feb 2026
| HMMT Feb. 2026 | 82.6 | 82.8 | 87.8 | 72.7 | 79.9 | 81.3 | 84.3 | 87.3 | 91.8 |
HMMT Nov 2025
| HMMT Nov. 2025 | 94.0 | 96.9 | 94.6 | 81.0 | 90.2 | 91.1 | 96.3 | 94.8 | 95.8 |
HLE
| HLE | 31.0 | 30.5 | 28.8 | 28.0 | 25.1 | 31.5 | 36.7 | 45.0 | 39.8 |
HLE w/ Tools
| HLE w/ Tools | 52.3 | 50.4 | 50.6 | - | 40.8 | 51.8 | 53.1* | 51.4* | 52.1* |
IMOAnswerBench
| IMOAnswerBench | 83.8 | 82.5 | 83.8 | 66.3 | 78.3 | 81.8 | 75.3 | 81.0 | 91.4 |
NL2Repo
| NL2Repo | 42.7 | 35.9 | 37.9 | 39.8 | - | 49.8 | 33.4 | 41.3 |
Terminal-Bench 2.0 (Terminus-2)
| Terminal-Bench 2.0 Terminus-2 | 63.5 | 56.2 | 61.6 | - | 39.3 | 50.8 | 65.4 | 68.5 | - |
CyberGym
| CyberGym | 68.7 | 48.3 | - | - | 17.3 | 41.3 | 66.6 | 38.8 | 66.3 |
BrowseComp
| BrowseComp | 68.0 | 62.0 | - | - | 51.4 | 60.6 | - | - | - |
BrowseComp w/ Context Manage
| BrowseComp w/ Context Manage | 79.3 | 75.9 | - | - | 67.6 | 74.9 | 84.0 | 85.9 | 82.7 |
MCP-Atlas (Public Set)
| MCP-Atlas Public Set | 71.8 | 69.2 | 74.1 | 48.8 | 62.2 | 63.8 | 73.8 | 69.2 | 67.2 |
Tool-Decathlon
| Tool-Decathlon | 40.7 | 38.0 | 39.8 | 46.3 | 35.2 | 27.8 | 47.2 | 48.8 | 54.6 |
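The rows above are pipe-delimited, with `-` where no score is reported and a trailing `*` on some values. As a minimal sketch, one such row could be read into numbers as shown below. The function name `parse_score_row` is hypothetical. The column order is not stated in this section, so the values are kept positional rather than mapped to model names. The meaning of `*` is not given here either, so the sketch simply strips it.

```python
from typing import Optional


def parse_score_row(row: str) -> tuple[str, list[Optional[float]]]:
    """Split one pipe-delimited row into (benchmark label, positional scores).

    Assumptions (not confirmed by the source): "-" means no reported score,
    and a trailing "*" is a footnote marker that can be dropped.
    """
    # Drop the outer pipes, then split the remaining text into trimmed cells.
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    label, raw_scores = cells[0], cells[1:]

    scores: list[Optional[float]] = []
    for cell in raw_scores:
        if cell == "-":
            scores.append(None)  # missing value
        else:
            scores.append(float(cell.rstrip("*")))  # strip footnote marker
    return label, scores


label, scores = parse_score_row(
    "| HLE w/ Tools | 52.3 | 50.4 | 50.6 | - | 40.8 | 51.8 | 53.1* | 51.4* | 52.1* |"
)
print(label, scores)
```

Because the values stay positional, a row with fewer cells than the others (such as NL2Repo above) parses without error but cannot be aligned to columns without the original table header.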