Evaluation Suite
Benchmarks
Independent evaluation scores across knowledge, reasoning, math, coding, and multimodal tasks. The leader in each column is marked "(column leader)".
| # | Model (provider) | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro (Google) | 94.30% (column leader) | 80.60% | — | — | 44.70% | — | 51.40% | 54.20% | 80.50% | 77.10% (column leader) | — |
| 2 | Claude Opus 4.7 (Anthropic) | 94.20% | 87.60% | — | — | — | — | 54.70% | 64.30% | — | — | — |
| 3 | Claude Opus 4.8 (Anthropic) | 93.60% | 88.60% (column leader) | — | — | 45.70% (column leader) | — | 57.90% (column leader) | 69.20% (column leader) | — | — | 74.60% |
| 4 | GPT-5.5 (OpenAI) | 93.60% | — | — | — | 44.30% | — | 52.20% | 58.60% | — | — | 78.20% (column leader) |
| 5 | GPT-5.4 (OpenAI) | 92.80% | — | — | — | — | — | 39.80% | 57.70% | — | — | — |
| 6 | Gemini 3.5 Flash (Google) | 92.20% | — | — | — | 40.20% | — | — | 55.10% | 83.60% (column leader) | 72.10% | — |
| 7 | Gemini 3.5 Pro (Google) | 91.80% | 78.40% | 94.50% | 89.70% (column leader) | 32.60% | 81.30% (column leader) | — | — | 82.10% | — | — |
| 8 | Gemini 3 Flash (Google) | 90.40% | 78% | — | — | 33.70% | — | 43.50% | — | 81.20% | — | — |
| 9 | Grok 4.3 (xAI) | 90.10% | — | — | — | — | — | — | — | — | — | — |
| 10 | Claude Sonnet 4.6 (Anthropic) | 89.90% | 79.60% | — | — | — | — | 49% | — | — | — | — |
| 11 | GPT-5 Pro (OpenAI) | 89.60% | 78.40% | 96.50% (column leader) | 88.20% | — | 81.30% (column leader) | 42.10% | — | — | — | — |
| 12 | Gemini 3 Pro (Google) | 88.90% | 74.20% | 91.20% | 87.30% | 28.40% | 76.80% | — | — | 78.60% | — | — |
| 13 | GPT-5.4 mini (OpenAI) | 88% | — | — | — | — | — | — | — | — | — | — |
| 14 | Gemini 2.5 Pro (Google) | 86.40% | 59.60% | 88% | — | 21.60% | — | — | — | — | — | — |
| 15 | Grok 4.2 (xAI) | 86% | 73% | 93% | 87% | 24% | — | 40% | — | — | 15.50% | — |
| 16 | Claude Opus 4.6 (Anthropic) | 85.40% | 79.80% | 90.50% | 87.90% | 18.60% | — | — | — | — | — | 51.20% |
| 17 | Claude Opus 4.5 (Anthropic) | 83.10% | 76.40% | 87% | 86.20% | 15.90% | — | — | — | — | — | — |
| 18 | Grok 4.1 (xAI) | 83% | 70% | 90% | 85.50% | 22% | — | — | — | — | — | — |
| 19 | Gemini 2.5 Flash (Google) | 82.80% | 60.40% | 72% | — | 11% | — | — | — | — | — | — |
| 20 | | 82.50% | 68% | 88% | 84.50% | 19.50% | 72% | — | — | — | — | — |
| 21 | GPT-5.5 mini (OpenAI) | 82.40% | 72.60% | 91% | 84.50% | — | 74.80% | — | — | — | — | — |
| 22 | Claude Opus 4.1 (Anthropic) | 80.90% | 74.50% | 78% | 83.50% | 11.20% | — | — | — | — | — | — |
| 23 | Claude Sonnet 4.5 (Anthropic) | 80.60% | 73.10% | 84.20% | 84% | 12.80% | — | — | — | — | — | — |
| 24 | Grok 4 (xAI) | 80% | 65% | 86% | 83% | 19% | — | — | — | — | — | — |
| 25 | GPT-5.3 (OpenAI) | 79.10% | 68.90% | 88.40% | 82.70% | — | 70.20% | — | — | — | — | — |
| 26 | Claude Sonnet 4.4 (Anthropic) | 77.90% | 69.50% | 79.60% | 81.70% | — | — | — | — | — | — | — |
| 27 | Grok 4.20 (xAI) | 77.60% | — | — | — | — | — | — | — | — | — | — |
| 28 | GPT-5.2 (OpenAI) | 76.30% | 65.10% | 85% | 80.90% | — | — | — | — | — | — | — |
| 29 | Grok 4 Fast (xAI) | 75% | 58% | 78% | 79.50% | — | 63% | — | — | — | — | — |
| 30 | GPT-5.1 (OpenAI) | 73.50% | 61.70% | 82.60% | 79.10% | — | — | — | — | — | — | — |
| 31 | GPT-5.2 mini (OpenAI) | 71.80% | 58.40% | 80.20% | 77.50% | — | — | — | — | — | — | — |
| 32 | GPT-5 (OpenAI) | 70.20% | 56.80% | 78.50% | 76.40% | — | — | — | — | — | — | — |
| 33 | GPT-5.5 nano (OpenAI) | 68.90% | 54.20% | 76.80% | 74.10% | — | — | — | — | — | — | — |
| 34 | Claude Haiku 4.4 (Anthropic) | 68.40% | 58.20% | 66.10% | 75.30% | — | — | — | — | — | — | — |
| 35 | Grok 3 (xAI) | 68% | — | 52% | 79% | — | 55% | — | — | — | — | — |
| 36 | Gemini 2.5 Flash-Lite (Google) | 64.50% | 41.30% | 68.90% | 76.20% | — | 52.40% | — | — | — | — | — |
| 37 | Gemini 2.0 Pro (Google) | 62.10% | 45.80% | 58.30% | 78.90% | — | 49.60% | — | — | — | — | — |
| 38 | Gemini 2.0 Flash (Google) | 54.70% | 34.20% | 49.10% | 71.60% | — | 40.80% | — | — | — | — | — |
| 39 | Gemini 1.5 Pro (Google) | 46.20% | 22.50% | 31.70% | 68.40% | — | 34.90% | — | — | — | — | — |
| 40 | GPT-5.5 Pro (OpenAI) | — | — | — | — | — | — | 57.20% | — | — | — | — |
| 41 | Claude Haiku 4.5 (Anthropic) | — | 73.30% | — | — | — | — | — | — | — | — | — |
| 42 | | — | 70.80% | — | — | — | — | — | — | — | — | — |
Every score is sourced from a primary publication (provider report or independent eval) and carries a verification date. The leader in each column is marked "(column leader)"; a "—" means no verified result for that suite yet.
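For readers who want to process the table programmatically, the conventions above (percent strings, a "(column leader)" marker, and "—" for no verified result) can be handled with a small parser. This is a minimal sketch, not the site's own tooling: the helper names `parse_score` and `column_leaders` are hypothetical, and the sample values are copied from the first score column of the table.

```python
from typing import Optional


def parse_score(cell: str) -> Optional[float]:
    """Turn a cell like '94.30% (column leader)' into 94.3; '—' or blank gives None."""
    text = cell.replace("(column leader)", "").strip()
    if text in ("", "—"):
        return None  # no verified result for this suite
    return float(text.rstrip("%"))


def column_leaders(scores: dict[str, Optional[float]]) -> list[str]:
    """Return every model tied for the highest verified score, skipping missing results."""
    verified = {model: s for model, s in scores.items() if s is not None}
    if not verified:
        return []
    best = max(verified.values())
    return [model for model, s in verified.items() if s == best]


# Sample cells from the first score column (rows 1, 2, 3, and 40).
first_column = {
    "Gemini 3.1 Pro": parse_score("94.30% (column leader)"),
    "Claude Opus 4.7": parse_score("94.20%"),
    "Claude Opus 4.8": parse_score("93.60%"),
    "GPT-5.5 Pro": parse_score("—"),
}
print(column_leaders(first_column))  # ['Gemini 3.1 Pro']
```

Returning a list rather than a single name keeps ties visible, which matters here because two models share the top score of 81.30% in one column.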