Text-only instruction-tuned workhorse still served on the Llama API. Confirmed current on llama.developer.meta.com/docs/models: "Model ID: `Llama-3.3-70B-Instruct`" — "A text-only instruction-tuned model with enhanced performance relative to Llama 3.1 70B, and relative to Llama 3.2 90B when used for
Every value carries a primary source and a verification date.
Sourced evaluation scores, each verified against its primary source.
| Benchmark | # Shots | Metric | Llama 3.1 8B Instruct | Llama 3.1 70B Instruct | Llama 3.3 70B Instruct | Llama 3.1 405B Instruct |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU (CoT) | 0 | macro\_avg/acc | 73.0 | 86.0 | 86.0 | 88.6 |
| MMLU Pro (CoT) | 5 | macro\_avg/acc | 48.3 | 66.4 | 68.9 | 73.3 |
| IFEval | | | 80.4 | 87.5 | 92.1 | 88.6 |
| GPQA Diamond (CoT) | 0 | acc | 31.8 | 48.0 | 50.5 | 49.0 |
| HumanEval | 0 | pass@1 | 72.6 | 80.5 | 88.4 | 89.0 |
| MBPP EvalPlus (base) | 0 | pass@1 | 72.8 | 86.0 | 87.6 | 88.6 |
| MATH (CoT) | 0 | sympy\_intersection\_score | 51.9 | 68.0 | 77.0 | 73.8 |
| BFCL v2 | 0 | overall\_ast\_summary/macro\_avg/valid | 65.4 | 77.5 | 77.3 | 81.1 |
| MGSM | 0 | em | 68.9 | 86.9 | 91.1 | 91.6 |
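
The score rows above are regular enough to load programmatically. The following is a minimal sketch, assuming every row follows the pattern of benchmark name, shot count, metric, then one or more numeric scores separated by `|`. `BenchmarkRow` and `parse_row` are hypothetical names chosen for illustration and are not part of any Llama tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkRow:
    benchmark: str
    shots: Optional[int]    # None when the row leaves the shot count blank (e.g. IFEval)
    metric: Optional[str]   # None when the row leaves the metric blank
    scores: List[float]     # score columns, in the order they appear in the row


def parse_row(line: str) -> BenchmarkRow:
    """Parse one pipe-delimited score row (hypothetical helper)."""
    # Tolerate optional leading/trailing pipes, then split into trimmed cells.
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) < 4:
        raise ValueError(f"expected benchmark, shots, metric and scores: {line!r}")
    benchmark, shots, metric, *scores = cells
    return BenchmarkRow(
        benchmark=benchmark,
        shots=int(shots) if shots else None,
        # Undo the markdown escaping of underscores in metric names.
        metric=metric.replace("\\_", "_") or None,
        scores=[float(score) for score in scores],
    )


if __name__ == "__main__":
    for raw in (
        r"MMLU Pro (CoT) | 5 | macro\_avg/acc | 48.3 | 66.4 | 68.9 | 73.3",
        r"IFEval | | | 80.4 | 87.5 | 92.1 | 88.6",
    ):
        print(parse_row(raw))
```

Blank cells are kept as `None` rather than guessed, so a row such as IFEval, which lists no shot count or metric, stays faithful to the source.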