How to choose a frontier AI model

Start with the job, not the leaderboard

Picking a frontier model is a procurement decision, not a fandom. The honest first question is not which model is smartest in the abstract. It is which model clears the bar your task needs at the lowest cost and latency you can live with. A model that is marginally better on a benchmark but four times the price is the wrong choice for a task the cheaper model already handles.

Across the providers ModelDex tracks today, the practical menu sorts into three tiers: small and cheap, mid-tier workhorses, and flagship reasoners. Most production systems end up using more than one of these, routed by task. The skill is matching the tier to the work.

Tier one: small, fast, cheap

These models are built for high volume and tight latency. They handle classification, extraction, routing, short rewrites, and the first pass of a multi-step pipeline. You reach for them when you are calling the model thousands of times and every cent of per-call cost compounds.

In our dataset, the clearest examples are GPT-5.4 nano at $0.20 per million input tokens and $1.25 per million output, GPT-5.4 mini at $0.75 input and $4.50 output, Claude Haiku 4.5 at $1 input and $5 output, Gemini 2.5 Flash at $0.30 input and $2.50 output, and Gemini 3 Flash at $0.50 input and $3.00 output. Grok Build 0.1 sits here too at $1.00 input and $2.00 output. The trade-off is real: these models reason less deeply than the flagships and have smaller context windows. Haiku 4.5 carries a 200,000 token context window and the GPT-5.4 mini and nano models carry 400,000, versus the roughly one million token windows on the larger models.

Tier two: mid-tier workhorses

This is where most general-purpose application work lands. Strong reasoning, large context, sane prices. Claude Sonnet 4.6 at $3 input and $15 output pairs a one million token context window with extended and adaptive thinking. GPT-5.4 at $2.50 input and $15 output carries a 1,050,000 token window and the full reasoning effort dial. Gemini 3.5 Flash, which Google ships as a flagship-grade Flash model, is $1.50 input and $9.00 output with a context window just over one million tokens. Grok 4.3 at $1.25 input and $2.50 output is notably cheap for a flagship-class model and also carries a one million token window.

If you are not sure where to start, start here. A mid-tier model will handle the large majority of real workloads, and you only move up or down once you have evidence that you need to.

Tier three: flagship reasoners

These are the models you bring in for the hardest reasoning, the longest chains of tool calls, the work where a wrong answer is expensive. Claude Opus 4.8 at $5 input and $25 output supports adaptive thinking and a one million token context window. GPT-5.5 at $5 input and $30 output carries a 1,050,000 token window and the full effort dial. For the most demanding reasoning, GPT-5.5 Pro exists at $30 input and $180 output. That output price is the single most important number in this guide: at $180 per million output tokens, GPT-5.5 Pro is six times the output cost of GPT-5.5 and twelve times that of Claude Sonnet 4.6. It is a scalpel, not a default.

The four questions that actually decide it

How hard is the reasoning. If the task is extraction or routing, a tier-one model is plenty. If it is multi-step planning or hard math and code, you want a reasoning model and probably a flagship.

How much context does one call need. If you are stuffing a large codebase or a long document set into a single prompt, you need the headroom. Most of the larger models here sit near one million tokens. Smaller and cheaper models give you 200,000 to 400,000, which is still a lot, but check it against your real inputs.
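One way to run that check, as a minimal sketch: the window sizes below are the ones quoted in this guide, while the function names and the 8,000 token output reserve are hypothetical. The sketch also assumes output tokens share the window with the prompt, which varies by provider, and that you already have a token count from the provider's own tokenizer.

```python
# Context windows in tokens, as quoted in this guide.
CONTEXT_WINDOWS = {
    "Claude Haiku 4.5": 200_000,
    "GPT-5.4 mini": 400_000,
    "GPT-5.4 nano": 400_000,
    "GPT-5.4": 1_050_000,
    "GPT-5.5": 1_050_000,
}


def fits(model, prompt_tokens, reserved_output_tokens=8_000):
    """True if the prompt plus an assumed output reserve fits the window."""
    return prompt_tokens + reserved_output_tokens <= CONTEXT_WINDOWS[model]


def models_that_fit(prompt_tokens, reserved_output_tokens=8_000):
    """List the models whose window can hold this call, in table order."""
    return [
        model
        for model in CONTEXT_WINDOWS
        if fits(model, prompt_tokens, reserved_output_tokens)
    ]


# Hypothetical input: a 350,000 token document set.
print(models_that_fit(350_000))
```

Run it against your largest realistic prompt, not your average one, since a single oversized call is what fails in production.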

How sensitive is the cost. Multiply your expected input and output tokens per call by the per million prices above, then multiply by call volume. Do this before you fall in love with a model. The output price dominates for generative work because output tokens are billed at two to roughly eight times the input price across the models listed here.
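The arithmetic can be sketched in a few lines. The prices are the per million token figures quoted in this guide. The function name and the workload shape (2,000 input tokens, 500 output tokens, 100,000 calls a month) are hypothetical, so substitute your own measurements.

```python
# Prices in USD per million tokens (input, output), as quoted in this guide.
PRICES = {
    "GPT-5.4 nano": (0.20, 1.25),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.5 Pro": (30.00, 180.00),
}


def monthly_cost(model, input_tokens, output_tokens, calls_per_month):
    """Estimated monthly spend in USD for one model and one workload shape."""
    input_price, output_price = PRICES[model]
    # Cost of a single call: tokens times the per million price, scaled down.
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_call * calls_per_month


# Hypothetical workload: 2,000 input tokens, 500 output tokens, 100,000 calls a month.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 2_000, 500, 100_000):,.2f}")
```

With these assumed numbers, Claude Sonnet 4.6 comes to about $1,350 a month and GPT-5.5 Pro to about $15,000. The estimate ignores reasoning tokens, caching discounts, and batch pricing, all of which can move the real bill.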

What latency can you tolerate. Reasoning models think before they answer, which adds time and tokens. For interactive, user-facing features, a faster mid-tier model often beats a slower flagship on the experience even when the flagship scores higher.

A simple default policy

Route the cheap, high-volume work to a tier-one model. Make a strong mid-tier model your general default. Escalate to a flagship only for the specific calls that need it, and turn reasoning effort up only on those. This three-tier routing is how teams get flagship quality where it matters without paying flagship prices on every call.
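As a rough illustration, the policy can be written as a small lookup. Everything here is a hypothetical sketch: the task labels, the choice of one model per tier, and the "low", "medium", and "high" effort strings are placeholders, not any provider's API values.

```python
# One example model per tier, picked from this guide. Swap in your own.
TIER_MODELS = {
    "small": "Claude Haiku 4.5",
    "mid": "Claude Sonnet 4.6",
    "flagship": "Claude Opus 4.8",
}

# Hypothetical task labels your application assigns before calling a model.
HIGH_VOLUME_TASKS = {"classification", "extraction", "routing", "short_rewrite"}
HARD_TASKS = {"multi_step_planning", "hard_math", "hard_code"}


def choose_route(task_type):
    """Return (model, reasoning_effort) for a task label."""
    if task_type in HIGH_VOLUME_TASKS:
        # Cheap, high-volume work goes to the small tier.
        return TIER_MODELS["small"], "low"
    if task_type in HARD_TASKS:
        # Only these calls pay flagship prices and higher reasoning effort.
        return TIER_MODELS["flagship"], "high"
    # Everything else, including unknown labels, gets the mid-tier default.
    return TIER_MODELS["mid"], "medium"


print(choose_route("extraction"))
print(choose_route("summarize_report"))
print(choose_route("hard_math"))
```

Sending unknown task labels to the mid-tier default mirrors the advice above: start in the middle and move up or down only once you have evidence.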

How to read the rest of ModelDex

Every number in this guide comes from the same verified dataset that powers the model pages. Open any model on ModelDex to see its context window, its input and output prices, its reasoning support, and the primary source behind each figure. Pricing and limits change when providers update their docs, and our figures track those provider docs directly, so always treat the live model page as the current truth.