Loading…
Loading…
Plain-language definitions for the terms that show up across model specs and pricing pages. No jargon for its own sake — just what each one means when you're comparing models.
The maximum amount of text a model can consider at once, measured in tokens. It covers everything in a single request — your prompt, any documents you attach, and the model's own response. Larger windows let a model reason over more material without losing track of earlier parts.
The units a model reads and writes. A token is a chunk of text — roughly four characters, or about three-quarters of a word in English. Pricing and context limits are counted in tokens, not words, so a 1,000-word document is closer to 1,300 tokens.
The date after which a model has no built-in knowledge from its training data. Events, releases, and facts that emerged after the cutoff are unknown to the model unless they're supplied at request time through tools or retrieval.
What a model costs to run, quoted per million tokens (Mtok). Input is what you send to the model; output is what it generates back. Output is almost always priced higher than input, and on most models the two rates differ significantly.
A mode where the model spends extra compute working through a problem step by step before answering. It improves accuracy on hard math, coding, and logic tasks, but adds latency and consumes more tokens — so it costs more and takes longer than a direct response.
The ability for a model to call external functions or APIs you define, rather than only producing text. The model decides when a tool is needed, returns a structured request to invoke it, and incorporates the result. This is how models fetch live data, run code, or take actions.
Support for inputs beyond text. A multimodal model can read images — and sometimes audio or video — and reason about them alongside a text prompt. Vision specifically refers to image understanding, such as describing a chart or extracting text from a screenshot.
A provider's most capable current model — the one that defines the top of its lineup. Flagships are typically the most expensive and highest-latency option, traded off against smaller, faster, cheaper models for routine work.
The internal values a model learns during training, often counted in billions. Loosely, more parameters mean more capacity to learn patterns. It's a rough proxy for capability, not a guarantee — architecture and training data matter as much as raw count.
Further training of an existing model on your own examples to specialize its behavior — a particular tone, format, or domain. It adapts the model's weights, unlike prompting, which only changes the instructions you give at request time.
A pattern where relevant documents are fetched from an external store and inserted into the prompt so the model can answer using current or private information. RAG keeps answers grounded in real sources without retraining the model.
An architecture that splits a model into many specialized sub-networks (experts) and activates only a few per token. This lets a model hold a very large total parameter count while keeping the compute cost of any single request comparatively low.
How long a model takes to respond. It's often split into time-to-first-token (how fast output starts) and overall generation speed. Reasoning modes and larger models tend to raise latency; smaller models and streaming reduce the perceived wait.
A cap on how much you can call a model in a given window, usually expressed as requests per minute or tokens per minute. Limits scale with your usage tier and protect shared capacity; exceeding them returns an error until the window resets.