Anthropic API: Advisor Tool Gets Output Cap, Refusal-Only Requests Now Free
Anthropic on June 2 shipped two developer-facing improvements to the Claude API: a new parameter for capping the advisor tool's output per call, and a billing change that eliminates charges on API requests that return a safety refusal without generating any content.
What's new
Advisor tool max_tokens parameter:
The advisor tool — which pairs a fast executor model with a high-intelligence advisor model to provide strategic guidance mid-generation — now supports a max_tokens parameter on the tool definition. According to the API release notes: "The advisor tool now supports a max_tokens parameter to cap the advisor model's output per call, reducing latency and output token cost for workloads that don't need full-length advisor responses."
To apply it, set tools[].max_tokens in the advisor tool definition. Full documentation is in the "Capping advisor output" section of the Claude API docs.
Free refusal requests:
API requests that return stop_reason: "refusal" with no generated output are now free. From the release notes: "On the Claude API, you are no longer billed for a request when it returns stop_reason: 'refusal' without Claude having generated any output."
The stop_details field on refusal responses returns a structured payload: a category (cyber, bio, or null) and a human-readable explanation. Applications can use this to route different classes of refusal to the appropriate handling logic.
Context
The advisor tool launched in April 2026 in public beta. Its core design is a multi-model orchestration pattern: a cheap, fast executor model generates most tokens while a more capable advisor model provides high-level guidance at key decision points. Without a cap on advisor output, verbose advisor responses can negate the cost savings of using an executor, particularly in high-throughput pipelines.
On the refusal billing side: API requests involve input token processing regardless of whether Claude generates output. For requests that Claude refuses entirely — producing zero output tokens — developers were previously charged for that input processing. At scale, applications operating in content domains that occasionally trigger safety filters were absorbing a nontrivial cost for interactions that provided no value.
Why it matters
Both changes target production deployment economics rather than capability. The max_tokens parameter gives teams precise control over the cost-latency tradeoff in multi-model pipelines without requiring prompt engineering workarounds to shorten advisor responses artificially.
The refusal billing fix is narrower, but meaningful for applications operating at scale in content domains that occasionally cross Claude's safety thresholds. The addition of structured stop_details is a practical bonus: applications can now programmatically distinguish between, say, a cyber-category refusal and a bio-category refusal, enabling more precise routing and logging.
Neither change requires a new beta header or integration update — they take effect automatically for existing API consumers.
Corroborating sources
- Platform.claude
https://platform.claude.com/docs/en/release-notes/api
“The advisor tool now supports a max_tokens parameter to cap the advisor model's output per call, reducing latency and output token cost for workloads that don't need full-length advisor responses.”