
Anthropic research finds deterministic retrieval tools are the key bottleneck for AI agents in biology, not model capability

Jun 8, 2026; published Jun 14, 2026

Anthropic published research on June 8, 2026, examining why AI agents struggle to reliably retrieve biological data — and how a single deterministic tool layer nearly eliminates the performance gap between models. The paper introduces gget virus, a tool for querying the NCBI Virus database, and shows it raises accuracy from a variable 16.9%–91.3% range to greater than 90% across all models tested.

What's new

The researchers evaluated five AI systems on VirBench, a benchmark of 120 realistic viral sequence queries with manually verified answers:

Models tested: Claude Sonnet 4, Claude Opus 4.7, Biomni OSS, Edison Analysis, and GPT models.
Without deterministic tooling: Accuracy ranged from 16.9% to 91.3% depending on model and query type — large variability that makes agents unreliable for research use.
With gget virus: All models achieved greater than 90% accuracy, regardless of which model was used.
Key architectural finding: "Adding a deterministic retrieval layer made model choice much less important" — because standardized access to the underlying database eliminates the run-to-run variability caused by models interpreting complex database schemas themselves.

The paper introduces VirBench as a public benchmark and gget virus as an open tool, both designed to help the field build more reliable biological agents.
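To make the idea of a deterministic retrieval layer concrete, here is a minimal Python sketch. It does not use the gget virus API or the real NCBI Virus schema; the field names, the allowed host list, and the `build_query` helper are all hypothetical, chosen only for illustration. The point it shows is that the agent fills in a small, validated structure, and fixed code turns that structure into one canonical query, so equivalent requests always produce the same query string.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

# Hypothetical vocabulary for illustration; not the NCBI Virus schema.
ALLOWED_HOSTS = {"human", "bat", "swine"}


@dataclass(frozen=True)
class VirusQuery:
    """Structured request an agent fills in instead of writing raw queries."""
    virus: str
    host: Optional[str] = None
    min_length: Optional[int] = None
    complete_only: bool = False


def build_query(q: VirusQuery) -> str:
    """Validate the request and return a canonical query string."""
    if not q.virus.strip():
        raise ValueError("virus name is required")
    if q.host is not None and q.host.lower() not in ALLOWED_HOSTS:
        raise ValueError(f"unknown host: {q.host!r}")
    if q.min_length is not None and q.min_length < 0:
        raise ValueError("min_length must be non-negative")

    # Normalize case and whitespace so equivalent inputs collapse together.
    params = {"virus": " ".join(q.virus.lower().split())}
    if q.host is not None:
        params["host"] = q.host.lower()
    if q.min_length is not None:
        params["min_length"] = str(q.min_length)
    if q.complete_only:
        params["complete"] = "true"

    # Sort the keys so parameter order never depends on how the agent
    # happened to phrase or order its request.
    return urlencode(sorted(params.items()))


if __name__ == "__main__":
    print(build_query(VirusQuery("  Zika   Virus ", host="Human")))
```

Because validation and normalization live in code rather than in the model's interpretation of a schema, two agents (or two runs of the same agent) that mean the same thing end up issuing the same query, and malformed requests fail loudly instead of returning silently wrong data.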

Context

Biological databases like NCBI Virus are designed for human researchers navigating structured interfaces. AI agents querying these systems face compounding challenges: schemas that models interpret inconsistently, inconsistent output formats, and databases that are not structured for the query patterns agents naturally generate.

The gget ecosystem (the broader toolkit this work extends) has been used in genomics and bioinformatics for deterministic data retrieval. Adding virus sequence capability extends it to a domain where AI-assisted research is accelerating — particularly in surveillance, evolutionary biology, and therapeutic development.

Why it matters

The finding reframes what it means to deploy AI agents in scientific research. The intuitive expectation is that more capable models will produce more reliable results, but this study shows that for structured data retrieval tasks, the infrastructure layer matters more than model tier. As the authors put it: "The bottleneck for biological agents is not only reasoning but the absence of widespread deterministic execution layers for querying biological data."

For practitioners building AI pipelines in biology, the implication is practical: before upgrading the model, audit whether the tools agents call deliver consistent, verifiable outputs. On this benchmark, Claude Sonnet 4 with a well-designed retrieval tool outperforms Claude Opus 4.7 without one.
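One simple way to start such an audit is to call a tool several times with the same input and check that the outputs are identical. The Python sketch below is an illustration under stated assumptions, not a method from the paper: `fingerprint` and `is_consistent` are hypothetical helper names, and the tool is assumed to be a zero-argument callable that returns JSON-serializable data.

```python
import hashlib
import json
from typing import Any, Callable


def fingerprint(result: Any) -> str:
    """Hash a result in a key-order-independent way."""
    # sort_keys makes dict ordering irrelevant; list order still matters.
    payload = json.dumps(result, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_consistent(tool: Callable[[], Any], runs: int = 5) -> bool:
    """Return True if repeated calls to the tool yield identical output."""
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    prints = {fingerprint(tool()) for _ in range(runs)}
    return len(prints) == 1


if __name__ == "__main__":
    # Hypothetical stand-in for a retrieval call with fixed arguments.
    def stable_tool() -> dict:
        return {"count": 2, "ids": ["A1", "B2"]}

    print(is_consistent(stable_tool))
```

A check like this only detects run-to-run variability; it says nothing about whether the stable answer is correct, which is what a benchmark with manually verified answers is for.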

More broadly, the research points to a gap in how the AI community thinks about agent reliability in science. The dominant discussion focuses on reasoning capability; this work argues that data infrastructure designed for agent use — what the authors call "biological data infrastructure that agents can navigate as reliably as humans do" — may be the more pressing bottleneck.

Corroborating sources

Anthropic
https://www.anthropic.com/research/agents-in-biology
“The bottleneck for biological agents is not only reasoning but the absence of widespread deterministic execution layers for querying biological data.”
