Perplexity introduces Search as Code, an agentic-search architecture that beats fixed search APIs on four of five benchmarks

Jun 1, 2026 (published Jun 4, 2026)

Perplexity on June 1 published research introducing Search as Code (SaC), an architecture in which AI agents compose search pipelines programmatically from atomic building blocks rather than calling a fixed search endpoint. In Perplexity's own evaluations, an SaC-equipped agent running on GPT-5.5 topped four of five tested benchmarks against rival agentic-search systems including the OpenAI Responses API, Anthropic's Managed Agents on Opus 4.7, the Exa Agent, and Parallel Tasks.

What's new

A new architecture for agentic search. "Search itself must become agentic, with its building blocks accessible directly as SDKs within the agent harness."
A reframing of how models should interact with search. "Models should not merely call a search engine. They should be able to orchestrate the individual pieces of the search stack as the specific task demands."
A benchmark suite covering five tasks: DeepSearchQA (DSQA), BrowseComp, Humanity's Last Exam (HLE), WideSearch, and a newly introduced benchmark called WANDR.
Reported results: Perplexity's SaC implementation on GPT-5.5 leads four of the five benchmarks, with its largest margin on WANDR, where it is 2.5x ahead of the next-best system.
A comparison set that is unusually frank by industry standards: the post evaluates SaC against the OpenAI Responses API (GPT-5.5), Anthropic Managed Agents (Opus 4.7), the Exa Agent, and Parallel Tasks.

Context

The dominant pattern in agentic systems today is to expose web search as a single high-level tool call: the model emits a query string, an opaque ranker returns a list of URLs or snippets, and the model continues. That pattern is what powers OpenAI's web-search tool, Anthropic's web-search and web-fetch tools, and most third-party retrieval middleware.
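The single-call pattern described above can be sketched roughly as follows. This is an illustrative toy, not any vendor's actual API: `web_search`, `Snippet`, the in-memory `_INDEX`, and the term-overlap ranking are all assumptions made up for the example. The point it tries to show is that the agent's only lever is the query string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    url: str
    text: str


# Toy in-memory "index" standing in for the web (hypothetical data).
_INDEX = [
    Snippet("https://example.com/a", "Rust ownership rules explained"),
    Snippet("https://example.com/b", "Python packaging guide"),
    Snippet("https://example.com/c", "Rust async runtime comparison"),
]


def web_search(query: str, top_k: int = 2) -> list[Snippet]:
    """Hypothetical single-call search tool; ranking is opaque to the caller."""
    terms = query.lower().split()
    # Score each snippet by how many query terms appear in its text.
    scored = [(sum(t in s.text.lower() for t in terms), s) for s in _INDEX]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [s for score, s in ranked if score > 0][:top_k]


def answer(question: str) -> list[str]:
    """The agent emits one query string and accepts whatever comes back."""
    return [s.url for s in web_search(question)]
```

Calling `answer("rust async")` returns the two Rust pages, best match first. The caller cannot influence source selection, freshness, or deduplication; it can only rephrase the query and call again.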

Perplexity's argument is that the high-level tool call leaves performance on the table. A real search stack has many movable pieces (query rewriting, source selection, snippet extraction, freshness filtering, deduplication, citation grounding), and when those pieces are exposed as SDK primitives, the model can choose how to assemble them per query rather than accepting whatever the ranker decided. WANDR appears to be Perplexity's attempt to introduce a benchmark that specifically rewards this kind of orchestration, whereas the existing benchmarks (BrowseComp, DSQA, HLE, WideSearch) come from prior agentic-search work.
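To make the idea of composing search primitives concrete, here is a minimal sketch under stated assumptions. It is not Perplexity's SDK: every function name (`rewrite_query`, `retrieve`, `select_sources`, `filter_fresh`, `dedupe`, `cite`), the `Doc` type, and the toy `CORPUS` are hypothetical, and snippet extraction is left out for brevity. The `pipeline` function stands in for the code an agent might generate for one specific task.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    url: str
    source: str
    published: date
    text: str


# Hypothetical corpus; URLs, dates, and text are invented for illustration.
CORPUS = [
    Doc("https://example.com/sac", "paper", date(2026, 6, 1), "search as code architecture"),
    Doc("https://example.com/old", "docs", date(2023, 3, 1), "search api reference"),
    Doc("https://example.com/forum", "forum", date(2026, 2, 1), "benchmark results discussion"),
    Doc("https://example.com/bench", "docs", date(2025, 9, 1), "benchmark suite for agentic search"),
]


def rewrite_query(question: str) -> list[str]:
    """Query rewriting: split a compound question into sub-queries."""
    return [part.strip() for part in question.split(" and ") if part.strip()]


def retrieve(query: str, corpus: list[Doc]) -> list[Doc]:
    """Naive retrieval: keep docs containing any query term."""
    terms = query.lower().split()
    return [d for d in corpus if any(t in d.text.lower() for t in terms)]


def select_sources(docs: list[Doc], allowed: set[str]) -> list[Doc]:
    """Source selection: keep only trusted source types."""
    return [d for d in docs if d.source in allowed]


def filter_fresh(docs: list[Doc], not_before: date) -> list[Doc]:
    """Freshness filtering: drop docs published before the cutoff."""
    return [d for d in docs if d.published >= not_before]


def dedupe(docs: list[Doc]) -> list[Doc]:
    """Deduplication by URL, preserving first-seen order."""
    seen: set[str] = set()
    unique = []
    for d in docs:
        if d.url not in seen:
            seen.add(d.url)
            unique.append(d)
    return unique


def cite(docs: list[Doc]) -> list[str]:
    """Citation grounding: number each surviving source."""
    return [f"[{i}] {d.url}" for i, d in enumerate(docs, start=1)]


def pipeline(question: str) -> list[str]:
    """One possible composition; an agent could assemble a different one per task."""
    hits: list[Doc] = []
    for sub_query in rewrite_query(question):
        hits.extend(retrieve(sub_query, CORPUS))
    hits = select_sources(hits, {"docs", "paper"})
    hits = filter_fresh(hits, date(2025, 1, 1))
    return cite(dedupe(hits))
```

With this toy corpus, `pipeline("search architecture and benchmark suite")` keeps the paper and the recent docs page and drops the stale reference and the forum thread. A different task might skip the freshness filter or allow forum sources, and that per-task choice is what a fixed endpoint does not expose.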

The 2.5x lead on WANDR comes with the obvious caveat that WANDR is Perplexity's own benchmark. The more interesting result is that SaC also leads on three of the four pre-existing benchmarks, none of which the team authored.

Why it matters

If the SaC framing holds up under independent replication, it changes who has leverage in the agentic-search stack. The fixed-API model concentrates value at whichever provider owns the search endpoint and ranker. An SDK-of-primitives model pushes that value down toward whoever can run a fast, well-componentized search infrastructure that agents want to orchestrate, which is the business Perplexity has been building.

It also reframes what application developers should be evaluating. The relevant question becomes less "which web-search tool API has the best snippets" and more "which provider exposes the right granularity of search primitives, and how good is my agent at composing them." That likely accelerates a split in the market between search providers offering opinionated single-call APIs and those offering low-level composable stacks.

For the major model labs, the post is a useful provocation. OpenAI's Responses API and Anthropic's Managed Agents are both moving in the other direction, bundling more of the orchestration inside the platform. Perplexity is betting the opposite: that the next leg of agentic-search performance comes from giving the agent more control, not less.

Corroborating sources

Research.perplexity
https://research.perplexity.ai/articles/rethinking-search-as-code-generation
“Search itself must become agentic, with its building blocks accessible directly as SDKs within the agent harness.”
