Anthropic engineering details how it contains Claude agents across claude.ai, Claude Code, and Claude Cowork

May 25, 2026 (published Jun 4, 2026)

Anthropic's engineering team published a retrospective on 2026-05-25 describing how it contains Claude agents across three products (claude.ai, Claude Code, and Claude Cowork), with a focus on environment-layer isolation rather than model-layer behavior steering. The post outlines the containment patterns the team uses in production, calls out security failures it has hit, and codifies the principles it now applies to new agent surfaces.

What's new

Anthropic states its core operating principle bluntly: "Design for containment at the environment layer first, then steer behavior at the model layer."
The post enumerates the containment primitives Anthropic uses in production: ephemeral containers, human-in-the-loop sandboxes, and virtual machines. Each maps to a different agent product and trust posture.
Anthropic frames the rationale in terms of capability rather than supervision: "Rather than supervising what the agent does, we supervise what it's able to do by enforcing access boundaries through, for example, sandboxes, virtual machines, and egress controls."
The team documents an explicit pattern for unattended use in Claude Code: "Claude Code's reference devcontainer exists precisely so that the agent can run unattended, without per-action approvals." Tight perimeters, in their phrasing, are what buy you the option to relax supervision.
The post argues for leaning on mature infrastructure rather than custom code: "Battle-tested hypervisors, syscall filters, and container runtimes have survived more adversarial attention than anything you'll build," with the corollary that "the weakest layer is the one you built yourself."
For credential handling specifically, Anthropic describes a scoped-token pattern: "Credentials stay in the host keychain, the VM gets a per-session scoped-down token, and that token can be revoked independently of the user's."
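
The scoped-token pattern in the last item can be sketched in a few lines. This is only an illustration under stated assumptions, not Anthropic's implementation: the `TokenBroker` and `SessionToken` names, the scope strings, and the 15-minute default lifetime are all hypothetical. The idea it shows is that the host keeps the long-lived credential, the VM only ever sees a short-lived token with a narrow scope set, and revoking that token leaves the user's own credential untouched.

```python
import secrets
import time


class SessionToken:
    """Hypothetical per-session token given to the VM in place of the user's credential."""

    def __init__(self, value, scopes, expires_at):
        self.value = value
        self.scopes = frozenset(scopes)
        self.expires_at = expires_at


class TokenBroker:
    """Hypothetical host-side broker; the user's real credential never leaves the host."""

    def __init__(self):
        self._issued = {}  # token value -> SessionToken

    def issue(self, scopes, ttl_seconds=900.0):
        # Mint a random, short-lived token limited to the requested scopes.
        token = SessionToken(
            value=secrets.token_urlsafe(32),
            scopes=scopes,
            expires_at=time.time() + ttl_seconds,
        )
        self._issued[token.value] = token
        return token

    def revoke(self, value):
        # Revoking a session token does not touch the user's own credential.
        self._issued.pop(value, None)

    def authorize(self, value, scope):
        # Allow only tokens that are known, unexpired, and carry the scope.
        token = self._issued.get(value)
        if token is None or time.time() >= token.expires_at:
            return False
        return scope in token.scopes
```

With this sketch, a token issued for `{"repo:read"}` authorizes `"repo:read"` but not `"repo:write"`, and stops authorizing anything once `revoke` is called or its lifetime elapses.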

Context

Anthropic's three named agent surfaces correspond to very different risk profiles. claude.ai serves the general public with low blast radius; Claude Code runs on developer machines with broad filesystem and shell access; Claude Cowork operates inside enterprise environments with access to corporate data and tooling. The post is unusual among lab engineering write-ups because it explicitly maps containment primitives onto each surface rather than presenting a single one-size-fits-all approach.

The write-up also follows a year of incidents across the agent ecosystem in which prompt-injection or tool-use mishandling caused agents to exfiltrate data, modify files outside intended scopes, or chain to external APIs in ways their developers did not anticipate. Anthropic's containment-first thesis is a direct response: don't try to make the model perfectly trustworthy at the behavior layer; build the environment such that even an adversarially steered agent can't do real harm.
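
As a rough illustration of what an egress control looks like in its simplest form, the sketch below checks outbound URLs against a host allowlist. It is an assumption-laden example, not something taken from the post: the `egress_allowed` function and the hostnames in `ALLOWED_HOSTS` are hypothetical, and a real deployment would more likely enforce this at a proxy or firewall outside the agent's process, so that a steered agent cannot bypass it.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; real deployments would load this from policy.
ALLOWED_HOSTS = frozenset({"api.example.com", "pypi.org"})


def egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True only for HTTPS URLs whose exact host is on the allowlist."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
    except ValueError:
        # Malformed URLs (for example a broken IPv6 literal) are denied.
        return False
    if parts.scheme != "https":
        return False
    # Exact match, so "api.example.com.evil.net" does not slip through.
    return host in allowed_hosts
```

The check is deny-by-default: anything that is not an exact match on scheme and host is refused, which is the property that matters when the request may have been shaped by injected instructions.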

Why it matters

The engineering substance here is the more interesting half of the post. Labs now routinely ship agents, but very few publish how they isolate them. Anthropic naming hypervisors, syscall filters, and per-session VM tokens as core primitives, and explicitly putting them ahead of model-layer steering, is a useful counterweight to the louder conversation about RLHF and constitutional AI as the primary safety tools.

For downstream builders, the post functions as a reference architecture. Teams shipping agent products on Claude, OpenAI, or Gemini can lift the patterns directly: ephemeral containers for low-trust workloads, sandboxed VMs for code-execution agents, scoped credential tokens that can be revoked independently of the user. The fact that Anthropic flags "the weakest layer is the one you built yourself" is also a tactical hint — enterprise security teams should resist the temptation to roll bespoke sandboxes and instead inherit hardened container/VM runtimes.

Corroborating sources

Anthropic
https://www.anthropic.com/engineering/how-we-contain-claude
“Design for containment at the environment layer first, then steer behavior at the model layer.”
