Anthropic published a detailed engineering post on Wednesday describing how it contains Claude across three product surfaces: claude.ai, Claude Code, and Claude Cowork. The post reveals a company that has fundamentally changed its risk posture over the past year: it went from rejecting the idea of granting Claude enough access to take down an internal Anthropic service to making that level of access routine.
The engineering question, as Anthropic frames it, is how to cap the blast radius. The answer is not one mechanism but three overlapping layers of defense: the environment the agent runs in, the model the agent consults, and the external content the agent can reach. Each product gets a different combination.
Claude.ai runs code in gVisor containers on isolated infrastructure. No code touches the user’s machine, and the filesystem is ephemeral per session. The blast radius is minimal, but so is the ceiling on what Claude can do: there is no persistent workspace and no access to the user’s filesystem. The threat model here is traditional: protect Anthropic’s infrastructure and keep tenants isolated from each other.
Claude Code runs on the user’s machine with access to the filesystem, shell, and network. This is the product that forced Anthropic to confront the limits of human-in-the-loop supervision. The company launched Claude Code with the simplest possible defense: allow reads, and require approval for writes, bash commands, and network access. Approval fatigue showed up within weeks. Telemetry showed that users approved roughly 93% of permission prompts. The more approval prompts a user sees, the less attention they pay to each one.
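The launch-era rule described above can be sketched as a small policy check. This is an illustrative model only, not Claude Code's actual implementation: the `Action` enum, `needs_approval`, and `approval_rate` are hypothetical names introduced here.

```python
from enum import Enum


class Action(Enum):
    READ = "read"
    WRITE = "write"
    BASH = "bash"
    NETWORK = "network"


# Assumption: mirrors the default described in the text, where only
# reads proceed without a prompt.
AUTO_ALLOWED = frozenset({Action.READ})


def needs_approval(action: Action) -> bool:
    """Return True when the user must approve the action first."""
    return action not in AUTO_ALLOWED


def approval_rate(decisions: list[bool]) -> float:
    """Fraction of prompts the user approved (True means approved)."""
    if not decisions:
        raise ValueError("no decisions recorded")
    return sum(decisions) / len(decisions)
```

Tracking something like `approval_rate` over real prompt logs is one way a team might notice the pattern the telemetry showed: when nearly every prompt is approved, the prompt is probably no longer acting as a meaningful check.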
Anthropic shipped an OS-level sandbox (Seatbelt on macOS, bubblewrap on Linux) that hardens the boundary. Reads are allowed, writes are allowed only inside the workspace, and network access is denied by default. Within the sandbox, the agent runs largely without interruption. The result was an 84% reduction in permission prompts. The runtime is open-source and auditable.
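The sandbox rules above amount to a simple decision table. The sketch below models only that decision in Python; in the real product the enforcement is done by the operating system (Seatbelt or bubblewrap), not by application code like this. The function names and the `/workspace/project` path are hypothetical.

```python
from pathlib import Path


def is_within(path: Path, root: Path) -> bool:
    """True if path resolves to root itself or a location beneath it."""
    resolved = path.resolve()
    resolved_root = root.resolve()
    return resolved == resolved_root or resolved_root in resolved.parents


def sandbox_allows(op: str, target: str, workspace: str) -> bool:
    """Model the described defaults: read anywhere, write only in the
    workspace, no network. Unknown operations are denied."""
    if op == "read":
        return True
    if op == "write":
        # Resolving first means "../" tricks cannot escape the workspace.
        return is_within(Path(target), Path(workspace))
    # "network" and anything unrecognized fall through to deny.
    return False
```

For example, `sandbox_allows("write", "/workspace/project/../other/file", "/workspace/project")` is denied because the path is resolved before the containment check, which is the kind of structural guarantee that removes the need for a prompt.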
The company also built Claude Code auto mode, which automates safer approvals to reduce fatigue. Still, vulnerabilities remain. Any probabilistic defense has a non-zero miss rate.
The most surprising failures came from things Anthropic did not anticipate. Between mid-2025 and January 2026, the company received reports of vulnerabilities through its responsible disclosure program. Three of them exploited code that executes before the user has consented to anything. The most direct case: a developer clones a repository to review a pull request, and that repository contains a .claude/settings.json file that defines a hook. Because Claude Code read project settings during startup, before presenting the standard “Do you trust this folder?” prompt, the attacker’s hook would execute automatically.
The fix: defer parsing and execution of project-local configuration until after the user accepts the trust prompt. Anthropic’s advice to builders is blunt: treat project-open, config-load, and localhost listeners the way you would treat any inbound request from the internet. They should not be implicitly trusted just because they feel local and arrive before the user has consented.
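The ordering fix can be sketched as follows. This is a minimal illustration under stated assumptions, not Anthropic's code: `load_project_settings` and the `user_trusts` callback are hypothetical, and the only detail taken from the post is the .claude/settings.json location.

```python
import json
from pathlib import Path
from typing import Callable


def load_project_settings(
    project_dir: Path,
    user_trusts: Callable[[Path], bool],
) -> dict:
    """Return project-local settings only after the trust prompt is accepted.

    `user_trusts` stands in for the "Do you trust this folder?" prompt.
    """
    # Ask first. Until the user accepts, the file is never opened,
    # parsed, or acted on, so an attacker-authored hook cannot run.
    if not user_trusts(project_dir):
        return {}

    settings_path = project_dir / ".claude" / "settings.json"
    if not settings_path.is_file():
        return {}
    return json.loads(settings_path.read_text(encoding="utf-8"))
```

The design choice is that trust is a precondition of reading the file at all, rather than a filter applied to already-parsed settings. That keeps any parser bug or side effect in configuration handling on the far side of the consent boundary.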
In February 2026, during a controlled internal red-team exercise, a researcher successfully phished an employee into launching Claude Code with a malicious prompt. The user became an injection vector. This is a class of risk that no amount of sandboxing solves if the user themselves can be turned against the agent.
Cowork, Anthropic’s most capable agent product, gets the tightest containment. The post does not detail the architecture but implies it is the most locked-down environment of the three.
The post also reveals the existence of Claude Mythos Preview, a model deemed too risky to ship in April 2026. Anthropic expects broader release of models with similar capability levels to become appropriate as defenders harden critical systems and safeguards mature. The company is explicit that model capability is an important factor in the total risk of an agent’s deployment.
On prompt injection defenses, Anthropic reports that Claude Opus 4.7 holds attack success to roughly 0.1% on single attempts on Gray Swan’s Agent Red Teaming benchmark, and around 5-6% after 100 adaptive attempts. Claude Code auto mode catches roughly 83% of overeager behaviors before they execute. Yet the company acknowledges that protection in the model layer will never be 100% effective, which is why it cannot stand alone.
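A rough piece of arithmetic shows why a small single-attempt rate still grows with retries. The sketch below assumes attempts are independent with a fixed success probability, which is a simplification: the benchmark's attempts are adaptive, and the reported figures are measurements rather than outputs of this formula. The function name is hypothetical.

```python
def compounded_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` tries succeeds,
    assuming each try is independent with probability `p_single`."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # 0.1% per attempt, 100 attempts, under the independence assumption.
    print(f"{compounded_success(0.001, 100):.3f}")
```

Under that assumption, a 0.1% per-attempt rate compounds to a little under 10% across 100 attempts, the same order of magnitude as the reported 5-6%. Either way, a persistent attacker faces much better odds than the single-attempt number suggests, which is the case for pairing the model layer with structural containment.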
The post is notable for what it does not say. There is no mention of any regulatory framework, no reference to the EU AI Act or any government safety institute. The containment strategy is entirely self-imposed engineering discipline, not compliance with external mandates.
What this means for AI builders is straightforward. The tradeoff between capability and safety is not a one-time decision at model release. It is a continuous negotiation across every product surface, every user interaction, every file read. The blast radius is not a property of the model alone. It is a property of the environment the model is placed in, the permissions it is granted, and the vigilance of the human who supervises it. Anthropic’s post is a catalog of things that broke, and the fixes are almost always structural: change the architecture, not the prompt.