Anthropic's internal data shows AI is closing the loop on its own development

Key points

Anthropic engineers merge 8x as much code per quarter as they did in 2024, with over 80% of merged code authored by Claude as of May 2026.

The length of task Claude can complete autonomously grew from 4 minutes in 2024 to 12 hours in 2026, with the doubling time shortening from 7 months to 4.

Anthropic warns recursive self-improvement may increase the risk of humans losing control, as the bottleneck shifts from execution to goal selection.

Anthropic releases internal data on recursive self-improvement. Engineers merge 8x as much code. Claude writes more than 80% of merged code. The path to autonomous AI development is visible.

Tessera Newsroom · 5 min read · June 5, 2026

Source: When AI Builds Itself: Our progress toward recursive self-improvement (anthropic.com)

Anthropic published internal data this week showing that AI systems are already accelerating their own development at a pace most institutions are not prepared for. The company’s Anthropic Institute released previously unreported metrics on how much code Claude writes, how fast engineers ship, and how close the industry is to a system that could design its own successor.

The headline number: Anthropic engineers today merge 8x as much code per quarter as they did in 2024. More than 80% of the code merged into Anthropic’s codebase as of May 2026 was authored by Claude. Before Claude Code launched in February 2025, that figure was in the low single digits.

These numbers come from Anthropic itself, and the company is transparent about the caveats. Lines of code is an imperfect productivity measure. The 8x figure “almost certainly” overstates true productivity gains, Anthropic says. But the direction is unambiguous: AI is writing the code that builds the next AI, and the humans are shifting from typing to directing and reviewing.

The trend is visible on public benchmarks too. SWE-bench, which tests real-world software engineering by asking models to fix bugs in open-source codebases, went from low single-digit scores to saturation in two years. CORE-Bench, which tests whether a model can reproduce published research, went from roughly 20% success in 2024 to saturation fifteen months later. The length of tasks models can complete autonomously has been doubling roughly every four months, faster than the earlier trend of a doubling every seven months.

Claude Opus 3 in March 2024 could complete tasks that take humans about four minutes. Claude Sonnet 3.7 a year later managed tasks taking about 90 minutes. Claude Opus 4.6 a year after that handled 12-hour tasks. If the trend holds, tasks taking days could come into range this year. In 2027, systems could handle tasks that take a person weeks.
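The arithmetic behind those figures can be checked with a short sketch. It assumes the three data points above are spaced exactly 12 months apart (the article says only "a year later") and that growth is a clean exponential. The function names are illustrative and do not come from Anthropic's report, and this is a rough consistency check rather than the method Anthropic used to derive its doubling times.

```python
import math


def doubling_time_months(start_minutes, end_minutes, elapsed_months):
    """Months per doubling implied by growth from start to end."""
    doublings = math.log2(end_minutes / start_minutes)
    return elapsed_months / doublings


def extrapolate_minutes(current_minutes, months_ahead, doubling_months):
    """Task length after months_ahead, assuming steady doubling."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


# Task lengths quoted in the article, in minutes.
# Assumption: each model is 12 months after the previous one.
opus_3 = 4            # March 2024
sonnet_3_7 = 90       # about a year later
opus_4_6 = 12 * 60    # about a year after that

# First interval: roughly 2.7 months per doubling.
print(doubling_time_months(opus_3, sonnet_3_7, 12))

# Second interval: 90 minutes to 12 hours is an 8x jump, or 3 doublings.
print(doubling_time_months(sonnet_3_7, opus_4_6, 12))  # 4.0

# One more year at a four-month doubling time, in hours.
print(extrapolate_minutes(opus_4_6, 12, 4) / 60)  # 96.0
```

Under these assumptions, the jump from 90 minutes to 12 hours works out to exactly one doubling every four months, and one more year on that curve gives 96 hours of human-equivalent work, which is where "tasks taking days" comes from. With only three data points, the estimate is sensitive to the exact release dates.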

The more interesting data comes from inside Anthropic’s engineering and research teams. A March 2026 poll of 130 employees across research teams found the median respondent estimated producing around 4x as much output with Mythos Preview as they would have without any AI models. Anthropic says the true uplift was likely somewhat lower, but calls the overall claim “plausible” and in line with other observations.

Claude’s success rate on the most open-ended tasks reached 76% in May 2026, up 50 percentage points in six months. These are tasks with no clear specification, where the engineer does not know what the answer looks like. Anthropic gives an example: a routine upgrade began crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than text content and cluster access. Claude isolated the obscure debugging flag triggering the crash, reproduced it, and confirmed a fix in about two hours. That would normally be two to three days of work.

The quality of Claude-written code is improving too. Anthropic says there is not full consensus among staff, but many believe Claude-written code was still worse than human-written code in late 2025 and is “roughly at parity today.” The company expects it to be “strictly better within the year.”

Anthropic now uses an automated Claude reviewer that reads every proposed change to the codebase before merge, looking for bugs and security flaws. A retrospective analysis found that this automated review would have caught roughly a third of the bugs behind past incidents on claude.ai before they ever reached production. The engineers who wrote that code are among the best in the world at building these systems. Claude is catching their mistakes.

The gap that remains is judgment. Claude can match or outperform skilled humans at executing a well-specified experiment. It can be handed an underspecified problem and figure out how to solve it. But large performance gaps persist when it comes to Claude exercising judgment in choosing goals. That is the gap between AI today and a system that could autonomously design its own successor.

Anthropic frames this as a progression. From 2021 to 2023, humans wrote all the code on laptops. From 2023 to 2025, chatbots helped with snippets. From 2025 to 2026, coding agents wrote entire files. Today, autonomous agents run code themselves and delegate work to other agents. The next step, labeled “20XX?” in Anthropic’s timeline, is agents capable enough to build and train models themselves.

The company is careful to say recursive self-improvement is not inevitable. But the data suggests it is approaching faster than most institutions expect. The doubling time for autonomous task length has already accelerated from seven months to four months. The benchmarks are saturating in months, not years. The code quality gap between human and AI is closing within a single year.

For AI builders, the implication is straightforward: the bottleneck is shifting from engineering execution to goal selection. If Claude can already write most of the code and catch bugs that human experts miss, the marginal value of another engineer typing code is dropping fast. The marginal value of an engineer who can specify the right problem, set the right goal, and evaluate the result is rising.

For AI safety, the implications are more uncomfortable. Anthropic explicitly notes that full recursive self-improvement “might increase the risks of humans losing control over AI systems.” If systems can build their own successors, the ways we secure them, monitor them, and shape their behavior all grow much more important. The company is publishing this data in part to push institutions to prepare.

The data suggests that preparation time is shorter than it looks. The trend lines are exponential, not linear. The gap between assisted and autonomous development is narrowing on every measurable dimension. The question is not whether a system can write code that builds the next model. The question is when the system no longer needs a human to tell it what to build.
