GPT-5.5 Codex has a reasoning-token clustering problem at 516, 1034, and 1552

An analysis of 390,000 Codex responses shows GPT-5.5 clusters at fixed reasoning-token counts, coinciding with a sharp drop in reasoning intensity and task quality.

Tessera Newsroom · July 5, 2026

Source: "GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance" (github.com)

A GitHub issue filed against OpenAI’s Codex repository presents troubling evidence that GPT-5.5, the company’s latest coding model, has a systematic reasoning-token clustering problem. The issue, posted on June 27, analyzes 390,195 response-level token records across 865 sessions and finds that GPT-5.5 responses disproportionately land at exactly 516 reasoning output tokens, with additional fixed-boundary spikes at 1034 and 1552.

The pattern is stark. GPT-5.5 accounts for only 19.3 percent of all responses in the dataset, but it represents 82.0 percent of exact-516 events. Its ratio of exact-516 responses to responses with 516 or more reasoning tokens is 44.0 percent. For non-GPT-5.5 models, that ratio is 1.3 percent. That is a roughly 34x difference.
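
Both comparisons follow directly from the reported percentages. As a quick check, the sketch below redoes the arithmetic in Python using only the figures quoted above; the variable names are illustrative and do not come from the issue.

```python
# Figures quoted from the GitHub issue; variable names are illustrative.
gpt55_exact_516_ratio = 44.0     # % of GPT-5.5 responses with >= 516 reasoning tokens that stop at exactly 516
other_exact_516_ratio = 1.3      # the same ratio for non-GPT-5.5 models
gpt55_share_of_responses = 19.3  # % of all responses in the dataset
gpt55_share_of_exact_516 = 82.0  # % of all exact-516 events

# How much more often GPT-5.5 stops at exactly 516 than other models do.
fold_difference = gpt55_exact_516_ratio / other_exact_516_ratio

# How over-represented GPT-5.5 is among exact-516 events relative to its traffic share.
over_representation = gpt55_share_of_exact_516 / gpt55_share_of_responses

print(f"exact-516 ratio, GPT-5.5 vs. others: {fold_difference:.1f}x")
print(f"GPT-5.5 over-representation among exact-516 events: {over_representation:.1f}x")
```

The first figure comes out at about 33.8, which rounds to the 34x cited. The second, about 4.2, says GPT-5.5 shows up among exact-516 events roughly four times as often as its share of responses alone would predict.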

The anomaly is not static. It grew dramatically over the five-month observation window. In February 2026, exact-516 events made up 0.11 percent of responses with at least 516 reasoning tokens. By May, that figure had jumped to 53.30 percent. In June it fell back to 35.84 percent, still more than 300 times the February baseline.

What the numbers show

The issue’s author, a user identified as vguptaa45, analyzed token_count metadata from Codex’s telemetry. They compared GPT-5.5 against GPT-5.4, GPT-5.2, GPT-5.3-codex, and GPT-5.3-codex-spark. The exact-516 ratio varies sharply by model: GPT-5.4 shows 19.8 percent, GPT-5.2 shows 0.34 percent, and the two GPT-5.3 Codex variants show 0.0 percent.

The clustering at 516, 1034, and 1552 looks like threshold boundaries. The three values are evenly spaced, exactly 518 tokens apart, which is not what natural stopping points in a distribution that varies with task complexity would produce. They look like budget caps, scheduler decisions, or fallback behaviors baked into the model’s inference pipeline.

The timing is suspicious. As exact-516 clustering rose, overall reasoning-token intensity fell. Mean reasoning tokens per response dropped from 268.1 in February to 106.9 in May. P90 reasoning tokens fell from 772 to 344 over the same period. The model is not just clustering at a fixed value. It is reasoning less overall.
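
To put the decline in relative terms, here is a minimal sketch that computes the percentage drop from the reported February and May figures. It also shows one way to compute a P90 from raw token counts; the issue does not say which percentile definition it used, so the nearest-rank method here is an assumption, and the sample list is made up for illustration.

```python
def relative_drop(before, after):
    """Percentage decline from `before` to `after`."""
    return (before - after) / before * 100


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile. This definition is an assumption;
    the issue does not state how its P90 was computed."""
    ordered = sorted(values)
    # ceil(pct / 100 * n), done in integer arithmetic to avoid float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Reported figures: February vs. May.
mean_drop = relative_drop(268.1, 106.9)
p90_drop = relative_drop(772, 344)
print(f"mean reasoning tokens fell {mean_drop:.1f}%")
print(f"P90 reasoning tokens fell {p90_drop:.1f}%")

# Hypothetical per-response reasoning-token counts, for illustration only.
sample = [0, 40, 80, 120, 200, 300, 516, 516, 772, 1034]
sample_p90 = nearest_rank_percentile(sample, 90)
```

On the reported numbers, the mean fell by about 60 percent and the P90 by about 55 percent between February and May.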

Performance degradation is the real story

The GitHub issue references a related report, #29353, which describes a task-level reproduction where GPT-5.5 runs ending at exactly 516 reasoning tokens returned the wrong answer. The aggregate data in the new issue adds weight to that claim. If a model consistently stops reasoning at an arbitrary token boundary, its performance on complex tasks will suffer.

The issue’s author is careful. They do not claim this proves hidden chain-of-thought truncation. The narrower claim is that Codex telemetry shows a GPT-5.5-specific fixed-token clustering anomaly that looks consistent with thresholded reasoning-budget behavior. But the implication for anyone using GPT-5.5 on high-stakes coding tasks is clear: the model may be silently cutting off its own reasoning process at a pre-set limit.

OpenAI has not responded to the issue publicly. The issue’s labels include “bug”, “model-behavior”, and “rate-limits”. That last label is notable: it hints that the clustering may be tied to token budget enforcement rather than to the model’s architecture.

What this means for AI builders

If you are building on GPT-5.5 Codex, you need to know whether your responses are stopping at the 516-reasoning-token boundary. The issue provides a diagnostic: query token_count events with reasoning_output_tokens by model, compare exact-value counts for 0, 516, 1034, and 1552, and compute the ratio of exact-516 responses to responses with 516 or more reasoning tokens, by model and by day.
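
The diagnostic described above can be sketched in a few lines of Python. This is an illustration, not the author's script: it assumes each token_count event has already been loaded as a dict, and the `model` and `day` keys are hypothetical names for however your logs record those fields. Only `reasoning_output_tokens` is named in the issue.

```python
from collections import Counter, defaultdict

# Exact values the issue suggests checking.
BOUNDARIES = (0, 516, 1034, 1552)


def exact_value_counts(records):
    """Per model, count responses whose reasoning tokens equal a boundary value."""
    counts = defaultdict(Counter)
    for r in records:
        tokens = r["reasoning_output_tokens"]
        if tokens in BOUNDARIES:
            counts[r["model"]][tokens] += 1
    return {model: dict(c) for model, c in counts.items()}


def exact_516_ratio(records, threshold=516):
    """Per (model, day): exact-threshold responses / responses at or above it."""
    exact = Counter()
    at_or_above = Counter()
    for r in records:
        tokens = r["reasoning_output_tokens"]
        if tokens >= threshold:
            key = (r["model"], r["day"])
            at_or_above[key] += 1
            if tokens == threshold:
                exact[key] += 1
    return {key: exact[key] / total for key, total in at_or_above.items()}


# Made-up records, for illustration only.
records = [
    {"model": "gpt-5.5", "day": "2026-05-01", "reasoning_output_tokens": 516},
    {"model": "gpt-5.5", "day": "2026-05-01", "reasoning_output_tokens": 516},
    {"model": "gpt-5.5", "day": "2026-05-01", "reasoning_output_tokens": 900},
    {"model": "gpt-5.5", "day": "2026-05-01", "reasoning_output_tokens": 1034},
    {"model": "gpt-5.5", "day": "2026-05-01", "reasoning_output_tokens": 120},
    {"model": "gpt-5.2", "day": "2026-05-01", "reasoning_output_tokens": 0},
    {"model": "gpt-5.2", "day": "2026-05-01", "reasoning_output_tokens": 700},
]

counts = exact_value_counts(records)
ratios = exact_516_ratio(records)
```

Groups with no responses at or above the threshold are left out of the ratio rather than reported as zero, so a quiet day does not read as a clean one.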

The data suggests the problem is not uniform. It worsened sharply from April to May, then partially recovered in June. That pattern could point to a configuration change, a routing update, or a degraded inference tier that was partially rolled back.

For the broader AI industry, this is a reminder that inference infrastructure choices directly shape model behavior. A reasoning-budget cap is not a model architecture decision. It is an operational decision. And it can silently degrade quality without any change to the model weights.

The open question

The issue asks the Codex team to investigate whether GPT-5.5 has a reasoning-budget, routing, truncation, fallback, or scheduler behavior that causes responses to terminate around these fixed token counts. It also asks whether exact 516 indicates a normal stopping point, a budget cap, a degraded tier, or another internal threshold.

Those are the right questions. Until OpenAI answers them, anyone relying on GPT-5.5 for complex reasoning tasks should treat a response that stops at exactly 516 reasoning tokens as a potential failure signal, not a completed answer.
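
In practice, that can be as simple as flagging responses for a second look. The sketch below is one hypothetical way to do it; the function and constant names are ours, and a response landing on one of these values is a reason to review it, not proof that it is wrong.

```python
# Fixed reasoning-token counts reported in the issue.
SUSPECT_REASONING_TOKEN_COUNTS = frozenset({516, 1034, 1552})


def is_suspect_stop(reasoning_output_tokens):
    """True if the reasoning-token count lands exactly on a reported boundary."""
    return reasoning_output_tokens in SUSPECT_REASONING_TOKEN_COUNTS


def split_for_review(token_counts):
    """Partition reasoning-token counts into (ok, needs_review) lists."""
    ok, needs_review = [], []
    for tokens in token_counts:
        (needs_review if is_suspect_stop(tokens) else ok).append(tokens)
    return ok, needs_review


ok, needs_review = split_for_review([120, 516, 517, 1034, 2000])
```

A flagged response could then be retried, checked against tests, or routed to a human reviewer, depending on how much the task matters.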
