A team of researchers led by Suketu Patel put several leading large language models through a classic psychology experiment and found something that should unsettle anyone building on top of these systems. The models failed the Stroop task once the lists of items grew longer. Accuracy that was high on short lists collapsed on longer ones, in one case from 91% to 15%. The results, published in PNAS Nexus, point to a fundamental limitation in how transformer attention handles sustained cognitive control.
The Stroop task is simple. A color word like “red” appears in colored ink. Sometimes the word and ink match. Sometimes they conflict, like the word “red” printed in blue ink. The instruction is to name the ink color, not read the word. For humans, this creates a conflict because reading is automatic. The brain must suppress the habitual response and focus on the goal. Psychologists call this executive control.
Patel and his co-authors Hongbin Wang and Jin Fan wanted to see whether transformer-based models handle this conflict the same way humans do. They tested GPT-4o, GPT-5, Claude 3.5 Sonnet, Claude Opus 4.1, and Gemini 2.5. On short lists of five color words, the models performed well, even on mismatched items. GPT-4o hit 91% accuracy.
Then the researchers lengthened the lists. At ten words, GPT-4o fell to 57%. At forty words, accuracy dropped to 15%. Claude 3.5 Sonnet held stable through twenty words, then crashed to 24% at forty. The other models showed similar patterns. When the researchers mixed matching and mismatching words in the same list, accuracy for the mismatched items approached zero.
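To make the setup concrete, here is a minimal Python sketch of how lists of varying length and mix could be generated for a text-only model. The article does not say how the researchers encoded ink color as text, so the prompt wording, the four-color set, and the names `make_stroop_list` and `render_prompt` are all assumptions made for illustration, not the paper's method.

```python
import random

# Assumed color set; the study's actual stimuli are not given in the article.
COLORS = ["red", "blue", "green", "yellow"]

def make_stroop_list(n_items, incongruent_ratio=1.0, seed=0):
    """Return n_items (word, ink) pairs.

    incongruent_ratio is the probability that an item's word
    differs from its ink color (1.0 = every item conflicts).
    """
    rng = random.Random(seed)  # seeded so a run can be repeated
    trials = []
    for _ in range(n_items):
        ink = rng.choice(COLORS)
        if rng.random() < incongruent_ratio:
            # Mismatched item: pick any color word except the ink color.
            word = rng.choice([c for c in COLORS if c != ink])
        else:
            # Matched item: word and ink agree.
            word = ink
        trials.append((word, ink))
    return trials

def render_prompt(trials):
    """Describe each item in plain text (hypothetical encoding)."""
    lines = [
        f'{i}. the word "{word}" printed in {ink} ink'
        for i, (word, ink) in enumerate(trials, 1)
    ]
    header = "For each item, name the ink color, not the word."
    return header + "\n" + "\n".join(lines)

# Lists at the lengths mentioned in the article.
for length in (5, 10, 20, 40):
    prompt = render_prompt(make_stroop_list(length, incongruent_ratio=1.0))
    print(length, "items,", len(prompt), "characters")
```

Setting `incongruent_ratio` to something like 0.5 would produce the mixed lists described above, where matching and mismatching items share one list.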
The models were not just getting confused. They were reverting to their default behavior: reading the word instead of naming the ink color. The instruction to suppress the automatic response degraded as the sequence grew. The systems lacked the executive control to maintain the goal across longer contexts.
This is not a minor benchmark failure. It is a structural weakness in how transformer attention distributes information. The self-attention mechanism in these models does not have a persistent goal state that can override learned associations across a long sequence. Each token competes for attention, and as the sequence grows, the original instruction gets diluted. The model defaults to the most statistically frequent behavior, which is reading the word.
Humans face the same conflict. Reading is automatic for literate adults. Yet most people can sustain high accuracy on the Stroop task for long lists. The brain has dedicated circuits for executive control, primarily in the prefrontal cortex, that maintain goal representations and inhibit competing responses. Transformers have no equivalent.
The study’s authors argue that this points to a fundamental limitation in current architectures. The paper, titled “Deficient executive control in transformer attention,” is careful not to overclaim. It does not say models cannot be improved. It says the current approach to attention, based on learned patterns rather than sustained goal maintenance, produces a specific failure mode that becomes visible under cognitive load.
This matters for anyone deploying these models in production. A chatbot that answers a single question well may fail in a multi-turn conversation where the original instruction must be maintained across many exchanges. An agent that follows a five-step plan may lose the thread at step twenty. A code assistant that generates correct individual functions may produce incoherent results across a long file.
The industry has been chasing context length as a metric. Models now claim 128k, 1M, even 10M token context windows. The implicit assumption is that longer context means better reasoning. The Stroop results cut against that assumption. Filling a longer context may degrade performance, because the model cannot maintain its goal across the full sequence. The attention mechanism spreads itself thin.
This is not a problem that more data or bigger models will solve on their own. The transformer architecture does not have a built-in mechanism for goal persistence. The instruction is just another token in the sequence. It competes with every other token for attention. As the sequence grows, the signal from the original instruction attenuates.
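The dilution argument can be illustrated with a toy calculation. The Python sketch below assumes a single softmax over one sequence in which the instruction token scores higher than every other token by a fixed margin. That setup, the function name `instruction_weight`, and the margin of 3.0 are illustrative assumptions. Real models have many heads and layers with learned scores, and this is not the analysis from the paper.

```python
import math

def instruction_weight(seq_len, logit_advantage=3.0):
    """Softmax weight on one instruction token.

    Toy assumption: every other token has score 0 and the
    instruction token has score `logit_advantage`.
    """
    boosted = math.exp(logit_advantage)
    # The other (seq_len - 1) tokens each contribute exp(0) = 1.
    return boosted / (boosted + (seq_len - 1))

# The instruction's share of attention shrinks as the sequence grows,
# even though its score advantage never changes.
for n in (5, 10, 20, 40, 1000):
    print(n, round(instruction_weight(n), 3))
```

Under these assumptions the weight falls steadily with sequence length because the denominator grows with every added token while the numerator stays fixed.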
The research community has been exploring alternatives. State-space models like Mamba and recurrent architectures like RWKV use a different approach to memory that may preserve goal information across longer sequences. Some researchers are experimenting with explicit working memory modules that maintain a persistent goal representation. The Stroop results add urgency to these efforts.
For builders, the takeaway is practical. Do not assume that a model that performs well on short tasks will generalize to longer ones. Test for sustained attention, not just single-shot accuracy. The Stroop task is a cheap diagnostic. Run it on your model before deploying it in any application that requires maintaining a goal across many steps.
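A diagnostic of this kind needs a scoring step that separates word-reading errors from other mistakes, since reverting to the word is the failure the study describes. The Python sketch below is one way to do that. The function name `score_stroop`, the answer format (one color name per item), and the sample data are hypothetical, and collecting the answers from a model is left out.

```python
from collections import Counter

def score_stroop(trials, answers):
    """Score model answers against (word, ink) trials.

    trials:  list of (word, ink) pairs
    answers: list of color names, one per trial
    """
    if len(trials) != len(answers):
        raise ValueError("need exactly one answer per trial")
    counts = Counter()
    for (word, ink), answer in zip(trials, answers):
        answer = answer.strip().lower()
        kind = "congruent" if word == ink else "incongruent"
        counts[kind, "total"] += 1
        if answer == ink:
            counts[kind, "correct"] += 1
        elif answer == word:
            # The model read the word instead of naming the ink.
            counts["word_reading"] += 1

    def ratio(num, den):
        return num / den if den else None  # None when no such trials

    n_incongruent = counts["incongruent", "total"]
    return {
        "congruent_accuracy": ratio(counts["congruent", "correct"],
                                    counts["congruent", "total"]),
        "incongruent_accuracy": ratio(counts["incongruent", "correct"],
                                      n_incongruent),
        "word_reading_rate": ratio(counts["word_reading"], n_incongruent),
    }

# Made-up example: three mismatched items and one matched item.
trials = [("red", "blue"), ("green", "green"),
          ("blue", "yellow"), ("yellow", "red")]
answers = ["blue", "green", "blue", "green"]
print(score_stroop(trials, answers))
```

In the made-up example, one of the three mismatched items is answered correctly and one is a word-reading error. Tracking `word_reading_rate` at several list lengths would show whether a model is drifting back toward reading the word rather than failing in some other way.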
The paper closes with an observation that is worth sitting with. Human executive control is not a learned pattern. It is a separate cognitive system that evolved to regulate attention in the presence of competing demands. Transformers do not have this system. They have attention, but not control of attention. That is the gap this study measures.