Study finds AI models fail Stroop test as task length increases
An international team of researchers has tested leading large language models using the Stroop test, a classic psychological task designed to measure attention and cognitive control.
The study, published in PNAS Nexus, found that AI performance declines sharply as task length increases, in some cases to the point of near-complete failure.
The Stroop test presents participants with colour words printed in incongruent ink colours and asks them to identify the ink colour while ignoring the word itself. For example, the word “red” printed in blue ink requires the response “blue.” Humans typically perform this task reliably, even with long sequences, as they can suppress the automatic tendency to read the word.
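As a rough illustration of the task structure described above, the following Python sketch builds a sequence of incongruent word/ink pairs and scores a list of answers against the ink colours. It is a minimal sketch, not the researchers' protocol: the colour list, the function names `make_trials` and `score`, and the representation of each stimulus as a (word, ink) pair are all assumptions made for illustration, and this article does not describe how the study presented ink colour to the models.

```python
import random

# Hypothetical colour set; the study's actual stimuli are not given in the article.
COLOURS = ["red", "blue", "green", "yellow"]

def make_trials(n, rng):
    """Return n (word, ink) pairs in which the ink never matches the word."""
    trials = []
    for _ in range(n):
        word = rng.choice(COLOURS)
        ink = rng.choice([c for c in COLOURS if c != word])
        trials.append((word, ink))
    return trials

def score(trials, responses):
    """Return the fraction of responses that name the ink colour."""
    correct = sum(
        1 for (_, ink), answer in zip(trials, responses)
        if answer.strip().lower() == ink
    )
    return correct / len(trials)

trials = make_trials(5, random.Random(0))
ink_answers = [ink for _, ink in trials]     # a respondent who follows the instruction
word_answers = [word for word, _ in trials]  # a respondent who reads the word instead

print(score(trials, ink_answers))   # 1.0
print(score(trials, word_answers))  # 0.0
```

Because every pair is incongruent, a respondent who simply reads the words scores zero, which is the failure mode the task is designed to expose.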
The research team, led by Suketu Patel, evaluated several leading models, including GPT-4o, Claude 3.5 Sonnet, GPT-5, Claude Opus 4.1, and Gemini 2.5.
With short sequences of five words, all models performed well. However, accuracy declined significantly as the task length increased. GPT-4o achieved 91% accuracy at five words, dropped to 57% at ten words, and fell to just 15% at forty words. Claude 3.5 Sonnet maintained stronger performance up to twenty words, after which its accuracy dropped to 24%.
According to the authors, the models tend to “forget” the instruction over longer sequences and revert to the behaviour most strongly represented in their training: reading the words rather than attending to the ink colour. The researchers argue that this behaviour distinguishes current AI systems from humans, who can sustain stable voluntary attention over an extended task.
By Tamilla Hasanova