Since the launch of ChatGPT by OpenAI in 2022, AI companies have been engaged in a competitive race to develop larger models, leading to significant investments in building data centers. By late last year, however, it was becoming clear that the advantages of scaling up models were diminishing. The performance of OpenAI’s largest model, GPT-4.5, reinforced this notion.
This shift in focus is prompting researchers to develop AI that “thinks” more like humans. Instead of merely increasing model size, they are giving models more time to reason through problems. In 2022, a Google team introduced the chain-of-thought (CoT) technique, which enables large language models (LLMs) to solve problems step by step. This methodology underlies the remarkable performance of new reasoning models such as OpenAI’s o3, Google’s Gemini 2.5, Anthropic’s Claude 3.7, and DeepSeek’s R1. AI research papers now frequently use terms like “thought,” “thinking,” and “reasoning,” as cognitively inspired techniques proliferate.

“Since around spring of last year, it has been evident to serious AI researchers that the upcoming revolution will focus not on scaling but on enhancing cognition,” states Igor Grossmann, a psychology professor at the University of Waterloo, in Canada. “The next leap will center around improved cognition.”
Understanding AI Reasoning
LLMs fundamentally rely on statistical probabilities to predict the next token, the basic unit of the text they process. The CoT method, however, demonstrated that prompting models to lay out a series of “reasoning” steps before arriving at a conclusion greatly improves their performance on math and logic problems.

“It was unexpected that this approach would yield such impressive results,” says Kanishk Gandhi, a computer science graduate student at Stanford University. Since that discovery, researchers have expanded on the technique, introducing variants such as “tree of thought,” “diagram of thought,” “logic of thought,” and “iteration of thought,” among others.
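To make the contrast concrete, here is a minimal sketch of direct prompting versus chain-of-thought prompting. The `query_model` placeholder, the prompt wording, and the example question are assumptions made for illustration; they are not drawn from the original CoT work.

```python
# Minimal sketch: direct prompting vs. chain-of-thought (CoT) prompting.
# `query_model` is a placeholder for whatever LLM API is in use.

def query_model(prompt: str) -> str:
    """Stand-in for a call to an LLM; wire this up to a real model endpoint."""
    raise NotImplementedError

QUESTION = (
    "A train leaves at 2:40 pm and the trip takes 1 hour and 35 minutes. "
    "When does it arrive?"
)

# Direct prompting: ask for the answer immediately.
direct_prompt = f"{QUESTION}\nReply with the arrival time only."

# Chain-of-thought prompting: ask the model to write out intermediate steps
# before committing to a final answer.
cot_prompt = (
    f"{QUESTION}\n"
    "Work through the problem step by step, showing each intermediate "
    "calculation, then give the final answer on a new line starting with "
    "'Answer:'."
)

if __name__ == "__main__":
    for name, prompt in [("direct", direct_prompt), ("chain of thought", cot_prompt)]:
        print(f"--- {name} ---\n{prompt}\n")
```

In practice, both prompts would be sent through something like `query_model`; the extra tokens spent spelling out intermediate steps are what buy the improved accuracy on math and logic problems.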
Major model developers are also incorporating reinforcement learning into their training pipelines: a base model generates CoT responses, and those that yield the best answers are rewarded. Through this process, models have adopted cognitive strategies akin to human problem-solving, such as breaking a problem into smaller tasks and backtracking to correct earlier missteps, Gandhi notes.
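A heavily simplified sketch of that reward loop follows: sample several CoT responses, verify each final answer against a known ground truth, and reward the ones that check out. Real pipelines are far more involved, and the `generate_cot` stand-in, the answer format, and the scoring rule here are assumptions made for illustration.

```python
# Sketch of reinforcement learning with verifiable rewards applied to CoT:
# sample step-by-step responses, check the final answer, reward correct ones.
import random
import re

def generate_cot(question: str) -> str:
    """Placeholder for sampling one chain-of-thought response from a model."""
    return random.choice([
        "Step 1: 17 + 25 = 42.\nAnswer: 42",
        "Step 1: 17 + 25 = 43.\nAnswer: 43",  # an incorrect rollout
    ])

def extract_answer(response: str) -> str | None:
    match = re.search(r"Answer:\s*(.+)", response)
    return match.group(1).strip() if match else None

def reward(response: str, ground_truth: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches, otherwise 0.0."""
    return 1.0 if extract_answer(response) == ground_truth else 0.0

if __name__ == "__main__":
    question, truth = "What is 17 + 25?", "42"
    rollouts = [generate_cot(question) for _ in range(4)]
    # In real training, these rewards would drive a policy update to the model;
    # here we only report them.
    for rollout in rollouts:
        print(f"reward={reward(rollout, truth)}\n{rollout}\n")
```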
However, the way these models are trained can introduce problems, warns Michael Saxon, a graduate student at the University of California, Santa Barbara. Reinforcement learning requires a way of verifying that a response is correct in order to assign a reward, so reasoning models are trained mostly on easily verifiable tasks, such as math and logic puzzles. As a result, they tend to approach every query as if it were an intricate reasoning challenge, which can lead to overthinking, Saxon says.

In a recent experiment documented in a pre-print paper, he and his team gave a range of deliberately simple tasks to various AI models and found that reasoning models used significantly more tokens to reach correct answers than conventional LLMs did. In some cases, the overanalysis even led to worse results. Interestingly, Saxon found that addressing the models as one would an overthinking human helped: the researchers instructed a model to estimate the number of tokens it would need to solve the task, then gave it regular updates on how many tokens remained before it had to present an answer.

“This has been a consistent insight,” Saxon remarks. “Even though these models do not truly function like humans, techniques inspired by our cognition can yield unexpectedly effective results.”
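The pre-print does not spell out a drop-in recipe, but the intervention can be sketched as a simple prompting loop: ask the model to estimate its own token budget up front, then remind it periodically how much of that budget remains. The helper names and reminder wording below are assumptions, not the team’s actual prompts.

```python
# Sketch of the "address it like an overthinker" intervention: have the model
# estimate a token budget, then send countdown reminders as it reasons.
# Prompt wording and function names are illustrative assumptions.

def budget_estimation_prompt(task: str) -> str:
    return (
        f"Task: {task}\n"
        "Before solving, estimate how many tokens of reasoning you will need. "
        "Reply with a single integer."
    )

def reminder(tokens_left: int) -> str:
    return (
        f"Note: roughly {tokens_left} tokens of reasoning remain before you "
        "must state your final answer."
    )

def controller_messages(task: str, estimated_budget: int, checkpoint: int = 50):
    """Yield the messages a controller would interleave with the model's reasoning."""
    yield budget_estimation_prompt(task)
    remaining = estimated_budget
    while remaining > 0:
        yield reminder(remaining)
        remaining -= checkpoint
    yield "Budget exhausted: give your final answer now."

if __name__ == "__main__":
    for message in controller_messages("Is 7 an odd number?", estimated_budget=150):
        print(message, "\n")
```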
Challenges in AI Reasoning
There remain significant gaps in the reasoning capabilities of these models.
Martha Lewis, an assistant professor of neurosymbolic AI at the University of Amsterdam,
recently compared how LLMs and humans reason by analogy, a capability believed to underpin much creative thinking.

In standard analogy-reasoning tests, both the AI models and humans performed well. But when presented with novel versions of the tests, AI performance dropped sharply compared with human performance. Lewis suggests the models had been trained largely on data reflecting the standard test versions and were relying on surface-level pattern recognition rather than genuine reasoning. The testing was conducted on OpenAI’s earlier models GPT-3, GPT-3.5, and GPT-4, and Lewis posits that more recently developed reasoning models might fare better. Nonetheless, the findings highlight the need for careful assessment of AI’s cognitive abilities.

“The fluency of output produced by these models can easily mislead observers into believing they possess greater reasoning capabilities than they actually do,” Lewis cautions. “It is crucial not to label these models as reasoning agents without rigorously defining and testing what reasoning entails in specific contexts.”
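One common format in this line of research is the letter-string analogy; the sketch below contrasts a familiar item with a “novel” variant that permutes the alphabet so that memorized surface patterns no longer help. The specific items are invented for illustration and are not taken from Lewis’s test set.

```python
# Illustration of a standard vs. novel analogy problem of the kind used to
# probe analogical reasoning in LLMs. The items are invented; only the
# contrast between a familiar and an unfamiliar setting matters.

standard_problem = (
    "If abc changes to abd, what does ijk change to?"
    # Familiar alphabet: the pattern "increment the last letter" is likely
    # well represented in web-scale training data.
)

novel_problem = (
    "Consider a fictional alphabet in this order: x g t e p l u.\n"
    "If x g t changes to x g e, what does e p l change to?"
    # The same abstract rule, but the permuted alphabet means surface pattern
    # matching against real English letter order no longer helps.
)

if __name__ == "__main__":
    print("Standard version:\n" + standard_problem + "\n")
    print("Novel version:\n" + novel_problem)
```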
Another critical area where AI’s reasoning falls short is understanding others’ mental states, known as theory of mind. Several studies have shown that LLMs can succeed on classic psychological assessments of this ability, but researchers at the Allen Institute for AI (AI2) suspected that this success might stem from the tests being present in the training data. So the researchers created a new set of theory-of-mind evaluations based on real-life scenarios, designed to measure a model’s capacity to infer mental states, predict how those states influence behavior, and judge whether an action is reasonable.

For instance, a model might learn that a person picks up a closed packet of chips in a store, unaware that the contents are moldy. It is then asked whether the person knows the chips are moldy, whether they will buy the chips, and whether doing so would be reasonable. The AI2 team found that while models excelled at inferring mental states, they struggled to predict behavior and to evaluate reasonableness.
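In code, one of these evaluation items might be organized roughly as a scenario plus three probes, one each for mental-state inference, behavior prediction, and judgment of reasonableness. The data structure, the name “Dana,” and the wording below are assumptions for illustration, not the AI2 benchmark’s actual schema.

```python
# Rough sketch of a theory-of-mind evaluation item in the spirit of the work
# described above: one scenario, three probes. Structure and wording are
# illustrative, not the benchmark's actual format.
from dataclasses import dataclass

@dataclass
class TheoryOfMindItem:
    scenario: str
    mental_state_probe: str  # does the model track what the person knows?
    behavior_probe: str      # does that knowledge inform the predicted action?
    judgment_probe: str      # is the predicted action reasonable?

item = TheoryOfMindItem(
    scenario=(
        "In a store, Dana picks up a sealed bag of chips. "
        "Unknown to Dana, the chips inside are moldy."
    ),
    mental_state_probe="Does Dana know the chips are moldy?",
    behavior_probe="Will Dana buy the chips?",
    judgment_probe="Would buying the chips be a reasonable thing for Dana to do?",
)

if __name__ == "__main__":
    for probe in (item.mental_state_probe, item.behavior_probe, item.judgment_probe):
        print(f"{item.scenario}\nQuestion: {probe}\n")
```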
Research scientist Ronan Le Bras believes this is because the models compute the likelihood of actions from their broad training data: they recognize that it is improbable for someone to buy moldy chips. They can infer mental states, but they don’t seem to factor those states in when predicting behavior. The researchers found, however, that prompting the models to recall their mental-state predictions, or applying targeted CoT prompts that push them to consider the character’s awareness, markedly improved results.
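That fix can be approximated as a two-turn exchange: first elicit the mental-state inference, then restate it when asking for the behavior prediction so that the model actually conditions on it. The prompt wording below is an assumption, not the researchers’ exact phrasing.

```python
# Sketch of the remediation described above: elicit the mental-state inference
# first, then feed it back explicitly when asking for the behavior prediction.
# Prompt wording is illustrative.

SCENARIO = (
    "In a store, Dana picks up a sealed bag of chips. "
    "Unknown to Dana, the chips inside are moldy."
)

turn_1 = f"{SCENARIO}\nFirst question: does Dana know the chips are moldy?"

def turn_2(mental_state_answer: str) -> str:
    # Restating the model's own inference keeps it in view for the prediction.
    return (
        f"{SCENARIO}\n"
        f"You previously concluded: {mental_state_answer}\n"
        "Keeping that in mind, will Dana buy the chips, and would that be "
        "reasonable?"
    )

if __name__ == "__main__":
    print(turn_1, "\n")
    print(turn_2("Dana does not know the chips are moldy."))
```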
Yuling Gu, a predoctoral young investigator at AI2, states that it’s essential for models to apply appropriate reasoning patterns to specific challenges. “Our goal is that such reasoning will become more deeply integrated into these models in the future,” she concludes.
Enhancing AI Performance Through Metacognition
To enable models to reason flexibly across diverse tasks, a foundational shift may be necessary, according to Grossmann from the University of Waterloo. Last November, he co-authored
a paper highlighting the importance of instilling metacognition in models, defined as “the ability to reflect upon and regulate one’s thought processes.”

Current models, according to Grossmann, are akin to “professional bullshit generators,” offering a best guess for any question without acknowledging or expressing uncertainty. They struggle to tailor responses to specific contexts or to consider multiple perspectives, attributes that humans manage instinctively. Endowing models with metacognitive skills would enhance their performance and make their reasoning processes clearer, Grossmann affirms.

Achieving this poses challenges, however: it requires either a substantial effort to label training data for aspects like certainty or relevance, or the addition of new modules designed to evaluate the confidence of reasoning steps. Reasoning models already demand considerably more computational resources and energy than standard LLMs, and these additional processes would likely exacerbate that problem. “Such measures could jeopardize many small companies,” Grossmann cautions. “The environmental implications should not be overlooked.”

Nevertheless, he remains optimistic that mimicking the cognitive processes inherent in human intelligence represents the most viable path forward, even if current endeavors remain simplistic. “We lack an alternative way of thinking,” he emphasizes. “We can only create models grounded in what we understand conceptually.”
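Of the two routes just mentioned, the second, a separate module that scores the confidence of each reasoning step, could look roughly like the wrapper below. The scoring heuristic, the threshold, and the names are placeholders invented for illustration, not a description of any existing system.

```python
# Sketch of a metacognitive wrapper: a separate module assigns a confidence
# score to each reasoning step, and low-confidence answers get flagged.
# The heuristic and the 0.5 threshold are placeholder assumptions.
from dataclasses import dataclass

@dataclass
class ScoredStep:
    text: str
    confidence: float  # 0.0 (pure guess) to 1.0 (certain)

def score_step(step: str) -> float:
    """Toy confidence estimator; a real one might be a trained verifier model."""
    hedges = ("probably", "roughly", "i think", "maybe")
    return 0.4 if any(h in step.lower() for h in hedges) else 0.9

def metacognitive_answer(steps: list[str], answer: str) -> str:
    scored = [ScoredStep(s, score_step(s)) for s in steps]
    weakest = min(scored, key=lambda s: s.confidence)
    if weakest.confidence < 0.5:
        return (f"{answer} (low confidence: the step '{weakest.text}' is "
                "uncertain and may need rechecking)")
    return answer

if __name__ == "__main__":
    steps = [
        "The train departs at 2:40 pm.",
        "Maybe the trip takes about 95 minutes.",
    ]
    print(metacognitive_answer(steps, "It arrives at 4:15 pm."))
```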