The shortcomings of conventional testing
Part of the reason AI companies have been slow to respond to the growing inadequacy of benchmarks is that, for a long time, the test-scoring approach worked remarkably well.
A significant early achievement in contemporary AI was the ImageNet challenge, which served as a precursor to modern benchmarks. Launched in 2010 as an open invitation to researchers, this database contained over 3 million images for AI systems to classify into 1,000 distinct categories.
Importantly, the test was entirely agnostic about methods: any algorithm that scored well earned credibility, no matter how it worked. When AlexNet broke through in 2012, with a then-unorthodox form of GPU training, it became one of the foundational results of modern AI. Few had predicted that AlexNet’s convolutional neural networks would be the key to advancing image recognition, but after its stellar performance, no one disputed it. (Ilya Sutskever, one of AlexNet’s creators, would later cofound OpenAI.)
Part of what made the challenge so effective was that there was little practical difference between ImageNet’s object-classification task and the real-world job of getting a computer to recognize an image. Even when methods were in dispute, no one doubted that the highest-scoring model would have an advantage in an actual image-recognition scenario.
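In rough terms, a benchmark of this kind is just a scoring loop: feed each test item to the model, compare its answer against the ground-truth label, and report the fraction it gets right. The short Python sketch below is an illustration of that method-agnostic scoring logic, not the actual ImageNet evaluation harness; the function and type names are invented for the example, and details like top-5 scoring are left out.

```python
from typing import Callable, Iterable, Tuple

# A "model" here is any callable that maps an image to a predicted label.
# The harness never looks inside it, so any approach can compete.
Model = Callable[[bytes], str]

def benchmark_accuracy(model: Model,
                       labeled_images: Iterable[Tuple[bytes, str]]) -> float:
    """Return top-1 accuracy: the fraction of images the model labels correctly."""
    correct = 0
    total = 0
    for image, true_label in labeled_images:
        total += 1
        if model(image) == true_label:
            correct += 1
    return correct / total if total else 0.0
```

Because the harness compares only outputs against labels, a hand-engineered feature pipeline and a convolutional neural network are ranked by exactly the same rule.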
In the 12 years since, however, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That breadth makes it hard to be rigorous about what a specific benchmark measures, which in turn makes it hard to use the findings responsibly.
Where things falter
Anka Reuel, a PhD student at Stanford whose research focuses on the benchmark problem, believes the evaluation issue stems from this push for generality. “We’ve progressed from task-specific models to general-purpose ones,” Reuel says. “It’s no longer just about a single task, but a multitude of tasks, making evaluation increasingly difficult.”
Like Jacobs at the University of Michigan, Reuel believes that “the chief concern with benchmarks is validity, even more than practical execution,” adding, “That’s where many issues arise.” For a task as complex as coding, it is nearly impossible to include every conceivable scenario in a problem set. As a result, it is hard to tell whether a model scores higher because it is genuinely better at coding or because it has effectively gamed the problem set. And with so much pressure on developers to post record scores, shortcuts are hard to resist.
For developers, the hope is that success on many specific benchmarks will add up to a generally capable model. But the dynamics of agentic AI mean that a single AI system can encompass a complex range of different models, making it harder to evaluate whether improvement on a specific task will translate into broader generalization. “There are simply many more variables to adjust,” says Sayash Kapoor, a computer scientist at Princeton and a noted critic of careless practices in the AI industry. “When it comes to agents, they seem to have abandoned best practices for evaluation.”