IOupdate | IT News and Selfhosting

    Artificial Intelligence

    How to build a better AI benchmark

By Andy · May 8, 2025 (Updated: May 8, 2025) · 3 min read

    The shortcomings of conventional testing

AI companies have been slow to respond to the growing inadequacy of benchmarks, partly because the test-and-score methodology worked so well for so long.

    A significant early achievement in contemporary AI was the ImageNet challenge, which served as a precursor to modern benchmarks. Launched in 2010 as an open invitation to researchers, this database contained over 3 million images for AI systems to classify into 1,000 distinct categories.
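ImageNet-style classification is typically scored as top-1 or top-5 accuracy: the fraction of images whose true class appears among the model's best one or five guesses. A minimal sketch of that scoring (toy data and hypothetical class IDs, not the actual challenge harness):

```python
def top_k_accuracy(predictions, labels, k=1):
    """Fraction of examples whose true label is among the top-k predicted classes.

    `predictions` is a list of class rankings (best guess first);
    `labels` is the list of true class IDs.
    """
    hits = sum(1 for ranked, label in zip(predictions, labels) if label in ranked[:k])
    return hits / len(labels)

# Toy example: 4 images, each with a ranking over hypothetical class IDs.
preds = [[3, 1, 2], [0, 2, 1], [2, 0, 3], [1, 3, 0]]
labels = [3, 2, 2, 0]
print(top_k_accuracy(preds, labels, k=1))  # 0.5
print(top_k_accuracy(preds, labels, k=2))  # 0.75
```

Because the score depends only on the ranked outputs, any algorithm can compete on equal footing, which is exactly what made the challenge method-agnostic.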

    Importantly, the test was entirely indifferent to the methods employed, allowing any successful algorithm to gain credibility irrespective of its underlying mechanisms. When AlexNet emerged in 2012, utilizing a then-unorthodox form of GPU training, it established a critical benchmark in modern AI. Few anticipated that AlexNet’s convolutional neural networks would be the key to advancing image recognition—but after its stellar performance, no one contested it. (Ilya Sutskever, one of AlexNet’s creators, would later cofound OpenAI.)

    A significant factor in the challenge’s effectiveness was the minimal practical distinction between ImageNet’s object classification task and the actual process of computer image recognition. Even amidst methodological disagreements, no one questioned that the top-performing model would have an edge in a real-world image recognition scenario.

However, in the 12 years since, AI researchers have extended that method-agnostic approach to increasingly generalized tasks. SWE-Bench is frequently used as a proxy for broader coding skill, while various exam-style benchmarks stand in for reasoning capability. This breadth makes it harder to define rigorously what a particular benchmark actually measures, which in turn makes it harder to apply its results responsibly.
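A coding benchmark of this kind ultimately reduces to a single headline number: the fraction of tasks whose generated patch passes that task's tests. A minimal sketch of that aggregation, with hypothetical task IDs rather than any real benchmark's harness:

```python
def resolved_rate(results):
    """Fraction of benchmark tasks whose generated patch passed the task's tests.

    `results` maps a task ID to True (tests passed) or False (tests failed).
    """
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Hypothetical run over three tasks.
run = {"task-001": True, "task-002": False, "task-003": True}
print(f"{resolved_rate(run):.1%}")
```

The number is easy to compute and compare, but it says nothing about *why* a task passed, which is precisely the validity gap the next section describes.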

    Where things falter

    Anka Reuel, a PhD student concentrating on the benchmark dilemma in her research at Stanford, believes that the evaluation issue stems from this push for generality. “We’ve progressed from task-specific models to general-purpose ones,” Reuel states. “It’s no longer just about a single task, but a multitude of tasks, making evaluation increasingly difficult.”

    Like Jacobs from the University of Michigan, Reuel asserts that “the chief concern with benchmarks is validity, even more than practical execution,” adding, “That’s where many issues arise.” For a complex task like coding, it is nearly impossible to cover every possible scenario within a problem set. Consequently, it becomes challenging to determine if a model is scoring higher due to superior coding skills or simply because it has effectively manipulated the problem set. With immense pressure on developers to achieve record scores, shortcuts can be tempting.
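One imperfect way to probe whether a high score reflects skill or a manipulated problem set is to compare performance on the public set against a private, held-out set of comparable difficulty; a large gap is a warning sign. A toy sketch with illustrative numbers only:

```python
def contamination_gap(public_scores, heldout_scores):
    """Mean score on the public problem set minus mean score on a private,
    held-out set of comparable difficulty. A large positive gap is one
    (imperfect) signal that the public set may have leaked into training."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(public_scores) - mean(heldout_scores)

# Illustrative per-problem scores, not real benchmark data.
gap = contamination_gap([0.9, 0.85, 0.95], [0.6, 0.55, 0.65])
print(round(gap, 3))  # 0.3
```

Held-out sets only mitigate the problem, of course; they cannot fix a benchmark whose task definition was never valid to begin with.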

    For developers, the belief is that success across numerous specific benchmarks will culminate in a generally capable model. Yet, the dynamics of agentic AI mean that a single AI system can incorporate a complex range of different models, complicating the evaluation of whether improvements in specific tasks will translate to broader generalization. “There are simply many more variables to adjust,” asserts Sayash Kapoor, a computer scientist at Princeton and a noted critic of careless practices in the AI industry. “When it comes to agents, they seem to have abandoned best practices for evaluation.”
