AI benchmarks are broken. Here’s what we need instead.
For decades, artificial intelligence has been evaluated through the question of whether machines outperform humans. From chess to advanced math, from coding to essay writing, the performance of AI models and applications is tested against that of individual humans completing tasks. This framing is seductive: An AI vs. human comparison on isolated problems with clear right or wrong answers is easy to standardize, compare, and optimize. It generates rankings and headlines.
But there's a problem: AI is almost never used in the way it is benchmarked. Although researchers and industry have started to improve benchmarking by moving beyond static tests to more dynamic evaluation methods, these innovations resolve only part of the issue. That's because they still evaluate AI's performance outside the human teams and organizational workflows where its real-world performance ultimately unfolds.
While AI is evaluated at the task level in a vacuum, it is used in messy, complex environments where it usually interacts with more than one person. Its performance (or lack thereof) emerges only over extended periods of use. This misalignment leaves us misunderstanding AI's capabilities, overlooking systemic risks, and misjudging its economic and social consequences.
The Misalignment Between Benchmark and Real-World Performance
To illustrate this point, let's consider the example of medical AI models. In the radiology units of hospitals from the heart of California to the outskirts of London, I witnessed staff using highly ranked radiology AI applications. Repeatedly, it took them extra time to interpret AI's outputs alongside hospital-specific reporting standards and nation-specific regulatory requirements. What appeared as a productivity-enhancing AI tool when tested in a vacuum introduced delays in practice.
It soon became clear that the benchmark tests on which medical AI models are assessed do not capture how medical decisions are actually made. Hospitals rely on multidisciplinary teams—radiologists, oncologists, physicists, nurses—who jointly review patients. Treatment planning rarely hinges on a static decision; it evolves as new information emerges over days or weeks. Decisions often arise through constructive debate and trade-offs between professional standards, patient preferences, and the shared goal of long-term patient well-being. No wonder even highly scored AI models struggle to deliver the promised performance once they encounter the complex, collaborative processes of real clinical care.
The Consequences of Misaligned Benchmarks
When high benchmark scores fail to translate into real-world performance, even the most highly scored AI is soon abandoned to what I call the "AI graveyard." The costs are significant: time, effort, and money are wasted. Over time, repeated experiences like this erode organizational confidence in AI and, in critical settings such as health care, may erode broader public trust in the technology as well.
Building Better Tests: The HAIC Approach
To close the gap between benchmark and real-world performance, we must pay attention to the actual conditions in which AI models will be used. The critical questions: Can AI function as a productive participant within human teams? And can it generate sustained, collective value?
Through my research on AI deployment across multiple sectors, I have seen a number of organizations already moving—deliberately and experimentally—toward the HAIC benchmarks I favor. HAIC benchmarks reframe current benchmarking in four ways:
- From individual and single-task performance to team and workflow performance: Shifting the unit of analysis.
- From one-off testing with right/wrong answers to long-term impacts: Expanding the time horizon.
- From correctness and speed to organizational outcomes, coordination quality, and error detectability: Expanding outcome measures.
- From isolated outputs to upstream and downstream consequences: Capturing system effects.
Shifting the Unit of Analysis
In one UK hospital system between 2021 and 2024, for example, the question expanded from whether a medical AI application improves diagnostic accuracy to how the presence of AI within the hospital's multidisciplinary teams affects not only accuracy but also coordination and deliberation. The hospital specifically assessed coordination and deliberation in teams working with AI and in teams working without it. Multiple stakeholders, inside and outside the hospital, agreed on what to measure: how AI influences collective reasoning, whether it surfaces overlooked considerations, whether it strengthens or weakens coordination, and whether it changes established risk and compliance practices.
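A team-level comparison of this kind can be sketched in a few lines of Python. This is only an illustration: the `TeamReview` record, the 1-to-5 observer scales, and the sample values are assumptions made for the sketch, not the hospital's actual instrument or data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TeamReview:
    """One multidisciplinary case review (hypothetical record layout)."""
    used_ai: bool            # was the AI tool part of this review?
    diagnosis_correct: bool  # the task-level outcome a classic benchmark would score
    coordination: int        # observer rating, assumed 1-5 scale
    deliberation: int        # observer rating, assumed 1-5 scale


def summarize(reviews, used_ai):
    """Average each metric over the reviews in one arm (with or without AI)."""
    arm = [r for r in reviews if r.used_ai == used_ai]
    return {
        "accuracy": mean(1.0 if r.diagnosis_correct else 0.0 for r in arm),
        "coordination": mean(r.coordination for r in arm),
        "deliberation": mean(r.deliberation for r in arm),
    }


def compare_arms(reviews):
    """Difference (with AI minus without AI) for every team-level metric."""
    with_ai = summarize(reviews, True)
    without_ai = summarize(reviews, False)
    return {name: with_ai[name] - without_ai[name] for name in with_ai}


# Placeholder values for illustration only; not real observations.
example_reviews = [
    TeamReview(used_ai=True, diagnosis_correct=True, coordination=3, deliberation=4),
    TeamReview(used_ai=True, diagnosis_correct=True, coordination=2, deliberation=4),
    TeamReview(used_ai=False, diagnosis_correct=True, coordination=4, deliberation=3),
    TeamReview(used_ai=False, diagnosis_correct=False, coordination=4, deliberation=3),
]
example_delta = compare_arms(example_reviews)
```

The point of the design is that accuracy sits next to the team-level measures rather than replacing them. With these placeholder values, a tool that raises accuracy also shows a drop in coordination, which an accuracy-only benchmark would not reveal.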
Evaluating Performance Over Time
Today's benchmarks resemble school exams—one-off, standardized tests of accuracy. But real professional competence is assessed differently. Junior doctors and lawyers are evaluated continuously inside real workflows, under supervision, with feedback loops and accountability structures. Performance is judged over time and in a specific context, because competence is relational. If AI systems are meant to operate alongside professionals, their impact should be judged longitudinally, reflecting how performance unfolds over repeated interactions.
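As a rough illustration of what judging performance longitudinally could mean in practice, the Python sketch below tracks one metric across repeated observations instead of scoring it once. The metric (minutes spent per case), the weekly cadence, the window size, and the numbers are all assumptions chosen for the example.

```python
from statistics import mean


def rolling_mean(values, window):
    """Mean of each consecutive run of `window` observations."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def trend(values, window=3):
    """Change between the first and last rolling averages.

    A positive result means the metric rose over the observation period.
    """
    smoothed = rolling_mean(values, window)
    return smoothed[-1] - smoothed[0]


# Placeholder weekly figures (minutes per case); not real measurements.
minutes_per_case = [12, 11, 13, 14, 15, 16]
change = trend(minutes_per_case)
```

A one-off test would record only the first value. The rolling comparison shows whether the metric drifts as the team keeps working with the tool, which is the signal a longitudinal evaluation is meant to capture.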
Conclusion
The HAIC approach will make benchmarking more complex, more resource-intensive, and harder to standardize. But continuing to evaluate AI in sanitized conditions detached from the world of work will leave us misunderstanding what it truly can and cannot do for us. To deploy AI responsibly in real-world settings, we must measure what actually matters: not just what a model can do alone, but what it enables, or undermines, when humans and teams in the real world work with it.
In the end, the future of AI depends on our ability to ask the right questions and measure the right outcomes. Shifting the focus to teams and workflows, to long-term impacts, and to organizational outcomes, coordination quality, and error detectability gives us a more accurate and responsible picture of AI's capabilities and limitations.




