Vibes, benchmarks and the “evaluation crisis” of intelligence
GPT-4.5 reminded us of a simple truth: evaluating intelligence is hard.
When GPT-4.5 launched last week, immediate reactions were mixed. Some hailed it as a groundbreaking advancement, while others dismissed it as a minor improvement over GPT-4o, arguing it didn’t justify the wait and the cost (GPT-4.5 is OpenAI’s largest model to date), some even prematurely suggesting the death of pre-training scaling. Benchmarks failed to settle the debate: early results on benchmarks such as MMLU showed modest improvements, but their limited scope left crucial aspects of GPT-4.5’s capabilities unmeasured.
Enter Chatbot Arena's results, placing GPT-4.5 in the top spot across all categories (including hard prompts, coding, and creative writing).
It made me realize something: objective metrics (benchmarks) still fail to meaningfully measure progress in intelligence, while our subjective, instinctual preferences (“vibes”, aggregated in Chatbot Arena) do capture something. It puzzled me at first, and I sense it says something profound about the role subjectivity has to play in the evaluation of intelligence, at least in the absence of more comprehensive and robust benchmarks.
And it ultimately speaks to the “evaluation crisis” coined by Andrej Karpathy: current evaluation methods remain inadequate precisely because intelligence resists simple quantification and requires the integration of subjective insights.
What is it we’re building and measuring, exactly?
The challenge of evaluating intelligence obviously begins with defining it - and then crafting the right measurement tools. This is how the now-popular ARC-AGI came to be: François Chollet first discussed the definition of intelligence, arguing that it fundamentally involves adaptability, and then built a benchmark designed to capture exactly that.
These days, many discussions revolve around the definition of the hottest concept in town: AGI. Roughly speaking, AGI is human-level intelligence (as opposed to ASI, which is beyond human-level). Everybody in the AI industry - at least all the frontier labs - is now trying to build it, progress has been rapid recently, and prominent industry leaders like Dario Amodei are saying that “AI systems better than humans at almost everything” will arrive by 2027, 2030 at the latest.
But it remains hard to tell where we are on the journey to AGI, and how we should measure progress towards that goal. One big obstacle is the absence of a clear definition of AGI. We all intrinsically sense and share a vague but common understanding of what we’re talking about. The problem is that this vagueness is difficult for benchmarks to measure - but less so for our subjectivity. And the reason, I would argue, is that AGI is essentially the projection of our own intelligence reflected back at us.
Thus, defining and measuring artificial intelligence boils down to many of the same challenges we face in measuring and defining our own human intelligence.
The problem with benchmarking and the limitations of objectivity
Benchmarks can measure many dimensions of intelligence: formal reasoning, adaptability, utility, accuracy, and even emotional intelligence. Yet they presently cannot measure intelligence holistically.
Put simply, current benchmarks are not enough to measure the multi-faceted thing we call intelligence.
They are crucial but inherently limited, as they only capture specific domains or dimensions. Benchmarks like MMLU and GPQA, although sophisticated, cannot fully encapsulate the nuanced, multifaceted nature of intelligence. They focus predominantly on specific, quantifiable capabilities, often neglecting other dimensions like adaptability, creativity, or emotional intelligence. New generations of benchmarks have started to emerge, like the famous ARC-AGI, an attempt to better capture dimensions like adaptability. And while our models are moving towards generality (cf. how we’re now using “real-world” tasks like Pokemon and Mario as benchmarks), the benchmarks themselves remain relatively narrow.
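To make concrete what “specific and quantifiable” means here: a multiple-choice benchmark like MMLU ultimately collapses into a single accuracy number over a fixed set of items. Below is a minimal, purely illustrative sketch - the items and the ask_model placeholder are hypothetical, not the actual MMLU harness:

```python
# Minimal sketch of how a multiple-choice benchmark score is computed:
# the whole evaluation collapses to a single accuracy number.
# The items and the ask_model() function below are hypothetical placeholders.

def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder for a call to the model under evaluation."""
    return choices[0]  # a real harness would query the model here

items = [
    {"question": "2 + 2 = ?", "choices": ["4", "5", "3", "22"], "answer": "4"},
    {"question": "Capital of France?", "choices": ["Paris", "Lyon"], "answer": "Paris"},
]

correct = sum(
    ask_model(item["question"], item["choices"]) == item["answer"] for item in items
)
accuracy = correct / len(items)
print(f"benchmark score: {accuracy:.2%}")  # one number; says nothing about creativity or adaptability
```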
If advanced benchmarks were sufficient, model evaluation would be simple and unanimous - but as recent events show, it is neither. And I would go so far as to say that benchmarks will never be sufficient to measure intelligence fully.
There is a straightforward analogy for this: recruitment and testing.
Consider how we assess human intelligence and suitability for professional roles. Hiring managers don't simply run IQ, EQ, or other tests on candidates, because the intelligence recruiters are trying to measure cannot be adequately captured by standardized assessments. It spans situational judgment, adaptability, creativity, emotional intelligence, and domain-specific skills - qualities difficult to quantify through formal testing alone.
Instead, candidates go through several rounds of interviews, and the final decision is often the result of the aggregated preferences expressed by the interviewers. In other words, the subjective experience of others is what measures the candidate’s intelligence and suitability.
Vibe checks and the role of “collective subjectivity”
An alternative to benchmarks is the vibe-check. Vibe-checks can rely on either humans (through platforms like Chatbot Arena) or other LLMs (using LLM-as-a-judge), but I want to focus here on human-based vibe-checks.
An interesting aspect is that while “vibe” is hard to measure at the individual level (not to mention that it’d be of limited value there because of various biases), it becomes a powerful metric once aggregated with others’ sentiments. Chatbot Arena exemplifies this phenomenon well. Its strength lies precisely in aggregating subjective human judgments into an intersubjective consensus, and it has become one of the most influential measurement tools in the AI community.
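Mechanically, the aggregation is simple: each head-to-head vote nudges a per-model rating, and many thousands of votes converge on a stable ranking (Chatbot Arena itself uses Elo-style / Bradley-Terry ratings). Here is a minimal sketch of that idea, similar in spirit but not the Arena’s actual pipeline - the vote data and model names are hypothetical:

```python
# Minimal sketch of aggregating pairwise "vibe" votes into Elo-style ratings.
# The votes and model names below are hypothetical and purely illustrative.

from collections import defaultdict

K = 32          # update step size
BASE = 1000.0   # initial rating for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def aggregate_votes(votes):
    """votes: iterable of (model_a, model_b, winner) tuples, winner in {'a', 'b', 'tie'}."""
    ratings = defaultdict(lambda: BASE)
    for a, b, winner in votes:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

# Hypothetical votes from anonymous head-to-head comparisons.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-z", "model-x", "b"),
]
print(aggregate_votes(votes))
```

No individual vote means much on its own; the signal only emerges once thousands of subjective judgments are pooled into a single ranking.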
But this is not a panacea. While crowdsourced vibe-checks (such as Chatbot Arena) are interesting, they’re not immune to biases and flaws. How representative exactly are the Chatbot Arena testers? I haven’t found anything about their demographics and profiles, but one can assume that they are mostly tech enthusiasts, or “high taste testers”. These may not provide accurate signals of the preferences and utility regular users of LLMs would experience, especially as the user population grows exponentially. Improving transparency about testers’ demographics, or incorporating broader and more diverse user populations, would significantly enhance the reliability and relevance of Chatbot Arena’s results.
That said, “vibe-checking” at scale with a sufficiently representative (and more comprehensive) Chatbot Arena might even become a new scaling law. Further to this, we might hypothesise that, in the absence of (1) a universal definition of intelligence and (2) AGI benchmarks, a sign of achieving AGI could be an intersubjective consensus that “this model is better/more capable/more intelligent than me”.
Beyond the "evaluation crisis", building better benchmarks
Back to the current situation, I find the framing of an "evaluation crisis" problematic because it implies that evaluation previously worked well. Maybe benchmarks gave good signals before they became saturated and contaminated, but they have always been incomplete and flawed. What we face today is a paradigm shift, or an expansion, rather than a crisis. Models have simply outgrown the narrow confines of traditional evaluation metrics (both benchmarks and Chatbot Arena), demanding new, more comprehensive frameworks. Besides, the perceived “crisis” arises from frustration over benchmarks becoming saturated or misaligned with real-world utility. But this isn't a crisis - it's the inevitable growing pain of a maturing field.
This is all the more important because evaluation tools are not merely ex-post measurements; they play an active role in the making of intelligence. Better benchmarks and evaluations help AI developers produce better models, and evaluation frameworks guide the development of model capabilities. In that regard, the science of evaluation suffers from underinvestment relative to its potential benefits.
Better evaluation frameworks are needed, but they’re tedious to build, requiring both rigour and creativity. New benchmarks grounded in real-world tasks are emerging (e.g. SWE-lancer), and excellent resources are flourishing, such as Clementine Fourrier’s practical Evaluation Guidebook or papers discussing benchmark quality such as BetterBench.
The opportunity to advance collective intelligence?
Evaluating intelligence—whether human or artificial—remains a profound challenge precisely because intelligence transcends simple metrics and simplistic definitions. It demands nuance, judgment, and creativity. It demands human vibes.
What I find most compelling is that our pursuit of better evaluations could turn out to be an opportunity to redefine and refine our collective intelligence, one vibe-check at a time.