ARC-AGI-2: Leading AI models fail new test of artificial general intelligence
The ARC-AGI-2 benchmark is designed to be a difficult test for AI models
Just_Super/Getty Images
The most sophisticated AI models in existence today have scored poorly on a new benchmark designed to measure their progress towards artificial general intelligence (AGI) – and brute-force computing power won’t be enough to improve their scores, as evaluators are now taking into account the cost of running each model.
There are many competing definitions of AGI, but it is generally taken to refer to an AI that can perform any cognitive task that humans can do. To measure this, the ARC Prize Foundation previously launched a test of reasoning abilities called ARC-AGI-1. Last December, OpenAI announced that its o3 model had scored highly on the test, leading some to ask if the company was close to achieving AGI.
But now a new test, ARC-AGI-2, has raised the bar. It is difficult enough that no current AI system on the market can achieve more than a single-digit score out of 100 on the test, while every question has been solved by at least two humans in fewer than two attempts.
In a blog post announcing ARC-AGI-2, ARC president Greg Kamradt said the new benchmark was required to test different skills from the previous iteration. “To beat it, you must demonstrate both a high level of adaptability and high efficiency,” he wrote.
The ARC-AGI-2 benchmark differs from other AI benchmark tests in that it focuses on AI models’ abilities to complete simple tasks – such as replicating changes in a new image based on past examples of symbolic interpretation – rather than their ability to match PhD-level performance. Current models are good at “deep learning”, which ARC-AGI-1 measured, but are not as good at the seemingly simpler tasks in ARC-AGI-2, which require more challenging thinking and interaction. OpenAI’s o3-low model, for instance, scores 75.7 per cent on ARC-AGI-1, but just 4 per cent on ARC-AGI-2.
The benchmark also adds a new dimension to measuring an AI’s capabilities, by looking at its efficiency in problem-solving, as measured by the cost required to complete a task. For example, while ARC paid its human testers $17 per task, it estimates that o3-low costs OpenAI $200 in fees for the same work.
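To put those two figures side by side, here is a minimal Python sketch of the arithmetic. It assumes the $17 and $200 per-task figures quoted above are directly comparable; the names `cost_ratio`, `HUMAN_COST_PER_TASK` and `O3_LOW_COST_PER_TASK` are illustrative and do not come from ARC.

```python
# Per-task figures as quoted in the article; all names here are illustrative.
HUMAN_COST_PER_TASK = 17.0    # dollars ARC paid its human testers per task
O3_LOW_COST_PER_TASK = 200.0  # dollars ARC estimates o3-low costs per task


def cost_ratio(model_cost: float, baseline_cost: float) -> float:
    """Return how many times more expensive the model is than the baseline."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return model_cost / baseline_cost


ratio = cost_ratio(O3_LOW_COST_PER_TASK, HUMAN_COST_PER_TASK)
print(f"o3-low costs roughly {ratio:.1f} times as much per task as a human tester")
```

On these numbers, o3-low comes out at nearly 12 times the per-task cost of a human tester, which is the kind of gap the efficiency measure is meant to expose.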
“I think the new iteration of ARC-AGI now focusing on balancing performance with efficiency is a big step towards a more realistic evaluation of AI models,” says Joseph Imperial at the University of Bath, UK. “This is a sign that we’re moving from one-dimensional evaluation tests solely focusing on performance but also considering less compute power.”
Any model that is able to pass ARC-AGI-2 would need to not just be highly competent, but also smaller and lightweight, says Imperial – with the efficiency of the model being a key component of the new benchmark. This could help address concerns that AI models are becoming more energy-intensive – sometimes to the point of wastefulness – to achieve ever-greater results.
However, not everyone is convinced that the new measure is beneficial. “The whole framing of this as it testing intelligence is not the right framing,” says Catherine Flick at the University of Staffordshire, UK. Instead, she says these benchmarks merely assess an AI’s ability to complete a single task or set of tasks well, which is then extrapolated to mean general capabilities across a series of tasks.
Performing well on these benchmarks should not be seen as a major step towards AGI, says Flick: “You see the media pick up that these models are passing these human-level intelligence tests, where actually they’re not; what they are doing is really just responding to a particular prompt accurately.”
And exactly what happens if or when ARC-AGI-2 is passed is another question – will we need yet another benchmark? “If they were to develop ARC-AGI-3, I’m guessing they would add another axis in the graph denoting [the] minimum number of humans – whether expert or not – it would take to solve the tasks, in addition to performance and efficiency,” says Imperial. In other words, the debate over AGI is unlikely to be settled soon.