As artificial intelligence (AI) technology evolves at breakneck speed, tech companies are facing a growing challenge: existing benchmarks for evaluating AI models are becoming outdated. Companies like OpenAI, Microsoft, Meta, and Anthropic are responding by designing new methods to assess their models' capabilities. These models, which include AI agents capable of executing tasks for humans autonomously, require more advanced testing due to the increasing complexity of the tasks they are expected to perform.
Historically, AI models were tested against standardized benchmarks such as Hellaswag and MMLU, which assess common-sense reasoning and general knowledge through multiple-choice questions. However, with newer models achieving near-perfect scores on these tests, researchers argue that they no longer provide a meaningful measure of a model's capabilities. Mark Chen, Senior VP of Research at OpenAI, says that simple, human-written tests are no longer a reliable barometer of AI performance.
In response, new testing methods are being developed to assess the models' ability to reason and plan over multiple tasks. One example is SWE-bench Verified, which was updated in August to better evaluate autonomous systems. Unlike previous tests, this benchmark uses real-world software problems sourced from GitHub, asking AI agents to solve engineering issues with code. In these tests, OpenAI’s GPT-4 solved 41.4% of the problems, while Anthropic’s Claude 3.5 Sonnet achieved 49%.
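To make the mechanics concrete, the following is a minimal, illustrative sketch of how a SWE-bench-style evaluation loop can be scored: an agent is handed a GitHub issue, proposes a code patch, and gets credit only if the repository's tests pass with that patch applied. It is not the actual SWE-bench Verified harness; the names (TaskInstance, evaluate_agent, run_tests_with_patch) and the toy stand-ins for the agent and test runner are hypothetical.

from dataclasses import dataclass
from typing import Callable, List

# Hypothetical task record: SWE-bench-style instances pair a real GitHub issue
# with the repository state and the tests a correct fix must pass.
@dataclass
class TaskInstance:
    repo: str
    issue_text: str
    run_tests_with_patch: Callable[[str], bool]  # applies a patch, returns True if the tests pass

def evaluate_agent(agent: Callable[[TaskInstance], str], tasks: List[TaskInstance]) -> float:
    """Score an agent by the fraction of issues whose tests pass after its patch (the 'resolved rate')."""
    resolved = 0
    for task in tasks:
        patch = agent(task)                    # the agent produces a code patch for the issue
        if task.run_tests_with_patch(patch):   # the harness applies the patch and runs the tests
            resolved += 1
    return resolved / len(tasks) if tasks else 0.0

# Toy usage: a stub "agent" and a stub test runner stand in for a real model and repository.
if __name__ == "__main__":
    toy_task = TaskInstance(
        repo="example/repo",
        issue_text="Division helper crashes on zero denominator",
        run_tests_with_patch=lambda patch: "ZeroDivisionError" in patch,
    )
    stub_agent = lambda task: "wrap the call in a try/except ZeroDivisionError guard"
    print(f"resolved: {evaluate_agent(stub_agent, [toy_task]):.1%}")  # prints: resolved: 100.0%

The percentages reported for GPT-4 and Claude 3.5 Sonnet above are this kind of resolved rate, computed over hundreds of real GitHub issues rather than a toy example.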
As Jared Kaplan, Chief Science Officer at Anthropic, notes, the complexity of these new benchmarks requires creating comprehensive "sandbox environments" for the AI to interact with, rather than relying on simple prompts.
The debate over whether AI models are truly capable of reasoning as humans do remains unresolved. Some researchers, including those at Apple, question whether current AI models are genuinely reasoning or merely pattern-matching against data they have already seen in training. Nevertheless, companies like Microsoft are pushing the envelope by creating internal benchmarks that focus on reasoning, particularly in STEM subjects and coding tasks.
Ece Kamar, VP at Microsoft Research, emphasizes that reasoning is crucial for advancing AI capabilities, especially for autonomous systems that need to perform complex, multi-step tasks. This shift towards reasoning-based benchmarks is part of an effort to move AI testing beyond simple accuracy metrics and focus on how well AI models can understand and solve real-world problems.
While companies like Meta, OpenAI, and Microsoft have created their own benchmarks, this has raised concerns about transparency and the ability to compare AI systems objectively. Dan Hendrycks, Executive Director of the Center for AI Safety, warns that without public benchmarks, it is difficult for businesses or the broader society to gauge the true capabilities of different models.
One new initiative, “Humanity’s Last Exam,” crowdsourced complex, abstract reasoning questions from experts across various disciplines. Another new benchmark, FrontierMath, was designed to test advanced models on mathematically rigorous problems. On that benchmark, even the most advanced models struggled, answering fewer than 2% of the questions correctly.
The AI industry is grappling with the challenge of developing benchmarks that are both complex and reflective of real-world applications. While new testing methods are crucial to keep up with the technology's rapid evolution, they also highlight the ongoing debate about how to define and measure AI’s capabilities. As Ahmad Al-Dahle, Generative AI Lead at Meta, points out, benchmarks can become outdated as soon as they are defined because AI models are trained to meet them, not necessarily to provide a true measure of intelligence.
The future of AI testing seems to be moving towards more intricate and dynamic benchmarks that assess the reasoning, adaptability, and autonomous problem-solving abilities of AI systems—shifting the focus from mere accuracy to real-world utility.