Meta has launched two new models in its Llama 4 family, named Maverick and Scout, aimed at advancing the company's AI offerings. These models have attracted significant attention, particularly for their performance in recent benchmarks. However, questions about their transparency and testing methods have sparked controversy.
Meta’s Llama 4 Models: An Overview
Released on April 5, 2025, Meta's Llama 4 models, Maverick and Scout, promise to be significant players in the growing generative AI market. These models are designed to power Meta’s various services, including Instagram, WhatsApp, and Messenger. Meta is positioning its new AI models as a challenge to competitors like OpenAI's ChatGPT and Google’s Gemini series, aiming to offer high-performance models that don’t require immense computational power to operate.
Benchmarking Confusion: Maverick Model's Performance
Meta made headlines with the claim that Maverick outperformed OpenAI’s GPT-4o on LMArena, a crowdsourced AI benchmarking platform. However, there was a critical detail Meta failed to disclose: the model submitted for testing was not the same version available to the public. It was an experimental version, labeled “llama-4-maverick-03-26-experimental,” optimized for conversationality. This clarification, buried in fine print on Meta’s website, raised questions about the fairness of the benchmark.
LMArena Responds to Meta’s Misinterpretation
LMArena, which tracks AI model performance based on user voting, voiced concerns about Meta’s lack of transparency. The platform stated that Meta’s interpretation of the rules did not align with their expectations. Meta’s choice to submit a custom-tuned model designed specifically to optimize human-like responses led to accusations that the company was manipulating the test results. Despite these claims, Meta’s Vice President of Generative AI, Ahmad Al Dahle, denied the allegations, asserting that the differences in results were due to ongoing adjustments to stabilize the models.
Key Differences Between Maverick and Scout Models
The two Llama 4 models cater to different needs. Scout, the smaller of the two, activates 17 billion parameters per token and can run on a single Nvidia H100 GPU. It is designed for efficiency, drawing on 16 experts (sub-networks) to handle a range of tasks. Maverick also activates 17 billion parameters per token but draws on a pool of 128 experts, giving it a far larger total parameter count and a broader scope of capabilities. Both models are open-weight, meaning developers can examine their underlying structures and customize them for specific use cases.
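The efficiency claim rests on that mixture-of-experts design: a router sends each token to only a few expert sub-networks, so the parameters active per token are a small fraction of the total. A minimal sketch of the idea in Python follows; the sizes, routing rule, and single-matrix "experts" are toy illustrations, not Meta's actual implementation.

```python
# Toy mixture-of-experts (MoE) routing sketch. Illustrative only;
# real models use full feed-forward experts and learned routers.
import numpy as np

rng = np.random.default_rng(0)

DIM = 8            # toy hidden size
NUM_EXPERTS = 16   # Scout is reported to use 16 experts; Maverick, 128
TOP_K = 1          # only a small subset of experts runs per token

# Each "expert" is stand-in for a feed-forward sub-network.
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((DIM, NUM_EXPERTS))  # scores each expert per token

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route a token to its top-k experts and mix their outputs."""
    scores = token @ router                 # one score per expert
    top = np.argsort(scores)[-TOP_K:]       # indices of the chosen experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                # softmax over the chosen experts
    # Only the selected experts run, so the parameters "active" for this
    # token are a fraction of the total parameter count.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.standard_normal(DIM))
print(out.shape)  # (8,)
```

This is why both models can share the same active-parameter figure while differing greatly in total size: adding experts grows the pool the router can choose from without increasing the per-token compute.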
Meta Faces Backlash Over Benchmarking Practices
Meta's claims about the Maverick model’s performance quickly faced backlash after the discrepancy between the submitted test model and the public version was revealed. Critics argued that the benchmarking results were skewed and not representative of the real-world performance users would experience. LMArena responded by updating its leaderboard policies to prevent similar issues in the future.
Despite the controversy, Meta remains committed to improving and refining its models. The company has since made the open-source version of Llama 4 available for developers to explore and adapt. Meta also teased the upcoming release of additional models in the Llama 4 family, including a base model called Behemoth and a reasoning-focused model, expected to be detailed at LlamaCon later this month.
This article was created with AI assistance.