The debate over AI benchmarks and how companies report them has taken center stage, with OpenAI and Elon Musk’s AI company, xAI, engaging in a public dispute over recent test results of Grok 3’s performance. This controversy highlights the challenges in evaluating AI models and raises questions about transparency in performance reporting.

This week, an OpenAI employee accused xAI of publishing misleading benchmark results for its latest AI model, Grok 3. In response, Igor Babuschkin, one of xAI’s co-founders, defended the company’s claims, insisting that the reported results were accurate. The reality seems to lie somewhere in the middle.

xAI’s Grok 3 vs. OpenAI’s Models

xAI published a blog post showcasing Grok 3’s performance on AIME 2025, a set of advanced math questions derived from an invitational mathematics exam. While some experts question AIME’s reliability as an AI benchmark, it is still widely used to assess a model’s problem-solving skills.

According to xAI’s graph, two versions of Grok 3 (Grok 3 Reasoning Beta and Grok 3 mini Reasoning) outperformed OpenAI’s o3-mini-high model on AIME 2025. However, OpenAI employees quickly pointed out a crucial omission: xAI’s comparison did not include o3-mini-high’s score at “cons@64”.

What is “cons@64,” and Why Does it Matter?

The term “cons@64” stands for “consensus@64,” a method where an AI model is given 64 attempts to answer a question, and the most frequently chosen response is considered the final answer. This technique typically improves a model’s benchmark scores significantly. By leaving out o3-mini-high’s cons@64 score, xAI’s graph may have created the impression that Grok 3 outperformed OpenAI’s model when in reality, the comparison was incomplete.
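
To make the difference between the two scoring methods concrete, here is a minimal sketch in Python. The function names and the toy answers are invented for illustration and are not taken from xAI’s or OpenAI’s evaluation code. It assumes answers are compared as exact strings and that “@1” means grading only the first sampled attempt; real evaluation harnesses may define both details differently.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the most frequent answer among the sampled attempts."""
    if not samples:
        raise ValueError("need at least one sample")
    # most_common(1) yields [(answer, count)] for the top answer.
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts_per_question, correct_answers):
    """Fraction of questions whose first attempt is correct."""
    hits = sum(
        attempts[0] == truth
        for attempts, truth in zip(attempts_per_question, correct_answers)
    )
    return hits / len(correct_answers)


def score_cons_at_k(attempts_per_question, correct_answers, k=64):
    """Fraction of questions whose majority answer over the first k attempts is correct."""
    hits = sum(
        consensus_answer(attempts[:k]) == truth
        for attempts, truth in zip(attempts_per_question, correct_answers)
    )
    return hits / len(correct_answers)


# Hypothetical toy data: 3 questions, 4 sampled attempts each (k=4 rather than 64).
attempts = [
    ["12", "7", "7", "7"],  # first attempt wrong, majority right
    ["3", "3", "5", "3"],   # first attempt right, majority right
    ["9", "1", "1", "1"],   # first attempt wrong, majority right
]
truths = ["7", "3", "1"]

at_1 = score_at_1(attempts, truths)
cons_at_4 = score_cons_at_k(attempts, truths, k=4)
print(f"@1: {at_1:.2f}, cons@4: {cons_at_4:.2f}")
```

On this toy data the same model answers one of three questions correctly on its first attempt but all three under majority voting, which is why comparing one model’s “@1” score against another’s “cons@64” score is not a like-for-like comparison.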

On the first-attempt score, known as “@1”, Grok 3 Reasoning Beta and Grok 3 mini Reasoning actually fall below o3-mini-high’s score. Additionally, Grok 3 Reasoning Beta slightly lags behind OpenAI’s o1 (medium compute) model. Despite this, xAI is marketing Grok 3 as the “world’s smartest AI.”

The Broader AI Benchmarking Debate

Igor Babuschkin countered OpenAI’s criticism by stating that OpenAI itself has previously published similarly misleading benchmark charts, though mainly for internal comparisons of its own models. A more neutral observer later compiled a revised graph that included nearly all models’ cons@64 scores, providing a more balanced comparison.

However, as AI researcher Nathan Lambert pointed out, a key piece of information remains unknown: the computational and financial cost required for each model to achieve its best performance. This missing data highlights a major limitation of AI benchmarks—while they provide useful insights, they often fail to convey the full picture of a model’s capabilities, efficiency, and real-world usability.

The ongoing debate between xAI and OpenAI underscores the importance of transparency in AI development. As AI models become more advanced, fair and standardized benchmarking practices will be crucial in accurately evaluating their strengths and weaknesses.

