Austin, February 23: Elon Musk's xAI recently released Grok 3, promising upgrades over the Grok 2 model. The latest version of Grok was launched via a live video on X, where Elon Musk and the xAI engineering team presented the benchmarks and capabilities of the new model. The benchmarks shown at launch portrayed the model as powerful and comfortably ahead of other leading AI products on the market.

However, despite the grand launch and users flocking to the new xAI model, a debate over the Grok 3 benchmarks soon began. Boris Power, who works in Applied Research at OpenAI, accused the Grok team of cheating and deceiving people into believing that Grok 3 was a better model than o3-mini. He said that OpenAI's o3-mini was better in every evaluation compared to the latest Grok 3 model launched by xAI.

Grok 3 Inherently an o1-Level Model

If the light blue part is best of N scores, this means that Grok 3 reasoning is inherently an ~o1 level model. This means the capabilities gap between OpenAI and xAI is ~9 months. Also what is the difference between "think" and "big brain" pic.twitter.com/Jw8yk5tEm9 — wh (@nrehiew_) February 18, 2025

o3-mini Better Than Grok 3, Which xAI Is Overselling, Argues OpenAI's Boris Power

Disappointing to see the incentives for the grok team to cheat and deceive in evals. Tl;dr o3-mini is better in every eval compared to grok 3. Grok 3 is genuinely a decent model, but no need to over sell. https://t.co/sJj5ByVikp — Boris Power (@BorisMPower) February 20, 2025

Igor Babuschkin Responds to Boris Power, Refuting Accusations

Completely wrong. We just used the same method you guys used 🤷‍♂️ pic.twitter.com/exLcS0z2xI — Igor Babuschkin (@ibab) February 20, 2025

'Grok Looks Good There'

Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it's DeepSeek propaganda (I actually believe Grok looks good there, and openAI's TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic — Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025

Boris Power said, "Grok 3 is genuinely a decent model, but no need to over sell." Separately, an X user posted that, if the light blue part of xAI's chart represented best-of-N scores, Grok 3's reasoning was inherently an "~o1 level model". The user said this put the capabilities gap between OpenAI and xAI at roughly nine months. The X user shared an AIME 2025 performance chart to highlight the difference.

In response, xAI's Igor Babuschkin said that the allegations were "Completely wrong". He said, "We just used the same method you guys used", and shared a benchmark image again, this time with the AIME 2024 test. However, according to a report by TechCrunch, AIME 2025 and older versions of the test, which are used to gauge a model's math capabilities, are not considered fully reliable, and some have questioned AIME's validity as a benchmark.

According to the report, an OpenAI employee pointed out that xAI's graph did not include the AIME 2025 score of o3-mini-high at cons@64 (consensus@64), meaning "running a query through the model 64 times and marking it correct if the most common output is correct." The report mentioned that OpenAI had itself previously published similarly misleading benchmark charts comparing its own models.
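To illustrate why the choice of metric matters, the following is a minimal sketch of how a consensus score differs from a single-attempt score on one problem. The function names and the sample answers are hypothetical and made up for illustration; they are not taken from xAI's or OpenAI's evaluation code.

```python
from collections import Counter


def consensus_at_k(samples, reference):
    """Return True if the most common sampled answer equals the reference."""
    if not samples:
        return False
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return most_common_answer == reference


def pass_at_1(samples, reference):
    """Return the fraction of individual samples that match the reference."""
    if not samples:
        return 0.0
    return sum(s == reference for s in samples) / len(samples)


# Hypothetical data: 64 sampled answers to one problem whose answer is "204".
samples = ["204"] * 30 + ["210"] * 20 + ["96"] * 14

print(consensus_at_k(samples, "204"))  # True
print(pass_at_1(samples, "204"))       # 0.46875
```

In this made-up case the model answers correctly in fewer than half of its individual attempts, yet the problem still counts as solved under consensus scoring because the correct answer is the most frequent one. A chart that mixes the two metrics can therefore make one model look stronger than a like-for-like comparison would.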

Meanwhile, another user said, "Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it's DeepSeek propaganda." The user added that Grok looked good in the plot, and that what they called OpenAI's "TTC chicanery" behind the o3-mini-high pass@1 score deserved more scrutiny.

