Your AI Benchmark Scores Are Lying to You: Revision history

7 December 2025

  • 17:40, 7 December 2025, PC: 12,057 bytes (+12,057). Created page with "In April 2025, Meta submitted a model to the AI industry’s most watched leaderboard. Llama 4 shot to the top of Chatbot Arena, the crowdsourced battle royale where anonymous models compete for human preference votes. Headlines followed. Celebrations ensued. Then someone looked closer. The model Meta submitted wasn’t the model they planned to ship. It was a variant, carefully tuned for the arena: verbose responses, strategic emoji placement, the textual equivalent o..."