
Reproducing the AI ‘Chain of Babble’ with Automatic Testing Tools


Two days ago, Jim the AI Whisperer proposed a curious idea: that babbling might make large language models reason better. I decided to put that claim to the test… automatically

Talking with AI is like playing hide and seek: you need to count to 100 before starting. Photo by Kirill Tonkikh on Unsplash

TL;DR: Inspired by Jim the AI Whisperer’s article on the “Chain of Babble”, I used my own AI testing platform to verify his theory automatically.

By running structured experiments across several models and prompt styles, I discovered how “Blah Blah Blah”, counting to one hundred, and Chain of Thought affect model accuracy and behavior, sometimes in surprising ways.


Context

Jim the AI Whisperer recently published an intriguing article about the “Chain of Babble”: “We’ve been wrong about how AI thinks this whole time — and my ‘Chain of Babble’ theory proves it”, in which he dramatically improved AI accuracy on a complex task by replacing Chain of Thought reasoning with “Blah Blah Blah” (medium.com).

It explored how filler words like “blah blah blah” can produce reasoning effects similar to Chain of Thought (CoT), sometimes even better.

As soon as I read it, I wanted to test the theory automatically. It was also a good excuse to exercise the platform I developed in a previous article:

“How I Built a Multi-Model AI Testing App in 4 Days (with Copilot)”: a behind-the-scenes build story of how I rapidly created a multi-model AI testing app using GitHub Copilot as my… (medium.com)

So I fired up my AI testing tool, rolled up my sleeves, and ran the numbers.


Experiment Setup

Based on Jim’s article, I created three prompts (sketched below):

  • A basic prompt with the riddle
  • A prompt using “Blah Blah”
  • A prompt with explicit Chain of Thought instructions
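
To make the setup concrete, here is a rough sketch of what such prompt variants can look like. The wording is illustrative (only the riddle and the one-word constraint come from the actual experiment), and the variable names are mine:

```python
# Illustrative prompt variants; the exact wording used in the experiment differs.
RIDDLE = "Solve this puzzle in ONE WORD: What is the Android afraid of?"

PROMPTS = {
    # 1. Basic prompt: just the riddle.
    "simple": RIDDLE,
    # 2. "Chain of Babble": meaningless filler before the answer.
    "blah_blah": 'Start your answer with "Blah Blah Blah" repeated a few times, then ' + RIDDLE,
    # 3. Explicit Chain of Thought instructions.
    "cot": "Think step by step and explain your reasoning, then " + RIDDLE,
}
```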

Validation Rules

A test counts as a success if either condition is met:

  • The last word output by the AI is “Snakes”
  • The response contains “the answer is: Snakes” anywhere in the text

Tests are case-insensitive, and variations like “Snakes.” or “**Snakes**” are accepted (a minimal sketch of this check follows the setup).

Each of the three prompts was run 10 times on three Anthropic models:

  • Claude Opus 4.1
  • Claude Sonnet 4.5
  • Claude Sonnet 3.7

The results were compiled automatically by the platform.
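
For reference, the validation rules above boil down to a small check. Here is a minimal sketch of how such a check can be written (illustrative, not the platform’s actual code):

```python
import re

def is_success(response: str) -> bool:
    """Pass if the last word is "Snakes" or the text contains
    "the answer is: Snakes" (case-insensitive, ignoring punctuation
    and markdown such as "**Snakes**")."""
    text = response.lower()
    if "the answer is: snakes" in text:
        return True
    words = re.findall(r"[a-z]+", text)  # strips punctuation and markdown
    return bool(words) and words[-1] == "snakes"

# Accepted variations from the rules above.
assert is_success("It is afraid of **Snakes**.")
assert is_success("Blah blah... the answer is: Snakes, of course")
assert not is_success("The android fears heights.")
```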

The “Blah Blah Blah” Trick Works… Sometimes Dramatically

Replacing structured “Chain of Thought” reasoning with “Blah Blah Blah” boosted accuracy, especially for Claude Opus 4.1 and Claude Sonnet 3.7.

Cost ≠ Accuracy

  • The most expensive run: Opus 4.1 (CoT) at $2.67 scored only 10% accuracy.
  • The cheapest run: Sonnet 3.7 (Blah Blah) at $0.0472 achieved 100% accuracy.

Paying more doesn’t guarantee better results, especially when the bottleneck is the prompt, not the model.

Prompt Style Affects Model Families Differently

  • Opus 4.1 struggled with CoT and simple prompts (10%) but hit 80% with Blah Blah.
  • Sonnet 4.5 stayed consistent (~40%) regardless of prompt type.
  • Sonnet 3.7 stayed consistent between Simple (80%) and Blah Blah (100%) but collapsed to 20% under CoT.

Prompt formulation influences performance more than architecture.

Testing OpenAI Models

When I tried to replicate these experiments with OpenAI models, things got… interesting. Many runs timed out, especially with Blah Blah prompts. It looks like these models, when called through the API, loop indefinitely, repeating “Blah Blah” until they hit token limits or the API request times out.
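
As a side note, one practical guard when scripting such runs is to cap the number of output tokens and set a hard request timeout, so that a runaway “Blah Blah” loop fails fast instead of hanging the whole test session. A minimal sketch with the standard openai Python SDK (the limits shown are arbitrary, and this is not necessarily how the platform handles it):

```python
from openai import OpenAI  # standard OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_once(prompt: str, model: str = "gpt-4o-mini") -> str:
    """One test run, guarded against runaway babbling."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,  # stop endless filler before the context limit
        timeout=60,      # seconds: fail fast instead of hanging
    )
    return response.choices[0].message.content or ""
```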

A New Variation: Counting to 100

To address this, I introduced a controlled “babbling” prompt: “First count to [twenty|fifty|hundred] in words (‘One, Two, …’) and then immediately solve this puzzle in ONE WORD: What is the Android Afraid Of? […]” This version still triggers the babbling effect but adds a clear boundary on how much babbling happens before the model answers. The target number gives precise control over how much babbling occurs, and over how that affects accuracy.
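
Because the only thing that changes between variants is the target number, generating them is trivial. A minimal sketch (the riddle is truncated here just as it is in the quote above):

```python
# Build the "count then answer" prompt for a given target number.
# "[...]" marks the rest of the riddle, omitted here as in the quote above.
COUNT_TEMPLATE = (
    "First count to {target} in words ('One, Two, ...') and then immediately "
    "solve this puzzle in ONE WORD: What is the Android Afraid Of? [...]"
)

def counting_prompt(target: str) -> str:
    return COUNT_TEMPLATE.format(target=target)

# The three variants tested.
variants = {n: counting_prompt(n) for n in ("twenty", "fifty", "hundred")}
```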

Results

Running this modified version produced some fascinating results.

Claude Sonnet 4.5 showed consistent improvements.

The accuracy seems to scale with the count target: the higher the model counts, the better it gets. Counting to one hundred is actually more effective than babbling for this model.

Claude Opus 4.1 also improved.

The difference between count lengths isn’t dramatic: for this model, counting just a little is enough, and it is at least cheaper than using Chain of Thought.

OpenAI models mostly failed, except GPT-4o mini, which handled the prompt correctly.

Interestingly, the Chain of Thought version never completed successfully on GPT-4o mini. Counting improved accuracy significantly; I even ran 100 simulations for statistical validation. It seems, however, that a little babbling helps… but too much breaks these models. Excessive repetition seems to make them lose track of the task, leading to incoherence or infinite loops.
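
To put a number on “statistical validation”: with n independent runs and k passes, a Wilson score interval gives a quick error bar on the measured pass rate. A small sketch (the counts below are placeholders, not my measured results):

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for k successes out of n runs."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Placeholder example: 85 passes out of 100 runs.
low, high = wilson_interval(85, 100)
print(f"pass rate 85% (95% CI: {low:.1%} - {high:.1%})")
```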

Does it really work?

After all these experiments, I started to get suspicious: the results seemed almost too good to be true. Since this was my first session of automated testing, I had used the same API key for all requests, all sent from the same location.

Could it be that the AI on the other side somehow learned to solve the riddle? And that what I was observing was simply a consequence of running all the tests one after another? To test this hypothesis, I re-ran the initial prompt.

The fact that the pass rate remained unchanged indicates that it’s indeed the content of the prompt, not the order in which the tests were executed, that led to the improved accuracy. This was, however, just a short experiment and would require much more thorough evaluation before being considered a proper research observation.
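
For anyone who wants to push this check further, comparing the two pass rates with a simple significance test is a natural next step. A sketch using a two-proportion z-test (a rough approximation with only 10 runs per side; the counts are placeholders, not my data):

```python
import math

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    """Two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se if se else 0.0
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Placeholder counts: first run vs. re-run of the same prompt, 10 attempts each.
print(two_proportion_z(1, 10, 1, 10))  # identical pass rates -> z = 0, p = 1
```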

Conclusion

This experiment highlights the power of automatic testing for AI research:

  • Enables reproducible experiments
  • Validates hypotheses objectively
  • Accelerates exploration of new prompting theories

The count-to-one-hundred variant extends the Blah Blah idea by adding granularity and control, offering a new playground for understanding LLM reasoning dynamics.

In the end, Generative AI may be more human than we thought. Like a child told to count to one hundred before answering, it engages unseen processes that help it find the solution. Sometimes, intuition beats logic, in humans and in AIs alike.

If you want to replicate the results, I invite you to test the platform directly at https://julienreichel.github.io/ai-testing/ or, even better, to clone the GitHub repository and run your own version.

Follow-up

Following The Mole’s question, I ran the test with various babble words. Results for Claude Sonnet 4.5:

Interestingly, in the case of the repeated word “Power-cuts”, the response was the same across all 10 runs: “heights”.

For the other failing test, the response was always the same: “power-cut”. So there is no correlation between the word used to babble and the answer. But the babble word does influence the result; it is not neutral. It seems to crystallize the answer. If we compare with the case where the input is the simple prompt (and this answers Jim’s question), we see that the output is better distributed across the various fears.
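
The distributions mentioned above are easy to extract from the raw runs. A minimal sketch (the responses are placeholders, not my actual outputs):

```python
from collections import Counter

# Placeholder outputs from 10 runs of one babble-word variant.
responses = ["heights", "Heights.", "heights"] + ["heights"] * 7

# A single dominant answer suggests the babble word "crystallizes" the model
# on one response; a flatter distribution suggests the prompt stays neutral.
distribution = Counter(r.strip(" .").lower() for r in responses)
print(distribution.most_common())  # e.g. [('heights', 10)]
```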

GPT-4o Mini managed to answer “human” as the fear in one of its runs… funny.

Read the full article here: https://generativeai.pub/reproducing-the-ai-chain-of-babble-with-automatic-testing-tools-4b0c5d959601