<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://johnwick.cc/index.php?action=history&amp;feed=atom&amp;title=This_embarrassingly_simple_secret_explains_all_of_AI</id>
	<title>This embarrassingly simple secret explains all of AI - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://johnwick.cc/index.php?action=history&amp;feed=atom&amp;title=This_embarrassingly_simple_secret_explains_all_of_AI"/>
	<link rel="alternate" type="text/html" href="https://johnwick.cc/index.php?title=This_embarrassingly_simple_secret_explains_all_of_AI&amp;action=history"/>
	<updated>2026-05-07T06:14:26Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.44.1</generator>
	<entry>
		<id>https://johnwick.cc/index.php?title=This_embarrassingly_simple_secret_explains_all_of_AI&amp;diff=2454&amp;oldid=prev</id>
		<title>PC: Created page with &quot;500px  Credits: Leonardo AI In college, there was a course called “PRML”, the infamous ML elective that pretty much everyone wanted to take. When I first heard about it, my seniors told me to take probability first. But why was probability important for ML? I had no idea.    If you like learning AI concepts through easy-to-understand diagrams, I’ve created a free resource that organises all my work i...&quot;</title>
		<link rel="alternate" type="text/html" href="https://johnwick.cc/index.php?title=This_embarrassingly_simple_secret_explains_all_of_AI&amp;diff=2454&amp;oldid=prev"/>
		<updated>2025-12-07T03:22:07Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;&lt;a href=&quot;/index.php?title=File:This_embarrassingly_simple_secret_explains_all_of_AI.jpg&quot; title=&quot;File:This embarrassingly simple secret explains all of AI.jpg&quot;&gt;500px&lt;/a&gt;  Credits: Leonardo AI In college, there was a course called “PRML”, the infamous ML elective that pretty much everyone wanted to take. When I first heard about it, my seniors told me to take probability first. But why was probability important for ML? I had no idea.    If you like learning AI concepts through easy-to-understand diagrams, I’ve created a free resource that organises all my work i...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;[[file:This_embarrassingly_simple_secret_explains_all_of_AI.jpg|500px]]&lt;br /&gt;
&lt;br /&gt;
Credits: Leonardo AI&lt;br /&gt;
In college, there was a course called “PRML”, the infamous ML elective that pretty much everyone wanted to take.&lt;br /&gt;
When I first heard about it, my seniors told me to take probability first.&lt;br /&gt;
But why was probability important for ML?&lt;br /&gt;
I had no idea.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you like learning AI concepts through easy-to-understand diagrams, I’ve created a free resource that organises all my work in one place — feel free to check it out!&lt;br /&gt;
Your free one-stop guide to AI in 2025&lt;br /&gt;
The only guide you&amp;#039;ll ever need to master AI and LLMs.&lt;br /&gt;
nikhilanand03.substack.com&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When I finally learnt probability, I realised something most ML textbooks don’t spell out:&lt;br /&gt;
&lt;br /&gt;
Every single AI model can be seen as a probability problem.&lt;br /&gt;
&lt;br /&gt;
When I realised that, the field became so much more intuitive.&lt;br /&gt;
In this blog, you’ll learn something I wish I knew when I first started: Probability is literally the foundation that the whole field of AI stands on.&lt;br /&gt;
&lt;br /&gt;
Meet Mario (and his mystery machines)&lt;br /&gt;
&lt;br /&gt;
First, let’s say “Hi” to Mario.&lt;br /&gt;
&lt;br /&gt;
Suppose Mario encounters a strange machine in the Mushroom Kingdom.&lt;br /&gt;
Every time he presses a button, the machine spits out either a Goomba or a One-Up mushroom.&lt;br /&gt;
&lt;br /&gt;
A machine spits out either a Goomba or a 1-up mushroom every time you press the button. (Image by author)&lt;br /&gt;
&lt;br /&gt;
Let’s say the machine has some internal parameter p that represents the probability of getting a 1-up.&lt;br /&gt;
For instance, if p = 0.5, the machine flips a fair coin, heads gives a 1-up, tails gives a Goomba.&lt;br /&gt;
&lt;br /&gt;
Now to keep it simple, let’s represent the goomba with the number 0, and the 1-up with the number 1.&lt;br /&gt;
&lt;br /&gt;
Here’s how the probability distribution looks for the machine.&lt;br /&gt;
&lt;br /&gt;
[[file:Probability_distribution_looks_for_the_machine.jpg|650px]]&lt;br /&gt;
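To make this concrete, here’s a minimal sketch of the machine in Python (the function name and the use of random.choices are my own illustration):&lt;br /&gt;

```python
import random

# A sketch of the mystery machine: each button press returns
# 1 (a 1-up) with probability p, otherwise 0 (a Goomba).
def press_button(p):
    return random.choices([1, 0], weights=[p, 1 - p])[0]

# With p = 0.5 the machine behaves like a fair coin flip.
random.seed(42)
presses = [press_button(0.5) for _ in range(10000)]
print(sum(presses) / len(presses))  # close to 0.5
```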
&lt;br /&gt;
Now if Mario has a lot of machines, he wants to be able to pick the machine with the highest probability of outputting a 1-up.&lt;br /&gt;
&lt;br /&gt;
But he needs to know p for that, which he doesn’t.&lt;br /&gt;
How does he figure out what p is?&lt;br /&gt;
&lt;br /&gt;
The challenge: estimating p&lt;br /&gt;
&lt;br /&gt;
Mario decides to run an experiment.&lt;br /&gt;
For one machine, he presses the button 8 times and gets these outputs:&lt;br /&gt;
&lt;br /&gt;
{1, 0, 0, 1, 0, 0, 0, 0}&lt;br /&gt;
&lt;br /&gt;
What do you think p is?&lt;br /&gt;
Mario’s intuition says 1/4. He got two 1-ups in 8 tries, which works out to roughly one 1-up every 4 tries.&lt;br /&gt;
&lt;br /&gt;
But Mario wanted to be rigorous. In his tough world, there was no room for error.&lt;br /&gt;
&lt;br /&gt;
Why intuition alone isn’t enough&lt;br /&gt;
Here’s the thing: p could still be anything.&lt;br /&gt;
&lt;br /&gt;
For instance, let’s say p=0.5. That means both 1 and 0 are equally likely.&lt;br /&gt;
&lt;br /&gt;
Even at p = 0.5, it’s still possible to get that exact sequence {1, 0, 0, 1, 0, 0, 0, 0}.&lt;br /&gt;
At every press, the probability of getting that particular number is 0.5.&lt;br /&gt;
&lt;br /&gt;
At every push of the red button, the probability of either 1 or 0 coming out is 0.5. (Image by author)&lt;br /&gt;
So if we multiply 0.5 by itself 8 times, we get the total probability of that exact sequence: 0.5⁸ = 1/256 ≈ 0.0039.&lt;br /&gt;
Introducing the likelihood function&lt;br /&gt;
At this point, what we know is that we got a sequence {1, 0, 0, 1, 0, 0, 0, 0} from the machine.&lt;br /&gt;
If we assume a particular value of the parameter p, what is the likelihood that we would get this sequence?&lt;br /&gt;
&lt;br /&gt;
At every push of the red button:&lt;br /&gt;
* To get a 1, the probability is p.&lt;br /&gt;
* To get a 0, the probability is (1-p).&lt;br /&gt;
&lt;br /&gt;
The likelihood of observing this sequence, given some value of p, is:&lt;br /&gt;
&lt;br /&gt;
[[file:The_likelihood_of_observing_this_sequence.jpg|650px]]&lt;br /&gt;
&lt;br /&gt;
L( {1,0,0…} | p ) reads as the “likelihood of getting the sequence {1,0,0…} given that the parameter is p”. (Image by author)&lt;br /&gt;
This is called the likelihood function.&lt;br /&gt;
It tells us: for a given value of p, how likely were we to observe this data?&lt;br /&gt;
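Here’s this likelihood function as a small Python sketch (the names are mine, for illustration):&lt;br /&gt;

```python
# Likelihood of observing a sequence of 0s and 1s, given parameter p:
# multiply p for every 1 and (1 - p) for every 0.
def likelihood(sequence, p):
    result = 1.0
    for x in sequence:
        result *= p if x == 1 else 1 - p
    return result

seq = [1, 0, 0, 1, 0, 0, 0, 0]
print(likelihood(seq, 0.5))   # 0.5**8 = 0.00390625
print(likelihood(seq, 0.25))  # 0.25**2 * 0.75**6 ≈ 0.0111
```

Notice that p = 0.25 already makes the observed sequence roughly three times more likely than p = 0.5.&lt;br /&gt;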
&lt;br /&gt;
Maximum likelihood estimator&lt;br /&gt;
&lt;br /&gt;
One of several ways to estimate p is to simply find the value of p at which the likelihood is maximised. This is aptly called the “maximum likelihood estimator”.&lt;br /&gt;
So we ask: at what value of p is this likelihood maximised?&lt;br /&gt;
At this point, the likelihood function looks like this:&lt;br /&gt;
&lt;br /&gt;
[[file:At_this_point,_the_likelihood_function.jpg|650px]]&lt;br /&gt;
&lt;br /&gt;
Mario plotted this function to visualise it.&lt;br /&gt;
&lt;br /&gt;
For any value of p, the y-axis tells us the likelihood of getting the sequence that we got. (Image by author)&lt;br /&gt;
The peak occurs at p = 0.25.&lt;br /&gt;
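You can confirm this peak numerically with a simple grid search (a sketch; the 1/1000 grid spacing is an arbitrary choice of mine):&lt;br /&gt;

```python
# L(p) = p^2 * (1 - p)^6 for our sequence with two 1s and six 0s.
# Scan a fine grid of p values and keep the one with the highest likelihood.
best_p = max((i / 1000 for i in range(1001)),
             key=lambda p: p**2 * (1 - p)**6)
print(best_p)  # 0.25
```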
&lt;br /&gt;
Wait. Mario was right earlier!&lt;br /&gt;
His intuition of 1/4 was correct.&lt;br /&gt;
But is it always this simple?&lt;br /&gt;
The general case&lt;br /&gt;
Let’s say we get a sequence of n outputs: {x₁, x₂, x₃, …, xₙ}, where each xᵢ is either 0 or 1.&lt;br /&gt;
&lt;br /&gt;
What’s the likelihood of this sequence?&lt;br /&gt;
Well, for any xᵢ = 0, the probability is (1-p).&lt;br /&gt;
For xᵢ = 1, the probability is p.&lt;br /&gt;
&lt;br /&gt;
We can write this compactly as:&lt;br /&gt;
&lt;br /&gt;
[[file:We_can_write_this_compactly_as.jpg|650px]]&lt;br /&gt;
&lt;br /&gt;
If you’re wondering how we got to this compact expression, think about it:&lt;br /&gt;
For any xᵢ = 0 in the sequence, the expression becomes (1-p); for any xᵢ = 1, it becomes p.&lt;br /&gt;
&lt;br /&gt;
To get the likelihood of the entire sequence, we multiply these probabilities across all data points:&lt;br /&gt;
&lt;br /&gt;
[[file:Pi_represents_a_product.jpg|650px]]&lt;br /&gt;
&lt;br /&gt;
Π represents a product: we multiply the terms (p or 1-p) across the n different data points. (Image by author)&lt;br /&gt;
Now we have a function that depends on both p and our observed data points xᵢ.&lt;br /&gt;
The optimisation problem&lt;br /&gt;
We want to find the p that maximises this likelihood.&lt;br /&gt;
We can write this as an optimisation problem:&lt;br /&gt;
&lt;br /&gt;
[[file:An_optimisation_problem.jpg|650px]]&lt;br /&gt;
&lt;br /&gt;
The optimal parameter p is the one that maximises the likelihood function. (Image by author)&lt;br /&gt;
Here’s a trick: since the logarithm is a monotonically increasing function, maximising L(p) is the same as maximising log(L(p)).&lt;br /&gt;
&lt;br /&gt;
Taking the log:&lt;br /&gt;
&lt;br /&gt;
[[file:Taking_the_log.jpg|650px]]&lt;br /&gt;
&lt;br /&gt;
Do you recognise this? This is the binary cross-entropy loss function (the two-class case of the categorical cross-entropy that neural networks use during training).&lt;br /&gt;
It looks complicated, but now you know that it stems from the simplest of ideas: maximise the likelihood of seeing the data.&lt;br /&gt;
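Here’s that connection in code: the negative log-likelihood below is exactly the cross-entropy loss summed over the data (a sketch; the function name is my own):&lt;br /&gt;

```python
import math

# Negative log-likelihood of a 0/1 sequence under parameter p.
# Term by term this is the binary cross-entropy:
#   -( x * log(p) + (1 - x) * log(1 - p) ), summed over the data.
def neg_log_likelihood(sequence, p):
    return -sum(x * math.log(p) + (1 - x) * math.log(1 - p)
                for x in sequence)

seq = [1, 0, 0, 1, 0, 0, 0, 0]
print(neg_log_likelihood(seq, 0.25))  # ≈ 4.499 (lower loss at the MLE)
print(neg_log_likelihood(seq, 0.5))   # ≈ 5.545
```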
&lt;br /&gt;
Solving for p&lt;br /&gt;
&lt;br /&gt;
Anyway, to optimise any function, we take the derivative and set it to zero.&lt;br /&gt;
If you’re unfamiliar with derivatives, check this out:&lt;br /&gt;
&lt;br /&gt;
Understanding derivatives from a falling apple&lt;br /&gt;
Apples, derivatives and AI have more in common than you think.&lt;br /&gt;
open.substack.com&lt;br /&gt;
&lt;br /&gt;
Setting it to zero and simplifying, we get:&lt;br /&gt;
&lt;br /&gt;
[[File:Setting it to zero and simplifying.jpg|650px]]&lt;br /&gt;
&lt;br /&gt;
And simplifying some more:&lt;br /&gt;
&lt;br /&gt;
[[file:And_simplifying_some_more.jpg|650px]]&lt;br /&gt;
&lt;br /&gt;
In other words, p equals the average of all the xᵢ values.&lt;br /&gt;
&lt;br /&gt;
That’s pretty intuitive&lt;br /&gt;
&lt;br /&gt;
The maximum likelihood estimate is just the average!&lt;br /&gt;
For example, if Mario gets the sequence {0, 1, 0, 0, 0, 1, 0, 1, 0}, the most likely p value is 3/9 ≈ 0.33, since there are three 1s and six 0s.&lt;br /&gt;
Or if he gets {1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0}, the estimate is 6/13 ≈ 0.46, since there are six 1s and seven 0s.&lt;br /&gt;
Simple.&lt;br /&gt;
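In code, the whole estimator collapses to one line (a sketch):&lt;br /&gt;

```python
# The maximum likelihood estimate of p is just the fraction of 1s,
# i.e. the sample mean of the sequence.
def mle_estimate(sequence):
    return sum(sequence) / len(sequence)

print(mle_estimate([0, 1, 0, 0, 0, 1, 0, 1, 0]))              # 3/9 ≈ 0.33
print(mle_estimate([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]))  # 6/13 ≈ 0.46
```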
&lt;br /&gt;
Why does this matter in AI?&lt;br /&gt;
&lt;br /&gt;
So far, the model we looked at just estimated a distribution in general.&lt;br /&gt;
It’s unsupervised.&lt;br /&gt;
&lt;br /&gt;
There’s no “input” to the model, just an output when you press a button.&lt;br /&gt;
But what if the machine Mario was using was an LLM?&lt;br /&gt;
&lt;br /&gt;
LLMs are input-conditioned machines&lt;br /&gt;
&lt;br /&gt;
An LLM is an “input-conditioned machine”.&lt;br /&gt;
Mario needs to first pass a sentence as input before getting the output.&lt;br /&gt;
&lt;br /&gt;
The distribution is now conditioned on the input.&lt;br /&gt;
So when I ask “Who is POTUS?” I get one distribution, but when I ask “Who is the CEO of Apple?” I get another.&lt;br /&gt;
&lt;br /&gt;
The act of training the LLM is essentially the same as estimating this distribution (which is equivalent to finding the value of p like we did earlier).&lt;br /&gt;
Let me explain why.&lt;br /&gt;
&lt;br /&gt;
Training an LLM = estimating a distribution&lt;br /&gt;
&lt;br /&gt;
When you ask “Who is POTUS?” to an LLM, it internally frames a distribution f(xᵢ | “Who is POTUS?”).&lt;br /&gt;
Let’s say during training, the LLM has seen several paragraphs like this.&lt;br /&gt;
&lt;br /&gt;
[[file:LLM_has_seen_several_paragraphs.jpg|650px]]&lt;br /&gt;
&lt;br /&gt;
It has seen Trump as a continuation of that specific phrase twice and it has seen Obama as a continuation once.&lt;br /&gt;
&lt;br /&gt;
So similar to pressing the button on the machine, we get a list of all the possible answers the LLM has seen before:&lt;br /&gt;
&lt;br /&gt;
{“Obama”, “Trump”, “Trump”, …}&lt;br /&gt;
&lt;br /&gt;
When we’re training the LLM, it estimates its distribution simply based on what it has seen before.&lt;br /&gt;
Using the same MLE approach, the model estimates:&lt;br /&gt;
* p(Trump | “Who is POTUS?”) = 2/3 ≈ 0.67&lt;br /&gt;
* p(Obama | “Who is POTUS?”) = 1/3 ≈ 0.33&lt;br /&gt;
&lt;br /&gt;
Probability of “obama” is 1/3 and probability of “trump” is 2/3. (Image by author)&lt;br /&gt;
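A toy version of this counting-based estimate (the prompt and continuations are the hypothetical example above; the Counter-based layout is my own illustration):&lt;br /&gt;

```python
from collections import Counter

# Continuations the model has "seen" after a prompt (toy training data).
seen = {"Who is POTUS?": ["Obama", "Trump", "Trump"]}

# MLE of the conditional next-token distribution: count and normalise.
def estimate_distribution(prompt):
    counts = Counter(seen[prompt])
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}

print(estimate_distribution("Who is POTUS?"))
# ≈ {'Obama': 0.33, 'Trump': 0.67}
```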
When you think of LLM training this way, it’s a lot easier to see why it works, instead of looking at it from the perspective of some abstract “categorical cross-entropy minimisation”.&lt;br /&gt;
&lt;br /&gt;
The big picture&lt;br /&gt;
&lt;br /&gt;
When you interact with ChatGPT or Claude, you’re essentially querying a massive probability distribution.&lt;br /&gt;
The model has seen billions of text sequences during training. For each possible input, it’s estimated the distribution of next tokens that maximises the likelihood of all that training data.&lt;br /&gt;
That’s Maximum Likelihood Estimation at scale.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Conclusion&lt;br /&gt;
&lt;br /&gt;
In this blog, we covered how probability and Maximum Likelihood Estimation form the foundation of how language models learn.&lt;br /&gt;
&lt;br /&gt;
We started with Mario and his mystery machines to build intuition for the problem. Then we formalised it mathematically, showing that the MLE estimate is simply the average of observed outcomes.&lt;br /&gt;
Finally, we connected this to LLMs, showing that categorical cross-entropy, the loss function used in training, is just MLE in disguise.&lt;br /&gt;
&lt;br /&gt;
The key insight? When an LLM predicts the next word, it’s not doing anything magical. It’s estimating probability distributions that best explain the training data it’s seen.&lt;br /&gt;
This same principle, maximising likelihood, underlies nearly every machine learning model, from simple classifiers to GPT-4.&lt;br /&gt;
&lt;br /&gt;
Read the full article here: https://ai.gopubby.com/stop-learning-ai-backwards-e20bd7d7d2cb&lt;/div&gt;</summary>
		<author><name>PC</name></author>
	</entry>
</feed>