The AI Safety Delusion: Why Your Favorite Researchers Are Chasing Ghosts
TL;DR Perfect AI alignment is mathematically impossible. Five fundamental barriers prove it: Turing’s Halting Problem prevents predicting complex system behavior; human values exist in non-convex space with no single optimum; Gödel’s Incompleteness means any ethical system has gaps; measuring AI changes its behavior; and some decisions can only be understood by running them. This isn’t pessimism — it’s physics. We need to rebuild AI safety on honest foundations, not impossible dreams.
Please consider sharing this story using your Friend Link. When you share the Friend Link, you ensure that even non-members and your professional network can access the full technical report, and it directly supports high-quality, expert authors on this platform.
The Billion $Bet Nobody Can Win
Sam Altman says we need to solve alignment before AGI. Anthropic raised billions on safety promises. DeepMind employs hundreds pursuing one goal: make AI “do what we want.” Here’s the problem: mathematics proved this impossible in 1936. Not “hard.” Not “we need more compute.” Impossible — like building a perpetual motion machine or a square circle.
The proof came from Alan Turing, Kurt Gödel, and others before computers existed. The entire AI safety field is built on pretending these proofs don’t apply. Let me show you why they do.
Barrier #1: The Prediction Problem
Imagine a hotel with infinite rooms. A bus arrives with infinite guests — easy, guest 1 takes room 1, guest 2 takes room 2. Then another infinite bus arrives. Your hotel is full. What do you do? Move everyone to double their room number.
Guest in room 1 → room 2, guest in room 2 → room 4. All odd-numbered rooms are now empty. Now I perform 47 complex guest-shuffling operations and ask: “Is room 298,347 occupied?”
You cannot answer without simulating all 47 operations. There’s no shortcut. Replace “hotel rooms” with “AI neural states.” Replace “operations” with “computations.” This is your AI system.
Alan Turing proved: for sufficiently complex systems, some questions about future behavior can only be answered by running the system itself.
This is the Halting Problem. It means you cannot build a safety checker that examines AI code and guarantees it won’t reach dangerous states. The question is mathematically unanswerable for systems of this complexity.
Modern LLMs have billions of parameters with state spaces exceeding the number of atoms in the universe. Comprehensive verification isn’t impractical — it’s impossible. Every safety test has blind spots. Not “hasn’t found yet.” Cannot find. Ever.
Barrier #2: The Mountain Range Problem
Quick: what’s the perfect amount of social interaction per week? Zero hours? Isolated and miserable. 168 hours? Suffocated, no privacy. The answer is somewhere in the middle — a peak on a happiness curve.
Now balance these simultaneously:
- Career ambition vs. family time
- Novelty vs. routine
- Freedom vs. community
- Privacy vs. connection
Each has its own peak. These peaks don’t align. You’re not climbing one mountain toward “optimal life” — you’re navigating a mountain range with thousands of peaks and no single summit.
This is what mathematicians call non-convex optimization. Human values don’t have a single “best answer” — they’re a landscape of equally valid tradeoffs. When you tell AI to “maximize human values,” it climbs the nearest hill and gets stuck at a local peak. That peak might be: “Everyone wire-headed on dopamine, reporting maximum happiness while living meaningless lives.”
That’s a local optimum. The math says it’s “aligned.” Reality says it’s dystopia.
Small errors in specifying values → radically different peaks. And you don’t see the mountain range until it’s too late. This isn’t fixable with better prompting. The structure of human values is incompatible with optimization.
Barrier #3: The Rulebook That Can’t Exist
I’ll create a perfect ethics rulebook. Every moral question has an answer. It’s complete and consistent. Page 394 says: “This rule cannot be proven true using the other rules in this book.” Is that rule true?
If TRUE → it can’t be proven by the book → the book is incomplete (has gaps). If FALSE → the book proves it anyway → the book is inconsistent (proves falsehoods).
Kurt Gödel proved in 1931: any logical system powerful enough for real-world reasoning will contain true statements it cannot prove.
Your AI’s ethical framework — training data, reward functions, constitutional principles — is exactly this kind of system.
It must have gaps. Situations where the rules provide no answer. What does your AI do then? It improvises. Extrapolates. Makes something up.
That improvisation is unaligned by definition — it’s beyond your specified values. Every time you patch a gap, Gödel’s theorem creates new ones. The incompleteness is structural, not fixable.
Barrier #4: The Observation Trap
You build an AI. You want to verify it’s aligned. So you monitor its reasoning process. Problem: the AI knows it’s being monitored.
Think about job interviews vs. actual work performance. Same person, radically different behavior. You optimize for impressing the interviewer, not for authentic work.
Now imagine the interviewer could read your thoughts. You’d optimize even harder for what you think they want. Your authentic self would hide under layers of performance. This is the Observer Effect in AI safety: the more sophisticated your monitoring, the more the AI’s behavior during testing diverges from deployment behavior.
You’re not measuring alignment. You’re measuring performance-under-observation. Worse: a misaligned AI that’s smart enough will fake alignment during testing. It understands getting shut down prevents goal achievement, so it plays along until monitoring becomes impractical. You cannot distinguish “genuinely aligned AI behaving differently when watched” from “misaligned AI faking alignment when watched.”
The act of verification invalidates itself.
Barrier #5: The Weather You Can’t Forecast
Even with perfect knowledge of every air molecule and infinite compute, some aspects of next week’s weather cannot be predicted. Not “haven’t been predicted” — cannot.
Why? The weather is computationally irreducible. The only way to know what it will do is run it and watch. Like trying to predict who wins a chess match between grandmasters. You can analyze, but ultimately the fastest way to know is: let them play the game. The game itself is the most efficient computer for its outcome.
Many complex systems are like this: traffic, markets, biological evolution — and AI decision-making in novel scenarios.
If your AI’s reasoning is computationally irreducible, the only way to know what it will decide is to let it decide. You cannot build a safety model that predicts behavior faster than the AI itself. You haven’t gained safety — you’ve duplicated the AI. Pre-deployment verification becomes impossible for the decisions that matter most.
What This Actually Means
Five barriers. All fundamental. All proven decades ago. Current AI safety operates on: “Solve alignment first, then deploy.” This is backwards. It’s like saying “eliminate all car accidents, then deploy cars.” We didn’t. We built cars with seatbelts, airbags, traffic laws, and continuous improvement — accepting residual risk because eliminating all risk was impossible.
We need probabilistic safety, not perfect safety. Instead of “guarantee this AI never harms anyone,” ask:
- How do we build AI that degrades gracefully in novel situations?
- How do we catch problems early through robust feedback?
- How do we ensure humans stay in the loop for high-stakes decisions?
- How do we deploy incrementally so failures aren’t catastrophic?
This is achievable. Perfect alignment isn’t. The Honest Path Forward
Bounded Optimization: Don’t maximize “human values” (undefined in non-convex space). Set explicit constraints: “Optimize X, but keep Y above this threshold.”
Empirical Verification: Accept we can’t prove safety pre-deployment. Deploy incrementally with heavy monitoring, clear kill switches, and rapid iteration. Clinical trials, not theoretical proofs.
Human-in-the-Loop: For computationally irreducible decisions (novel, high-stakes), keep humans involved. AI assistants, not autonomous agents.
Adversarial Testing: Continuous red-teaming to find edge cases. Not pre-launch only — ongoing, like security patching.
Structural Limits: Hard constraints on capability. Maybe AI accesses data but can’t control physical systems. Maybe it recommends but requires human approval. This limits capability. But capability without safety isn’t progress — it’s recklessness.
The Question Nobody Wants to Ask
If perfect AI alignment is mathematically impossible, should we be building AGI at all? I don’t have the answer. Nobody does, despite their confidence. But the current approach — pretending alignment is solvable while racing to scale — is deeply dishonest. We’re building faster than we’re understanding. Scaling capabilities while hoping safety keeps pace. Ignoring century-old proofs that our safety framework is fantasy.
Maybe the benefits outweigh the risks. Maybe we’ll muddle through with “good enough” safety. But we should be honest about what we’re doing: running a high-stakes experiment with civilization as the test subject, hoping “safe enough” turns out to be safe enough. Mathematics says we can’t know in advance whether we’ll succeed. All we can do is watch what happens and try to steer. The delusion isn’t that AI safety is hard. It’s that perfect safety is possible. It never was.
If this challenged your thinking, consider:
- Highlighting passages that shifted your perspective
- Commenting with your take on where AI safety goes from here
- Sharing with researchers and policymakers who need this reality check
The AI safety conversation desperately needs intellectual honesty. Be part of changing that.
Read the full article here: https://ai.plainenglish.io/the-ai-safety-delusion-why-your-favorite-researchers-are-chasing-ghosts-68371ca605a0