AI Urgently Needs an Operating System No One Is Building
Why You Should Read This (and TL;DR)
If you’re an AI engineer, ML practitioner, or programmer trying to ride the AI wave, you’ve probably felt it: the ground under your feet is moving. The tools change weekly. The failures are unpredictable. Agent frameworks look promising, right up until they collapse in production. One day you feel ahead; the next day it feels like the entire field jumped sideways without you.
Here’s the important part: it’s not your fault. The current AI stack is missing the exact layer that would give you stability, reliability, and engineering clarity. This story is not another vague rant. It’s a map: a concrete, engineerable architecture that lets you regain control over models and agent swarms.
If you internalize the ideas here, you’ll be ahead of most practitioners, because you’ll know how to:
- reason about models at the mathematical level,
- build agent systems that don’t explode into chaos,
- design workflows with rollback, checks, and guarantees,
- and understand what the coming AI Operating System will look like — and how to build toward it.
AI is one of the largest and fastest-growing markets in the world, but right now it’s still unstable, half-built, and wide open.
The people who understand the missing kernel (mathematical, geometric, and orchestration-level) will be the ones shaping this industry in the coming years. Read on. Your future self as an AI professional will thank you.
TL;DR
- If your models behave weirdly in production and your agents keep spiraling into chaos, you’re not incompetent; the stack is missing a layer.
- We need an AI Kernel: a real operating-system-style layer between PyTorch/JAX and raw linear algebra that enforces jet-based derivative types, geometric consistency, composition laws, and topological isolation.
- On top of that, we need APEL, an Agent Process Execution Language: a modern BPEL-for-agents that brings compensation, typed contracts, correlation, checkpoints, and observability to multi-agent workflows.
- The good news: the math and distributed-systems theory already exist. You don’t need to reinvent the wheel — this article gives you the architectural blueprint, and the references at the end point you to the foundational papers and standards. Jets, geometric deep learning, Sagas, BPEL — the building blocks are documented. What’s missing is someone putting them together.
Why Current Agent Frameworks Can’t Fix This
LangChain, AutoGPT, CrewAI, AutoGen: they’re all built on the wrong abstraction layer. They’re Python libraries that provide convenience functions for calling agents. They’re not process kernels.
The difference matters:
- Libraries are optional; you can bypass them. Kernels are mandatory: every agent call goes through the kernel.
- Libraries provide utilities. Kernels provide guarantees.
- Libraries hope you follow best practices. Kernels enforce invariants.
- Libraries make things easier. Kernels make things possible.
You can’t retrofit compensation semantics onto LangChain. You can’t add real correlation sets to CrewAI. These require kernel-level primitives that don’t exist in any current framework.
The Missing Layer in Every AI Stack
Here’s what a production AI system actually runs on:
Figure 1. The hollow middle of today’s AI stack. Every layer below the framework has real infrastructure: CUDA manages your GPU, Linux manages your processes, decades of engineering hold it together. Now look at the middle. That’s not a placeholder; that’s what’s actually there: nothing. The crossed-out items aren’t a design choice; they’re the parts nobody has built yet. Figure created by the author using Manim
See that gap? Between the framework API and the raw linear algebra, there’s no intermediate structure:
- No mathematical type system.
- No geometric consistency layer.
- No compositional verification.
- And most critically for 2025: no AI agent orchestration engine.
That gap is where the AI Kernel should be. But before we explore the components further, let’s see what the complete stack would look like with those gaps filled.
The Complete Stack: OS + AI Kernel*
Here’s what a production AI system should run on, with the AI Kernel filling the gap. In the figure (conceptually):
- Left: the stack you’re shipping to production today.
- Right: the layer that should exist but doesn’t.
We have kernels for hardware (CUDA), kernels for operating systems (Linux), kernels for literally everything… except the mathematical and orchestration layer where AI actually lives.
The dotted box isn’t a wishlist. It’s technical debt we’re accumulating with every deployment — whether it’s a single AI application or a workflow of synchronous and asynchronous AI agents waiting to be orchestrated (i.e., via APEL).
Figure 2. The layer that doesn’t exist Your app sits on PyTorch. PyTorch sits on CUDA. CUDA sits on Linux. Linux sits on hardware. Every layer has a kernel providing guarantees: memory protection, process isolation, resource management. Now look at the zoomed box: that’s what should sit between your framework and the raw linear algebra. Jet types for derivative verification. Geometric awareness for non-Euclidean embeddings. Composition laws that actually compose. Agent orchestration that doesn’t devolve into infinite loops. It’s not radical, just the obvious layer nobody’s building. Figure created by the author using Manim
- Note: By “AI operating system” we mean a kernel for AI, not for hardware: an intermediate layer between frameworks and linear algebra that doesn’t manage GPUs, but protects models from bad math and agents from chaotic workflows, with geometric consistency, compositional verification, and agent orchestration. Today, that kernel simply doesn’t exist.
The Five Components of an AI Kernel
Just as a traditional OS kernel provides memory management, process isolation, file systems, and security, the AI Kernel would provide five analogous services for intelligent computation.
1. Jet-Extended Type System
The problem: Current autodiff systems compute derivatives but immediately discard the algebraic relationships between them. When you chain two layers together, the chain rule applies, but nothing verifies that it was applied correctly or that the derivative structure is preserved.
The mathematics: A k-jet at point x is the equivalence class of all functions sharing derivatives up to order k:
jₓᵏ(f) = jₓᵏ(g) ⟺ ∂ᵅf(x) = ∂ᵅg(x) for all |α| ≤ k
Jets form an algebra: you can add, multiply, and compose them with well-defined rules. This means “value plus all derivatives up to order k” becomes a first-class mathematical object — not just a number, but a structured type the kernel can verify and manipulate.
What this enables:
- Automatic detection of vanishing/exploding gradients before they cause training instability.
- Verified chain rule application across layer composition.
- Higher-order optimization with correctness guarantees.
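To make this concrete, here is a minimal Python sketch (not from any existing framework) of a 2-jet as a first-class value: it carries f(x), f′(x), and f″(x) and propagates them through composition via the chain rule, so a kernel could flag a vanishing gradient structurally instead of waiting for training curves to flatten. JAX’s experimental jet module (jax.experimental.jet) explores a related idea.
# Minimal sketch: a 2-jet carries the value plus first and second derivatives
# and propagates them through composition. Names are illustrative only.
import math
from dataclasses import dataclass

@dataclass
class Jet2:
    val: float   # f(x)
    d1: float    # f'(x)
    d2: float    # f''(x)

    def __add__(self, other):
        return Jet2(self.val + other.val, self.d1 + other.d1, self.d2 + other.d2)

    def __mul__(self, other):
        # Leibniz rule up to second order
        return Jet2(self.val * other.val,
                    self.d1 * other.val + self.val * other.d1,
                    self.d2 * other.val + 2 * self.d1 * other.d1 + self.val * other.d2)

def lift(outer_val, outer_d1, outer_d2, inner):
    # Chain rule for g(f(x)), given g, g', g'' evaluated at f(x)
    return Jet2(outer_val,
                outer_d1 * inner.d1,
                outer_d2 * inner.d1 ** 2 + outer_d1 * inner.d2)

def tanh_jet(x):
    t = math.tanh(x.val)
    return lift(t, 1 - t ** 2, -2 * t * (1 - t ** 2), x)

# Seed jet for the identity function at x = 5.0: value 5.0, derivative 1, second derivative 0
x = Jet2(5.0, 1.0, 0.0)
y = tanh_jet(tanh_jet(x))          # compose two saturating layers
if abs(y.d1) < 1e-3:               # the "kernel" can flag vanishing gradients structurally
    print(f"vanishing gradient detected: dy/dx = {y.d1:.2e}")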
2. Geometric Layer
The problem: Everything in current deep learning lives in flat Euclidean space ℝⁿ, but the structures we model often aren’t flat:
- Hierarchies (inherently hyperbolic)
- Periodic patterns (inherently toroidal)
- Probability distributions (inherently curved)
The evidence: Poincaré embeddings (Nickel & Kiela, 2017) demonstrated that hyperbolic space embeds tree structures with far less distortion than Euclidean space: on hierarchies like the WordNet noun taxonomy, hyperbolic embeddings with a handful of dimensions outperform Euclidean embeddings with orders of magnitude more dimensions. This isn’t theoretical; it’s measured on real datasets.
What the kernel provides: Explicit manifold types with native operations: geodesic distances, exponential maps, parallel transport. Automatic consistency checks when operations mix geometric contexts. Curvature tracking that flags when learned representations develop pathological geometry.
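As a hedged illustration of what a “manifold type” could look like, here is a small Python/NumPy sketch of a Poincaré-ball point type whose distance function is the hyperbolic geodesic distance and whose membership check is the kind of consistency guard a kernel could enforce automatically. The class and method names are hypothetical, not from an existing library (production-grade versions of these operations exist in libraries such as geoopt).
# Hedged sketch of a manifold type: distances follow hyperbolic geometry,
# and leaving the manifold is a type error, not a silent NaN later.
import numpy as np

class PoincareBall:
    """Points live in the open unit ball; distances are hyperbolic."""

    def distance(self, u, v):
        # Standard Poincaré-ball distance:
        # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
        sq = np.sum((u - v) ** 2)
        denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
        return float(np.arccosh(1.0 + 2.0 * sq / denom))

    def check_on_manifold(self, u, eps=1e-9):
        # The kind of consistency check a kernel could run on every update
        if np.sum(u ** 2) >= 1.0 - eps:
            raise ValueError("point left the Poincaré ball; update likely invalid")

ball = PoincareBall()
root, leaf = np.array([0.0, 0.0]), np.array([0.95, 0.0])
ball.check_on_manifold(leaf)
print(ball.distance(root, leaf))   # ~3.66: near the boundary, distances blow up,
                                   # which is exactly what lets trees spread out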
3. Composition Verification
The problem: Neural network layers compose by stacking. Dimensions match syntactically, but nothing verifies semantic correctness: Does fine-tuning preserve the safety properties you trained into the base model? Does merging two LoRA adapters produce consistent behavior? Currently, you deploy and pray. 🤔
The category-theoretic perspective: Layers are morphisms between objects (typed tensor spaces). Composition should preserve declared properties. Consider JAX’s vmap — it should satisfy:
vmap(f ∘ g) = vmap(f) ∘ vmap(g).
This is a functorial law. JAX doesn’t verify it; edge cases can silently violate it. What the kernel provides: declared invariants (Lipschitz bounds, equivariance properties, monotonicity) that the kernel verifies are preserved under composition, and type-level guarantees that model surgery maintains specified behaviors.
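For a concrete feel, here is a tiny JAX snippet that point-tests the law above on sample inputs. This is the only kind of verification available today (empirical spot checks), not the structural guarantee a kernel would provide; the functions f and g are arbitrary placeholders.
# Spot-check vmap(f ∘ g) == vmap(f) ∘ vmap(g) on a batch of sample inputs.
# Passing this check does not prove the law holds in general.
import jax
import jax.numpy as jnp

f = lambda x: jnp.tanh(x) * 2.0
g = lambda x: x ** 2 + 1.0

xs = jnp.linspace(-3.0, 3.0, 8)

lhs = jax.vmap(lambda x: f(g(x)))(xs)      # vmap(f ∘ g)
rhs = jax.vmap(f)(jax.vmap(g)(xs))         # vmap(f) ∘ vmap(g)

assert jnp.allclose(lhs, rhs), "functorial law violated on these inputs"
print("law holds on this sample:", bool(jnp.allclose(lhs, rhs)))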
4. Topological Security
The problem: In current architectures, any input can potentially affect any output. There’s no mathematical isolation between user contexts. This is why adversarial examples work: small perturbations in input space cause large changes in output because there’s no structural barrier to perturbation propagation.
The speculative idea: In physics, topologically protected states (quantum Hall effect, topological insulators) derive robustness from global invariants — properties that can’t be changed by local perturbations. Could neural architectures with toroidal or other non-trivial topology provide analogous protection? Could winding numbers create mathematically guaranteed separation between processing channels?
Honest caveat: This is the most speculative component. Quantum topological protection relies on properties that don’t directly transfer to classical computation. But the principle (robustness from global structure rather than local tuning) deserves exploration. At minimum, explicit topological structure could provide better isolation than the current “everything is connected to everything” default.
5. APEL: The Agent Process Execution Language*
The four components above address the mathematical foundations: how individual models compute, compose, and protect their internal representations. But there’s a fifth crisis that’s entirely orthogonal: what happens when you deploy multiple AI systems that interact with each other?
This is where the AI Kernel needs something beyond mathematics. It needs process orchestration: the same layer that operating systems provide for managing concurrent programs, but designed for the specific chaos of AI agents. This is the killer feature. If you read nothing else, read this section.
The Agentic AI Catastrophe Unfolding Right Now
Single AI models are already chaotic. They hallucinate. They forget context. They confidently produce wrong answers. We’ve learned to cope: human review, guardrails, retrieval augmentation.
Now we’re putting these chaotic systems in charge of orchestrating other chaotic systems.
- Agent A calls Agent B
- Agent B spawns Agents C and D
- Those call tools
- Tools trigger Agent E
Each agent “hallucinates” independently. Errors compound. Context fragments across the swarm. The result? Production systems are experiencing:
- Infinite delegation loops. Agent A asks Agent B who asks Agent A who asks Agent B…
- Orphaned processes. Spawned agents that never terminate, burning tokens forever
- Context amnesia. Agent chains where step 5 has no idea what step 1 was trying to accomplish
- Cascade failures. One agent’s hallucination becomes another agent’s trusted input
- Impossible debugging. When something breaks in a 12-agent chain, good luck finding where
- No rollback capability. Agent 7 fails, but agents 1–6 already sent emails, modified databases, called APIs
This is not hypothetical. This is happening in production right now. Companies are deploying multi-agent systems with LangChain, AutoGPT, CrewAI — and watching them spiral into expensive, unpredictable chaos.
We Solved This Problem in 2003. Then Forgot.
In the early 2000s, enterprises faced a similar crisis. Service-Oriented Architecture (SOA) promised that you could compose web services into complex business processes. The reality? Services calling services calling services, with:
- no coordination,
- no error handling,
- no way to manage long-running transactions.
The solution was BPEL, the Business Process Execution Language (IBM/OASIS, 2003). BPEL provided:
- Declarative workflow definitions. Specify the process structure, not just the code.
- Compensation handlers. If step 5 fails, here’s how to undo steps 1–4.
- Correlation sets. When an async response arrives, route it to the right process instance.
- Partner links with typed contracts. Services declare capabilities via WSDL.
- Long-running transaction management. Processes that span hours or days with proper state handling.
- Fault handlers. Structured error recovery, not just try/catch.
BPEL was the process kernel for web services. It sat above SOAP/WSDL and provided the orchestration layer that made complex service compositions manageable. Then microservices happened, REST won, and everyone forgot that orchestration was a solved problem. Now we’re reinventing it badly for AI agents.
APEL: Agent Process Execution Language
We need BPEL for agents, modernized for the realities of 2025. We are calling it APEL: the Agent Process Execution Language.
Here’s the mapping from SOA to Agentic AI:
Figure 3. BPEL → APEL: A visual mapping. Everything on the left has a modern equivalent on the right, evolved for LLMs, tool calling, and MCP. But notice the pink badges: Agent Rollback Chains, Context Threading, Hallucination Recovery. These are primitives BPEL never needed. Web services were deterministic; your GPT-4 agent swarm is not. That’s the gap current frameworks aren’t filling. Figure created by the author using Manim
The Seven Pillars of APEL
“All men dream: but not equally. Those who dream by night in the dusty recesses of their minds wake in the day to find that it was vanity: but the dreamers of the day are dangerous men, for they may act their dreams with open eyes, to make it possible.” — T.E. Lawrence, Seven Pillars of Wisdom (1926)
Pillar 1: Declarative Workflow Definition
Current agent frameworks are imperative Python spaghetti. You write code that calls agents that call code that calls agents. The workflow is implicit, buried in control flow.
APEL provides declarative workflow definitions. The workflow is visible. You can reason about it. You can visualize it. You can verify properties before deployment. If you've ever stared at a LangChain graph trying to figure out why Agent 3 is talking to Agent 7, this one's for you:
workflow CustomerSupportEscalation {
agents {
triage: TriageAgent with capabilities [classify, route]
specialist: TechSupportAgent with capabilities [diagnose, solve]
human: HumanEscalation with capabilities [review, override]
}
constraints {
max_delegation_depth: 5
max_total_agent_calls: 20
cycle_detection: enabled
}
flow {
start -> triage.classify
triage.classify -> switch {
case technical: specialist.diagnose
case billing: billing_agent.process_inquiry
case complex: parallel { specialist.diagnose, human.review }
}
specialist.diagnose -> specialist.solve
specialist.solve -> end
}
on_error {
log_error(context: full_execution_trace)
escalate_to_human(reason: workflow_failure)
}
}
See that on_error block? Remember it. By Pillar 2, you'll understand why it's the difference between "recoverable incident" and "CEO gets a call."
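Even without an APEL runtime, the core idea of Pillar 1 can be sketched in a few lines of Python: if the flow is data rather than control flow, you can check properties (for example, absence of cycles) before a single agent is invoked. The workflow dictionary below is a hypothetical mirror of the APEL example, not a real schema.
# Sketch: the workflow is plain data, so it can be verified before deployment.
from collections import defaultdict

workflow = {
    "constraints": {"max_delegation_depth": 5, "cycle_detection": True},
    "flow": {
        "start": ["triage.classify"],
        "triage.classify": ["specialist.diagnose", "billing_agent.process_inquiry"],
        "specialist.diagnose": ["specialist.solve"],
        "specialist.solve": ["end"],
        "billing_agent.process_inquiry": ["end"],
    },
}

def find_cycle(flow):
    """Depth-first search over the declared flow graph; returns a cycle path if found."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in flow.get(node, []):
            if color[nxt] == GRAY:                      # back edge means a cycle
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    return visit("start")

cycle = find_cycle(workflow["flow"])
print("cycle detected:" if cycle else "no cycles, safe to deploy", cycle or "")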
Pillar 2: Compensation Semantics
This is the most critical missing piece. When agent 5 fails in a 7-agent chain, what happens to the work agents 1–4 already did?
Currently: nothing. The emails are sent. The database is modified. The API calls are made. You’re left with a partially-executed mess.
The enterprise folks learned this lesson in 2003. The rest of us are learning it now, with APEL providing compensation handlers:
scope OrderProcessing {
execute {
inventory_agent.reserve_stock()
payment_agent.charge_card()
shipping_agent.schedule_delivery()
notification_agent.send_confirmation()
}
compensate {
// Runs in reverse order automatically
notification_agent.send_cancellation()
shipping_agent.cancel_delivery()
payment_agent.refund_card()
inventory_agent.release_stock()
}
on_failure {
execute_compensation()
escalate_to_human(
context: full_execution_trace,
failed_at: current_agent,
partial_state: executed_steps
)
}
}
If shipping_agent fails, the kernel automatically executes payment_agent.refund_card() and inventory_agent.release_stock(). The compensation runs in reverse order. The system returns to a consistent state. Without compensation, you’re left with an inconsistent state and a prayer that nobody audits the logs.
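Under the hood this is the classic Saga pattern. A minimal Python sketch, assuming each completed step registers its own undo action, looks like this; the agent calls are stubbed with print statements, and all names are placeholders rather than a real API.
# Sketch of the Saga/compensation pattern: executed steps register their undo
# actions; on failure, the undo actions run in reverse order.
class CompensatingScope:
    def __init__(self):
        self.compensations = []   # (description, undo callable), in execution order

    def run(self, steps):
        """steps: list of (description, do, undo) tuples."""
        for description, do, undo in steps:
            try:
                do()
                self.compensations.append((description, undo))
            except Exception as failure:
                print(f"step failed: {description} ({failure}); compensating...")
                self.compensate()
                raise

    def compensate(self):
        for description, undo in reversed(self.compensations):
            print(f"compensating: {description}")
            undo()                 # a real kernel would retry and audit this

def fail():
    raise RuntimeError("carrier API timeout")

try:
    CompensatingScope().run([
        ("reserve stock",     lambda: print("stock reserved"), lambda: print("stock released")),
        ("charge card",       lambda: print("card charged"),   lambda: print("card refunded")),
        ("schedule delivery", fail,                            lambda: print("delivery cancelled")),
    ])
except RuntimeError:
    print("workflow failed, but it ended in a consistent state")
# The reservation and the charge succeed, delivery fails, then the refund and the
# stock release run automatically, in reverse order, before the error surfaces.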
Pillar 3: Correlation and Context Threading
In async agent systems, responses arrive out of order. Agent B might respond before Agent A. External events trigger at unpredictable times. How do you route responses to the right process instance?
BPEL solved this with correlation sets. APEL extends it. Remember the “context amnesia” problem from earlier? Here’s the fix. Agent 5 will finally know what Agent 1 was trying to accomplish:
correlation CustomerContext {
keys: [customer_id, session_id, conversation_thread]
propagate: all_agents // Every agent in chain sees full context
persistence: durable // Survives agent restarts
compression: adaptive // Summarize if context exceeds token limits
}
receive async_response {
correlate_on: [customer_id]
timeout: 30m
on_timeout {
retry_count: 2
backoff: exponential
finally: escalate_to_human(reason: response_timeout)
}
}
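Mechanically, a correlation set is little more than a routing table keyed by the declared correlation keys. Here is a hedged Python sketch (all names illustrative) of how an out-of-order async reply still finds the process instance that is waiting for it.
# Sketch of correlation routing: a table keyed by the declared correlation keys.
from dataclasses import dataclass, field

@dataclass
class ProcessInstance:
    workflow: str
    context: dict = field(default_factory=dict)   # accumulated, durable context

    def resume(self, message):
        self.context.update(message)
        print(f"{self.workflow}: resumed with {message}")

class CorrelationRouter:
    def __init__(self, keys):
        self.keys = keys
        self.waiting = {}          # correlation key tuple -> waiting instance

    def _key(self, data):
        return tuple(data[k] for k in self.keys)

    def register(self, data, instance):
        self.waiting[self._key(data)] = instance

    def deliver(self, message):
        instance = self.waiting.get(self._key(message))
        if instance is None:
            print(f"no instance waiting for {message}; dead-letter it")
            return
        instance.resume(message)

router = CorrelationRouter(keys=["customer_id"])
router.register({"customer_id": "cust_8x7k2m"},
                ProcessInstance("CustomerSupportEscalation"))
# Later, an async reply arrives out of order and still finds its workflow:
router.deliver({"customer_id": "cust_8x7k2m", "diagnosis": "driver outdated"})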
That compression: adaptive line in the correlation block above? That’s the AI-specific addition BPEL never needed. Your web services didn’t have 128k context windows to manage.
Pillar 4: Typed Agent Contracts
BPEL had WSDL: services declared their operations, inputs, outputs, and fault types. Current AI agents have… nothing. An agent’s capabilities are described in natural-language prompts, if at all. System prompts are just suggestions. APEL provides guaranteed contracts. Here’s the difference:
agent TechSupportAgent {
capabilities {
diagnose: (issue: TechnicalIssue) -> Diagnosis
solve: (diagnosis: Diagnosis) -> Solution | Escalation
}
constraints {
max_tokens_per_call: 4000
max_retries: 3
confidence_threshold: 0.70
allowed_tools: [search_kb, run_diagnostic, create_ticket]
forbidden_actions: [delete_data, send_external_email, execute_code]
}
fault_types {
HallucinationDetected,
ConfidenceTooLow,
ToolCallFailed,
TokenBudgetExceeded,
ContractViolation
}
on_fault(HallucinationDetected) {
retry_with_lower_temperature()
if still_hallucinating: escalate_to_human()
}
on_fault(ConfidenceTooLow) {
request_human_verification()
}
}
The kernel enforces these contracts. If TechSupportAgent tries to call delete_data, the kernel blocks it. If it returns something that doesn't match the Diagnosis type, the kernel raises a type error. No more hoping agents behave correctly. Notice forbidden_actions? That's the guardrail your prompt engineering was trying to be.
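Here is a minimal Python sketch of what that enforcement could look like at call time: a kernel object that mediates tool calls and return values against the declared contract instead of trusting the prompt. The contract values mirror the APEL example above; the classes and method names are hypothetical.
# Sketch of contract enforcement: the kernel mediates tool calls and outputs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentContract:
    allowed_tools: set
    forbidden_actions: set
    output_type: type

class ContractViolation(Exception):
    pass

class Kernel:
    def __init__(self, contract):
        self.contract = contract

    def call_tool(self, name, tool: Callable, *args):
        if name in self.contract.forbidden_actions:
            raise ContractViolation(f"blocked forbidden action: {name}")
        if name not in self.contract.allowed_tools:
            raise ContractViolation(f"tool not declared in contract: {name}")
        return tool(*args)

    def check_output(self, value):
        if not isinstance(value, self.contract.output_type):
            raise ContractViolation(
                f"expected {self.contract.output_type.__name__}, got {type(value).__name__}")
        return value

@dataclass
class Diagnosis:
    summary: str

kernel = Kernel(AgentContract(
    allowed_tools={"search_kb", "run_diagnostic", "create_ticket"},
    forbidden_actions={"delete_data", "send_external_email", "execute_code"},
    output_type=Diagnosis,
))

kernel.call_tool("search_kb", lambda q: ["kb_123"], "printer offline")
kernel.check_output(Diagnosis(summary="driver out of date"))
try:
    kernel.call_tool("delete_data", lambda: None)
except ContractViolation as err:
    print(err)    # blocked forbidden action: delete_data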
Pillar 5: Deadlock and Loop Detection
Infinite delegation loops are among the most common production failures in multi-agent systems. Agent A calls Agent B for help, Agent B calls Agent A for clarification, and they loop forever, burning tokens.
This pillar exists because I once watched several thousand dollars evaporate in a few minutes. Agent A asked Agent B. Agent B asked Agent A. Rinse, repeat, goodbye rent money. APEL provides structural prevention:
workflow_constraints {
// Can be set globally or per-workflow
max_delegation_depth: 5
max_total_agent_calls: 20
max_tokens_budget: 100000
cycle_detection: enabled
on_cycle_detected {
break_cycle()
log_cycle_trace(
agents_involved: cycle_path,
total_iterations: count,
tokens_burned: sum
)
escalate_with_context(severity: high)
}
on_budget_exceeded {
graceful_shutdown()
checkpoint_current_state()
notify_admin(reason: token_budget_exceeded)
}
}
The kernel tracks the call graph in real time. When it detects A→B→A, it breaks the cycle immediately rather than letting it run until you hit API rate limits. That tokens_burned field in the log is for the post-mortem… and for the therapy.
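A sketch of the mechanism in Python: the kernel mediates every delegation, keeps the live call path, and refuses the call that would close a cycle or blow the budget. The limits mirror the APEL constraints above; everything else is illustrative.
# Sketch of structural loop prevention: refuse the call that would close a cycle.
class DelegationError(Exception):
    pass

class CallGraphGuard:
    def __init__(self, max_depth=5, max_calls=20):
        self.max_depth = max_depth
        self.max_calls = max_calls
        self.total_calls = 0
        self.path = []                         # active delegation chain, e.g. ["A", "B"]

    def enter(self, agent):
        self.total_calls += 1
        if self.total_calls > self.max_calls:
            raise DelegationError(f"call budget exceeded after {self.total_calls} calls")
        if agent in self.path:
            cycle = self.path[self.path.index(agent):] + [agent]
            raise DelegationError(f"delegation cycle detected: {' -> '.join(cycle)}")
        if len(self.path) >= self.max_depth:
            raise DelegationError(f"max delegation depth {self.max_depth} exceeded")
        self.path.append(agent)

    def exit(self):
        self.path.pop()

guard = CallGraphGuard()
guard.enter("AgentA")          # A starts
guard.enter("AgentB")          # A delegates to B
try:
    guard.enter("AgentA")      # B tries to delegate back to A: the kernel breaks the loop
except DelegationError as err:
    print(err)                 # delegation cycle detected: AgentA -> AgentB -> AgentA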
Pillar 6: Checkpoint and Resume
Long-running agent workflows, spanning hours or days, need to survive restarts, handle human-in-the-loop delays, and resume from interruptions. Your legal review workflow shouldn’t restart from scratch because someone rebooted a server. This is NOT 1995:
workflow LegalContractReview {
checkpointing: after_each_stage
persistence: durable
ttl: 30d // Workflow expires after 30 days
flow {
extraction_agent.extract_clauses() // Checkpoint 1
risk_agent.analyze_risks() // Checkpoint 2
await human.legal_review(timeout: 48h) // Checkpoint 3 - wait for human
revision_agent.incorporate_feedback() // Checkpoint 4
finalization_agent.prepare_final() // Checkpoint 5
}
resume_policy {
on_crash: from_last_checkpoint
on_timeout: notify_then_pause
on_human_cancel: execute_compensation
}
on_checkpoint {
persist(
state: current_execution_state,
context: accumulated_context,
artifacts: generated_documents
)
}
}
If the system crashes after risk_agent completes, it resumes from Checkpoint 2, not from the beginning. The human can take several days to review; the workflow persists and wakes up when they respond.
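The mechanism itself is not exotic. A hedged Python sketch of durable checkpointing, with the file path and stage functions standing in for real agent calls, looks like this; a production system would lean on a workflow engine such as Temporal for the same effect.
# Sketch of checkpoint-and-resume: persist state after each stage, skip completed
# stages on restart. Paths and stage bodies are placeholders.
import json
from pathlib import Path

CHECKPOINT = Path("legal_review_checkpoint.json")

STAGES = [
    ("extract_clauses",      lambda state: state.update(clauses=["indemnity", "sla"])),
    ("analyze_risks",        lambda state: state.update(risks=["unbounded liability"])),
    ("incorporate_feedback", lambda state: state.update(revised=True)),
]

def load_state():
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"completed": []}

def run_workflow():
    state = load_state()
    for name, stage in STAGES:
        if name in state["completed"]:
            print(f"skipping {name}: already checkpointed")
            continue
        stage(state)                               # the agent call, in reality
        state["completed"].append(name)
        CHECKPOINT.write_text(json.dumps(state))   # durable checkpoint after each stage
        print(f"checkpointed after {name}")

run_workflow()
# If the process dies between stages, rerunning run_workflow() picks up where it left off.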
See on_human_cancel: execute_compensation in the APEL workflow above? That’s Pillar 2 paying off. The pillars aren’t isolated; they’re load-bearing.
Pillar 7: Observability and Debugging
When a 12-agent workflow produces wrong output, how do you debug it? Currently: you can’t. You get the final output and have to guess where things went wrong. Every pillar before this was about preventing disasters. This one’s about understanding them when they happen anyway, because they will.
APEL provides first-class observability:
trace CustomerSupport-12345 {
metadata {
workflow: CustomerSupportEscalation
started: 2025-01-15T14:32:00Z
customer_id: cust_8x7k2m
correlation_id: corr_9f8e7d
}
events {
[00:00.000] START workflow=CustomerSupportEscalation
[00:00.100] INVOKE triage.classify
input={ticket: "printer not working", priority: medium}
[00:02.340] RETURN triage.classify
output={category: "technical", confidence: 0.92}
[00:02.341] BRANCH -> specialist.diagnose (reason: category=technical)
[00:02.450] INVOKE specialist.diagnose
input={issue: "printer not working", context: [...]}
[00:05.120] TOOL_CALL search_kb
query="printer offline troubleshooting"
[00:05.890] TOOL_RETURN search_kb
results=[{id: kb_123, relevance: 0.87}, ...]
tokens_used: 1,247
[00:08.200] CONFIDENCE_CHECK specialist
score=0.43, threshold=0.70, result=BELOW_THRESHOLD
[00:08.201] ESCALATE -> human.review
reason=low_confidence
context_snapshot={...}
tokens_total: 3,892
}
summary {
total_duration: 8.201s
agents_invoked: 2
tool_calls: 1
total_tokens: 3,892
outcome: escalated_to_human
cost_estimate: $0.078
}
}
Full execution traces. Every agent call, every tool invocation, every decision point. When something goes wrong, you can replay the exact sequence and see where the failure occurred.
That cost_estimate: $0.078 at the bottom? Multiply by a thousand daily workflows. Now you understand why Pillar 5's budget controls exist.
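As a final sketch, here is what first-class tracing can look like in plain Python: every agent and tool event is recorded as structured data with timing and token counts, so a failed run can be replayed and costed after the fact. The fields mirror the trace above; the per-token price in summary() is an assumed placeholder, not a real rate.
# Sketch of structured tracing: events are data, so they can be replayed and costed.
import time
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    t: float          # seconds since workflow start
    kind: str         # INVOKE, RETURN, TOOL_CALL, ESCALATE, ...
    detail: dict

@dataclass
class WorkflowTrace:
    workflow: str
    started: float = field(default_factory=time.monotonic)
    events: list = field(default_factory=list)
    total_tokens: int = 0

    def record(self, kind, tokens=0, **detail):
        self.total_tokens += tokens
        self.events.append(TraceEvent(time.monotonic() - self.started, kind, detail))

    def summary(self, usd_per_1k_tokens=0.02):   # assumed price, for illustration only
        return {
            "events": len(self.events),
            "total_tokens": self.total_tokens,
            "cost_estimate": round(self.total_tokens / 1000 * usd_per_1k_tokens, 4),
        }

trace = WorkflowTrace("CustomerSupportEscalation")
trace.record("INVOKE", agent="triage.classify", input="printer not working")
trace.record("RETURN", tokens=640, agent="triage.classify", category="technical")
trace.record("ESCALATE", reason="low_confidence")
print(trace.summary())   # e.g. {'events': 3, 'total_tokens': 640, 'cost_estimate': 0.0128}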
- Note: The APEL syntax shown throughout this article is illustrative pseudocode designed to convey semantic intent. A production implementation would require a formal grammar specification (likely as a DSL transpiling to a workflow engine like Temporal, or as a YAML/JSON schema with runtime validation). The design prioritizes readability and conceptual clarity over parsing concerns.
Read the full article here: https://pub.towardsai.net/ai-urgently-needs-an-operating-system-no-one-is-building-a7a836e17674