AI Automation vs Ad-Hoc Tasks with LLM
When developing solutions on top of agent frameworks, you face a dilemma: how do you make the agent's work reliable and predictable without going overboard on tool integrations, since those integrations are what largely determine the agent's capabilities?

Those of you choosing the easy path of ready-made agents extended through MCP are martyrs. This kind of instant-fix mess falls apart quickly and causes major indigestion.

[[file:AI_Automation_vs_Ad-Hoc_Tasks_with_LLM.jpg|500px]]

Need repeatable results, stability, predictability, consistency, and easy debugging? Forget about it. An agent loop based on ReAct + MCP tools is pure alchemy; only prayers and a lucky rabbit's foot will help you here. The main reasons:

🔹 The model itself is non-deterministic, yet you are asking it to make the same decision given identical input data. There will be no idempotency: it will periodically go down different paths, calling tools with the wrong parameters or entirely wrong tools (especially when many tools come from the same domain).

🔹 You end up with far too many failure points. Each MCP tool is a wrapper on top of an API, and the ReAct agent makes LLM calls in a loop. At minimum you get these failure points: the API internals, the API↔MCP-server integration, the MCP-server↔agent-loop integration, failures inside the agent loop itself (wrong tool-call decisions, bad reflection, infinite loops, model API failures), plus plain code bugs.

If you multiply the failure probabilities of all these nodes, you get depressing numbers. Say you are a former pacemaker developer and every service you depend on was also built by people like you, so every node has 99.99% reliability. Multiply all the per-node reliabilities and raise the product to the power of the number of steps in the agent flow, and you get the expected workflow reliability. Roughly, for a 5-step flow:

(0.9999 (API) × 0.9999 (MCP) × 0.9999 (agent) × 0.9999 (framework))^5 ≈ 0.998, i.e. 99.8%.

Not even "three nines" anymore. Realistically, though, each of these components will be lucky to reach 95% reliability, and then the total reliability of such a flow comes out to around 35%. Only about one request in three completes successfully; the other two fail. This is roughly what we see in quick-fix agent solutions today. (The arithmetic is spelled out in a short sketch below.)

The problem is that business doesn't need 35% reliability; it needs at least 90%. So if you are aiming for B2B success, you need to look straight at the complex, long-term approach. What does that mean?

✓ Forget ReAct and CodeAct: build your own step planner (for example, as a DAG) and a runner to execute the steps (a minimal sketch of such a planner follows below).

✓ Forget MCP tools: build your own integrations with the APIs you need.

✓ Don't let the model make decisions; make them in code. Use the model for the three things it excels at: 1) extracting facts from unstructured data and returning them in structured form, 2) transforming one data representation into another, 3) generating content (see the last sketch below).

Overall, frame it as an engineering problem: "I need to keep everything under control." Make it rule-based rather than alchemical and AI-based.

What are the downsides? You will have to buckle down, do some design work, and write code. An agent solution built on your own business tools is harder to scale than an agent built on MCP tools, of which there are millions now (though 90% of them are garbage).
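To make the reliability arithmetic above reproducible, here is a minimal sketch. The per-node reliabilities and the step count are the example's assumptions, not measurements.

<syntaxhighlight lang="python">
# Expected end-to-end reliability of a multi-step agent flow,
# assuming independent failures at every node on every step.

def flow_reliability(node_reliabilities: list[float], steps: int) -> float:
    """Product of per-node reliabilities, raised to the number of steps."""
    per_step = 1.0
    for r in node_reliabilities:
        per_step *= r
    return per_step ** steps

# "Pacemaker-grade" nodes: API, MCP server, agent loop, framework.
print(flow_reliability([0.9999] * 4, steps=5))  # ~0.998 -> 99.8%

# Realistic nodes at 95% reliability each.
print(flow_reliability([0.95] * 4, steps=5))    # ~0.36 -> roughly one request in three succeeds
</syntaxhighlight>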
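And here is a minimal sketch of what "your own step planner as a DAG plus a runner" could look like. The step names and the `Step` structure are hypothetical; the point is that the execution order is fixed in code and the model never decides what runs next.

<syntaxhighlight lang="python">
# A deterministic plan: a DAG of steps executed in topological order.
from dataclasses import dataclass
from graphlib import TopologicalSorter  # stdlib since Python 3.9
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]   # takes the shared context, returns updates
    deps: tuple[str, ...] = ()    # names of steps that must run first

def run_dag(steps: list[Step]) -> dict:
    by_name = {s.name: s for s in steps}
    order = TopologicalSorter({s.name: set(s.deps) for s in steps}).static_order()
    context: dict = {}
    for name in order:            # same order on every run, by construction
        context.update(by_name[name].run(context))
    return context

# Hypothetical 3-step flow: fetch -> extract -> report.
plan = [
    Step("fetch",   lambda ctx: {"raw": "ticket #123: printer on fire"}),
    Step("extract", lambda ctx: {"facts": ctx["raw"].split(": ")[1]}, deps=("fetch",)),
    Step("report",  lambda ctx: {"report": f"ALERT: {ctx['facts']}"}, deps=("extract",)),
]
print(run_dag(plan)["report"])
</syntaxhighlight>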
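Finally, a sketch of the "decisions in code, extraction in the model" split. `call_llm` is a hypothetical stand-in for whatever model client you use: the model only turns unstructured text into a fixed JSON shape, and the branching logic stays in ordinary code.

<syntaxhighlight lang="python">
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with your provider's API call."""
    raise NotImplementedError

EXTRACT_PROMPT = (
    "Extract from the ticket below and return ONLY JSON with keys "
    '"customer", "severity" (low|medium|high), "summary".\n\nTicket:\n{ticket}'
)

def handle_ticket(ticket: str) -> str:
    # 1) The model does what it is good at: unstructured text -> structured facts.
    facts = json.loads(call_llm(EXTRACT_PROMPT.format(ticket=ticket)))

    # 2) Code, not the model, makes the routing decision.
    if facts["severity"] == "high":
        return f"page on-call about {facts['customer']}: {facts['summary']}"
    return f"queue for support: {facts['summary']}"
</syntaxhighlight>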
But if clients come to you saying, "Why don't you just build an agent based on Claude Desktop + MCP tools?", first clarify what scenarios they are planning for and what level of reliability they expect. It is quite possible they don't need high accuracy and reliability, and the main use case for the agent is routine ad-hoc tasks. In that case, Claude Desktop + MCP or ReAct + MCP tools could very well be the answer. Read the full article here: https://medium.com/@mne/ai-automation-vs-ad-hoc-tasks-with-llm-dfb6867ca64c