Climbing the Ladder Too Far
Agentic AI has acquired the same hype trajectory that earlier AI categories experienced. Vendor slides show elaborate multi-agent architectures executing complex business processes autonomously. Real production workloads, two years after the agentic narrative began, are mostly simpler than the slides suggest. The gap between marketing and reality is not random. It is the predictable result of teams climbing an agentic ladder past the step that fit their actual need.
Anthropic's late-2024 research on agent design described the principle succinctly: the simplest architecture that solves the problem is the right architecture (Anthropic, "Building effective agents," 2024). Most teams over-engineer rather than under-engineer agentic systems. The cost of climbing too far on the ladder is paid in operational complexity, latency, and debugging difficulty rather than in obvious failure.
Knowing the ladder, and where each rung fits, prevents the over-engineering.
The Five Rungs
Agentic systems sit on a five-rung ladder. Each rung adds capability and operational cost. The right rung for your workload is usually the lowest one that solves the problem.
Rung one is a single-call LLM. Input goes in, output comes out, no tools, no iteration, no state. The simplest possible architecture. Often the correct one. A surprising number of "agent" projects could be done at rung one without the agent framing.
Rung two is a single LLM with structured outputs and validation. The LLM produces output in a defined schema. Validation runs after generation. Failed validation triggers a retry with feedback. Still a single agent, still no tools, but with reliability that rung one cannot provide for outputs that downstream systems consume.
Rung three is a single agent with tools. The LLM can call functions, APIs, retrieval systems. It iterates through a tool-use loop until the task completes. This is the rung that most production "agents" actually run on. Customer support agents, document analysis agents, code generation agents in their non-pathological forms. The agent has agency in the loose sense of being able to take actions, but coordination complexity is bounded because there is only one agent.
Rung four is a single agent with planning. The agent constructs an explicit plan, executes steps, and revises the plan based on results. The plan is a first-class artifact rather than emergent behavior. The complexity step up from rung three is significant because plan correctness and recovery from plan failure become new problems. Justified for workloads where the task decomposition is complex enough that emergent planning produces poor outcomes.
Rung five is multi-agent. Multiple agents coordinate to complete tasks. Each agent has specialized capability or context. The coordination is explicit (agents communicate, hand off, escalate) or implicit (agents work in parallel and a higher-level coordinator integrates results). The complexity step up from rung four is large. Operational overhead, latency variance, and debugging difficulty all scale faster than the capability added.
The progression matters because each rung pays a real cost. Skipping rungs ahead of need produces production systems that are harder to operate than they should be for the value they deliver.
When Each Rung Genuinely Fits
Rung one fits well-bounded, single-output tasks. Document classification, sentiment analysis, simple question answering on bounded knowledge, transformation tasks. The category is larger than it seems and a meaningful fraction of "agent" demand actually belongs here.
Rung two fits the same category as rung one plus workloads where output reliability matters. Anything feeding into a downstream system. Anything where the consumer needs to parse the LLM's response programmatically. Schema-validated generation is the production-grade default for outputs that are not direct human consumption.
Rung three fits tasks that require fetching information, calling external systems, or iterating with feedback. Customer support that needs to look up account details. Coding that requires reading files and executing commands. Research that requires multiple searches. The tool-use loop covers most real production agent workloads.
Rung four fits multi-step workflows where the steps are interdependent and the order matters. Complex document workflows, multi-stage analysis tasks, project planning. The plan-execute-revise pattern handles these meaningfully better than emergent tool use.
Rung five fits workloads where genuine specialization across agents produces value greater than the coordination cost. Research workflows with distinct gather-analyze-write phases that benefit from separate prompts. Workflows with parallel subtasks that benefit from concurrent execution. The number of such workloads is smaller than the vendor marketing suggests.
The Coordination Cost Multi-Agent Pays
Moving from rung four to rung five introduces three costs that single-agent architectures do not pay.
The first is communication overhead. Agents have to communicate their intermediate state, results, and decisions. The token economics of multi-agent often dominate. Two agents that each emit verbose reasoning to coordinate can cost 5-10x what a single agent solving the same task would cost.
The second is debugging surface area. When a multi-agent system produces a wrong outcome, the forensics work involves reconstructing the conversation between agents, identifying which agent introduced the wrong state, and figuring out what context that agent was operating with. The forensics scale worse than the agent count.
The third is latency variance. Multi-agent systems have long-tail latency because the slowest agent in any given run determines the total. User experience suffers in proportion to variance, not to mean.
These costs are paid in exchange for benefits that have to be real to justify them. Workloads where the benefits are speculative or marginal usually do not survive the operational cost at scale.
The Frameworks That Have Settled
Three framework families have settled into the production agentic stack in 2024-2025.
Tool-use frameworks built around the Anthropic and OpenAI APIs provide rung three and four capability with minimal framework overhead. Most production single-agent systems use these directly with thin internal abstractions.
Graph-based frameworks like LangGraph provide rung four and five capability with explicit state and step representation. The frameworks add operational complexity in exchange for observability and control over multi-step or multi-agent flows.
Specialized vertical frameworks (autogen, crew, swarm patterns) provide opinions about multi-agent coordination. These have niches but have not become broadly adopted because the opinions often do not match real production needs.
The framework choice matters less than the rung choice. A well-designed system on the right rung uses any framework adequately. A wrong-rung system has problems no framework fixes.
What Logiciel Does Here
Logiciel works with engineering teams designing or rescuing agentic systems. The most common engagement is rescue: a team built a multi-agent system that is operationally fragile and needs an honest assessment of whether to simplify, restructure, or rebuild against a lower-rung architecture.
The Multi-Agent Systems Architecture framework covers the orchestration tax in more depth. The Agentic Enterprise Workflows framework covers the four-axis assessment for whether multi-agent is justified.
A 30-minute working session is enough to assess your current or proposed system against the five-rung ladder.
Frequently Asked Questions
How do I know which rung my workload needs?
Start at the lowest rung that could plausibly solve the problem. Try it. If it works, you have your answer. The cost of trying rung one is hours. The cost of trying rung five and discovering rung three would have sufficed is months.
Can I move down the ladder after deploying higher?
Yes, and the move usually improves operational characteristics. Production systems that started at rung five sometimes simplify to rung four or three after operating for a few quarters and learning what coordination was actually necessary.
What is the maximum useful agent count?
For most production workloads, three or four. Above five, the coordination cost dominates and benefits rarely scale with count. The exceptions are very specific workflows with genuine parallelism that benefits from many concurrent agents.
How does cost scale across the ladder?
Inference cost typically scales roughly 2-3x per rung as you climb, holding workload constant. The cost increase is from additional context (each agent has its own context), additional tokens (agents communicate verbosely), and additional iterations (multi-agent systems often run more loops).
When is the multi-agent pattern actually worth the cost?
When genuinely specialized agents produce results that a single agent with the same context could not produce. This is rarer than the marketing suggests. Workloads where one agent with sufficient context performs comparably to a multi-agent decomposition are common and should usually stay single-agent. Sources: - Anthropic, "Building effective agents," 2024 - OpenAI, "GPT-4 with Function Calling," 2024