Why one agent is not enough

A single AI agent with tool access can handle straightforward multi-step tasks: research a topic, query a database, generate a report. Complex tasks are different. When a workflow requires different types of expertise, spans multiple systems, or genuinely benefits from a second perspective doing quality control, a single agent runs into hard limits: context window, reasoning depth, and the inability to specialise its prompt for conflicting roles at once.

Multi-agent systems address this by decomposing the task across specialised agents: a planner that breaks the work into subtasks, specialists that execute each one, a reviewer that evaluates intermediate outputs, and an orchestrator that coordinates the workflow. Each agent has a defined role, a focused set of tools, and a prompt tuned to its area of responsibility. The appeal is compelling: instead of one model trying to be good at everything, each model focuses on what it does best.

The reality is more complex. Coordination between agents introduces failure modes that single-agent systems do not have, and these failure modes are the primary risk in production deployments. Across our agentic-AI engagements, we have seen the same pattern repeatedly: the demo works, the small-scale evaluation works, and then production exposes the coordination layer as the weakest link.

Two numeric framings are worth keeping in mind before going further. Multi-agent systems consume substantially more tokens than single-agent approaches, often several multiples more (a directional industry-scale figure from published evaluations, not a benchmarked rate for any specific workload), because every inter-agent message is produced by one model invocation and then read as input by another.
And while early benchmarks on complex software-engineering tasks suggest multi-agent setups can achieve meaningfully higher task-completion rates than single-agent ones (an observed pattern across published evaluations, with magnitude varying substantially by task type and orchestration design), the gain is conditional on the coordination layer being engineered properly.

When does a problem genuinely require multi-agent architecture?

This is the question we ask before any orchestration design work begins. The naive approach is to chain multiple agents together because “agents are powerful.” The expert approach is to verify, before adding agents, that a single agent or plain automation cannot solve the problem within acceptable cost and quality bounds. A quick decision rubric we use:

| Signal | Single agent likely sufficient | Multi-agent likely justified |
| --- | --- | --- |
| Task decomposition | Steps share context; output of step N feeds step N+1 within one reasoning thread | Steps require genuinely different expertise, tools, or prompt framings |
| Quality control | Self-checking via tools (tests, validators) is sufficient | A second perspective produces measurably better outputs (e.g. coder-reviewer pairs) |
| Token economics | Task completes within a single context window | Task naturally exceeds context, or specialisation reduces per-agent prompt size |
| Failure cost | Errors are recoverable cheaply | Errors are expensive enough to justify a reviewer agent’s overhead |
| Concurrency | Sequential reasoning is fine | Subtasks are genuinely parallelisable |

If the right column is not the honest answer to most rows, the system does not need multi-agent architecture. Adding agents for the sake of architectural sophistication is the over-engineering failure mode this methodology exists to prevent.

Coordination patterns

When multi-agent is justified, the next decision is the coordination pattern. Four patterns dominate in practice, each with different reliability and flexibility characteristics.

Sequential pipeline.
Agent A completes its task and passes output to Agent B, which passes to Agent C. The pipeline is simple, predictable, and easy to debug: each agent’s input and output are visible, and failures localise to the agent that produced the bad output. The limitation: sequential processing cannot handle tasks that require iteration or feedback between agents.

Hierarchical delegation. A manager agent receives the task, decomposes it into subtasks, delegates each subtask to a specialist, collects the results, and assembles the final output. The manager handles planning and quality assessment; the specialists handle execution. This pattern mirrors human project management and works well for tasks with clear decomposition, but the manager’s planning capability becomes the ceiling for the whole system’s performance.

Collaborative discussion. Multiple agents converse in a shared transcript, building on each other’s contributions. A coder writes code, a reviewer critiques it, the coder revises, and the process iterates until the reviewer approves. Frameworks like AutoGen and CrewAI implement variants of this pattern. It is flexible and produces high-quality output through iteration. It is also the hardest to control, because the agents can enter unproductive loops, disagree without resolution, or generate excessive conversation that consumes context without advancing the task.

Event-driven orchestration. A workflow engine dispatches tasks to agents based on events and conditions, with no single manager agent. Each agent registers capabilities and responds to task requests that match its specialisation. This pattern scales well and decouples agents from each other, but requires a robust orchestration layer handling task routing, failure recovery, and resource management.

The choice between patterns is rarely about elegance. It is about which failure modes you can afford and which you can engineer around.
Sequential pipelines fail visibly; collaborative discussions fail expensively; event-driven systems fail in ways that require strong observability to even detect.

How do multi-agent systems break in production?

The coordination patterns work in demos and controlled experiments. In production, they break in specific, predictable ways. Naming the failure classes is the first step in designing against them.

Do agents lose context between handoffs?

Yes, and this is the single largest determinant of multi-agent system reliability in our experience. When Agent A passes output to Agent B, the information about why Agent A made its decisions (the reasoning, the alternatives considered, the confidence level) is typically lost. Agent B receives the output but not the context. If Agent B needs to make a judgment call about Agent A’s output (should it trust this data? verify it? request clarification?), it lacks the information to make that judgment well.

The fix is structured handoff protocols that carry not just the output but the reasoning, an explicit confidence assessment, and flags for cases where the upstream agent was uncertain. JSON-shaped handoffs with declared schemas, validated at the boundary, work better than free-form text. The overhead is real, but it is the difference between a system that degrades gracefully and one that hides its uncertainty until the final output is wrong.

Do agents hallucinate coordination?

Yes. An agent asked to “verify the output of the previous step” may generate a verification response that looks plausible but does not actually check anything: it hallucinates the verification. An agent asked to “delegate this subtask to the database specialist” may produce a paragraph describing what the database specialist would do, rather than actually invoking the specialist. These hallucinated coordination actions are dangerous because they appear correct in the conversation transcript while producing no real result.
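To make the failure concrete, the sketch below flags an agent turn whose text claims a delegation or verification while no matching tool call was recorded. It is a minimal illustration, not a production detector: the `AgentTurn` and `ToolCall` shapes, the tool names `delegate` and `verify`, and the keyword lists are all assumptions introduced here, and keyword matching will miss paraphrases and raise false positives.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str          # hypothetical tool name, e.g. "delegate" or "verify"
    arguments: dict


@dataclass
class AgentTurn:
    agent: str
    text: str
    tool_calls: list[ToolCall] = field(default_factory=list)


# Assumed phrases suggesting the agent is *claiming* a coordination action.
CLAIM_MARKERS = {
    "delegate": ("delegat", "hand off", "handing off"),
    "verify": ("verif", "checked", "validated"),
}


def unbacked_claims(turn: AgentTurn) -> list[str]:
    """Return coordination actions the text claims but no tool call backs."""
    called = {call.name for call in turn.tool_calls}
    text = turn.text.lower()
    return [
        action
        for action, markers in CLAIM_MARKERS.items()
        if any(marker in text for marker in markers) and action not in called
    ]
```

A turn that says "I have delegated this to the database specialist" with an empty `tool_calls` list is reported as an unbacked `delegate` claim; the same text accompanied by a recorded `delegate` call is not.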
The fix is tool-enforced coordination rather than prompt-based coordination. Delegation should trigger an actual agent invocation through a function call, not a text description of delegation. Verification should check actual outputs (compare against ground truth, run automated tests, query a database), not generate a narrative about verification. The orchestration framework, whether LangGraph, AutoGen, or a custom runtime, has to make hallucinated coordination structurally impossible, not merely discouraged in the prompt.

Do agents enter unbounded loops?

Yes. A coder-reviewer loop can iterate indefinitely: the coder makes a change, the reviewer finds a different issue, the coder addresses it, the reviewer finds another issue. Without explicit termination conditions, the loop consumes tokens and compute without converging. We have observed multi-agent systems consume hundreds of thousands of tokens on a single task without producing a final output, because the agents were trapped in a refinement loop with no convergence criterion.

The fix is explicit loop bounds (maximum iterations), convergence detection (terminate when the diff between iterations falls below a threshold), and escalation protocols (after N iterations, escalate to a human rather than continuing indefinitely). Loops without budgets are the agentic equivalent of an infinite recursion, and they cost real money.

Do agents conflict on shared state?

Yes. When multiple agents can modify shared resources (a document, a database, a code file), concurrent modifications produce conflicts. Agent A modifies section 3 of a document while Agent B modifies section 5 based on a different version, and the final artefact contains inconsistencies that neither agent would have produced alone.
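The scenario above can be reproduced in a few lines. The sketch below is one assumed way to surface the conflict instead of silently merging it: every write names the version it was based on, and a write against a stale version is rejected. `VersionedDocument` and `StaleWriteError` are hypothetical names, not part of any framework mentioned in this article.

```python
from dataclasses import dataclass, field


class StaleWriteError(Exception):
    """Raised when a write is based on a version that is no longer current."""


@dataclass
class VersionedDocument:
    sections: dict[int, str] = field(default_factory=dict)
    version: int = 0

    def read(self) -> tuple[int, dict[int, str]]:
        """Return the current version and a copy of the sections."""
        return self.version, dict(self.sections)

    def write(self, based_on: int, section: int, text: str) -> int:
        """Apply an edit only if it was made against the current version."""
        if based_on != self.version:
            raise StaleWriteError(
                f"edit to section {section} was based on v{based_on}, "
                f"but the document is at v{self.version}"
            )
        self.sections[section] = text
        self.version += 1
        return self.version
```

If both agents read at version 0, Agent A's write to section 3 succeeds and advances the document to version 1. Agent B's write to section 5, still based on version 0, raises `StaleWriteError`, so Agent B has to re-read the document before retrying.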
The fix follows standard concurrency engineering: serialised access where one agent at a time modifies a resource, versioned state where each modification is applied to a specific version and conflicts are detected, or resource partitioning where each agent owns specific resources and no other agent touches them. The point is that “agents are AI” does not exempt them from the same race conditions that any concurrent system faces.

Production multi-agent architecture

Deploying a multi-agent system in production requires engineering well beyond the agent logic itself.

Observability has to be planned from day one. Every agent action, tool invocation, and inter-agent communication needs to be logged with enough detail to reconstruct the complete execution trace. When the system produces an incorrect output, the trace reveals which agent produced the error, what input it received, and what reasoning it followed. Without that, debugging a multi-agent failure is significantly harder than debugging a single-agent one, and “significantly harder” tends to mean engineers staring at a transcript trying to guess which step went wrong.

Cost management is the second pillar. Multi-agent systems consume tokens multiplicatively: each agent processes its own context, and every inter-agent message adds to the total. As an illustrative figure (an observed pattern across our agentic-AI engagements, not a benchmarked industry rate): a five-agent system processing an average task across ten rounds of communication can consume on the order of 50–100× the tokens of a single-agent approach. The arithmetic behind that range is simple: if every agent takes a turn in every round, that is roughly 50 model calls, each re-reading shared context that grows as the exchange proceeds. The cost must be managed through efficient prompt design, context-window discipline, and explicit bounds on communication rounds.

Graceful degradation is the third. When one agent fails (produces an error, times out, or returns low-quality output), the system must handle the failure without cascading.
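One way to keep a single failed tool call from cascading is to retry only transient errors, back off between attempts, and escalate with context once retries are exhausted. The sketch below assumes the retry defaults from the control-policy template later in this article (two retries, exponential backoff from a 2 s base, retry scoped to the tool call). `TransientError`, `Escalation`, and `call_with_retry` are hypothetical names, and reading "exponential, base 2 s" as a 2 s delay that doubles on each retry is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


class Escalation(Exception):
    """Raised when retries are exhausted; carries context for the escalation target."""

    def __init__(self, tool: str, attempts: int, last_error: Exception):
        super().__init__(f"{tool} failed after {attempts} attempts: {last_error}")
        self.tool = tool
        self.attempts = attempts
        self.last_error = last_error


def call_with_retry(
    tool: str,
    fn: Callable[[], T],
    max_retries: int = 2,
    base_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run one tool call; retry transient errors with backoff, then escalate."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn()
        except TransientError as err:
            if attempts > max_retries:
                # Retries exhausted: escalate instead of failing silently.
                raise Escalation(tool, attempts, err) from err
            # Assumed schedule: 2 s, 4 s, 8 s, ... for successive retries.
            sleep(base_delay_s * 2 ** (attempts - 1))
```

Any exception that is not a `TransientError` propagates immediately, which matches the template's rule that logic failures are not retried. The `sleep` parameter is injectable so the backoff schedule can be exercised in tests without waiting.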
In our observation, multi-agent failures cascade faster than single-agent failures when coordination protocols do not include explicit failure handling. The agentic AI system design principles for failure handling are not optional for multi-agent systems; they are the price of admission.

A note on the underlying model layer. Most production multi-agent systems we have worked with run on LLM-based orchestration: agents are LLMs with prompts, tools, and a coordination framework around them. This is structurally different from multi-agent reinforcement learning (MARL), where agents learn coordination policies through interaction with an environment. MARL is the right frame for problems like robotic swarms, traffic optimisation, or game-playing agents, where the coordination policy itself is the thing being learned. LLM-based multi-agent is the right frame for problems where the agents are wrapping general-purpose reasoning and tool use. The methodology in this article addresses the latter; the failure modes of the former (non-stationarity, credit assignment, reward shaping) are a different discipline.

Multi-agent control-policy template

The failure modes above are preventable when each agent operates under explicit control policies. The template below defines the parameters we configure for every production multi-agent deployment. Values are defaults; adjust per workload.

| Policy category | Parameter | Default | Notes |
| --- | --- | --- | --- |
| Retry policy | max_retries_per_agent | 2 | Retries on transient errors (timeouts, rate limits). Not on logic failures. |
| Retry policy | retry_backoff | Exponential, base 2 s | Prevents thundering-herd on shared resources. |
| Retry policy | retry_scope | Per tool call | Retry the failed tool invocation, not the entire agent turn. |
| Escalation policy | escalate_after_retries | true | After max_retries_per_agent exhausted, escalate rather than fail silently. |
| Escalation policy | escalation_target | Human-in-the-loop | Options: parent agent, fallback agent, human queue. |
| Escalation policy | escalation_context | Full trace | Include agent reasoning, inputs received, and failure details in escalation payload. |
| Loop bounds | max_iterations | 5 | Hard cap on coder-reviewer or refinement loops. |
| Loop bounds | convergence_threshold | Δ < 5 % between iterations | Terminate early when changes between iterations fall below threshold. |
| Loop bounds | loop_cooldown | 0 s | Optional delay between iterations to allow state propagation. |
| Timeout policy | agent_turn_timeout | 120 s | Maximum wall-clock time for a single agent turn including tool calls. |
| Timeout policy | pipeline_timeout | 600 s | Maximum wall-clock time for the full multi-agent pipeline. |
| Timeout policy | idle_timeout | 30 s | Kill agent if no progress (no tool call, no output token) within window. |
| Cost circuit-breaker | max_tokens_per_task | 100 000 | Hard token budget for the entire task across all agents. |
| Cost circuit-breaker | max_cost_per_task | Configurable per tier | Dollar-denominated cap; prevents runaway spend on refinement loops. |
| Cost circuit-breaker | alert_threshold | 70 % of budget | Emit warning when token or cost consumption crosses threshold. |

These defaults are observed-pattern defaults from our engagements, not benchmarked thresholds. They prevent the most common production failures: unbounded refinement loops that consume hundreds of thousands of tokens, silent failures that cascade downstream, and hallucinated coordination that bypasses actual tool invocations. Every parameter should be logged to the observability layer so that post-incident analysis can trace which bound was hit and why.

FAQ

What is a multi-agent system, and how do its agents coordinate?
A multi-agent system decomposes a task across multiple specialised agents: typically a planner, one or more specialists, sometimes a reviewer, and an orchestrator. Coordination happens through one of four patterns: sequential pipeline (linear handoff), hierarchical delegation (manager and specialists), collaborative discussion (shared transcript with iteration), or event-driven orchestration (workflow engine dispatching to capability-registered agents).
The choice depends on which failure modes are tolerable for the workload.

When does a problem genuinely require multi-agent architecture versus single-agent or plain automation?
When the work requires genuinely different expertise or prompt framings, when a second perspective measurably improves quality, when the task naturally exceeds a single context window, or when subtasks are genuinely parallelisable. If a single agent with tools can do the work within acceptable cost and quality, multi-agent is over-engineering. Demonstrate that the simpler approach is insufficient before adding agents.

How do multi-agent systems break in production: failure cascades, deadlocks, behavioural drift?
The four dominant failure classes are: context loss across handoffs (downstream agents make decisions without upstream reasoning), hallucinated coordination (agents describe delegation or verification rather than performing it), unbounded loops (refinement cycles with no convergence criterion), and shared-state conflicts (concurrent modifications producing inconsistencies). Each has a structural fix (schematised handoffs, tool-enforced coordination, explicit loop bounds, and concurrency-control primitives), and none of them are optional.

What design patterns govern inter-agent communication and responsibility decomposition?
Structured handoff protocols carrying reasoning and confidence alongside output, tool-enforced delegation that triggers actual agent invocations rather than narrative descriptions, explicit role boundaries enforced through focused prompts and tool access, and shared-state discipline through serialisation, versioning, or partitioning. The patterns matter less than the principle: coordination must be a first-class engineered surface, not an emergent property of prompts.

How do I monitor a multi-agent system in production and detect coordination failure early?
Log every agent action, tool invocation, and inter-agent message with enough detail to reconstruct the trace. Track token and cost consumption per task against explicit budgets, with alerts at 70 % of budget. Monitor loop iteration counts, convergence deltas, and agent turn durations. Detect coordination failure through silent-failure signals (agents producing plausible text without corresponding tool calls) and through escalation-rate trends. Without this telemetry, multi-agent debugging is guesswork.

How does multi-agent reinforcement learning differ from LLM-based multi-agent orchestration?
Multi-agent reinforcement learning (MARL) learns coordination policies through environment interaction: the policy itself is the learned artefact, and the discipline centres on non-stationarity, credit assignment, and reward shaping. LLM-based multi-agent orchestration wraps general-purpose reasoning models with prompts, tools, and a coordination framework: the coordination logic is engineered, not learned. MARL is the right frame for robotic swarms, traffic optimisation, and game agents; LLM-based multi-agent is the right frame for knowledge-work automation built on general-purpose models.

Multi-agent coordination failures are expensive to debug in production and straightforward to prevent in design. A GenAI Feasibility Assessment includes multi-agent system design and failure-mode analysis as part of the scoping work.