## The pattern determines the failure mode

When an LLM-based agent behaves unreliably, the problem is often not the underlying model — it is the agent architecture pattern. The same model can behave very differently when given a ReAct scaffolding versus a Plan-and-Execute scaffolding. Understanding which patterns exist and what failure modes each introduces is more useful than model selection for improving agent reliability.

## ReAct (Reasoning + Acting)

ReAct interleaves reasoning steps and tool calls in a single pass. The model generates a “thought,” takes an “action” (tool call), observes the result, and continues. This is the simplest widely used agent pattern.

- **Strengths:** Low overhead, fast iteration, easy to implement.
- **Failure modes:** Gets stuck in loops when early actions produce ambiguous results; reasoning and action get entangled (the model reasons about what to do but then calls the wrong tool); the context window fills up on long tasks.
- **Best for:** Short tasks with 2–5 steps, well-defined tool APIs, tasks where partial completion is acceptable.

## Plan-and-Execute

A planner generates the full task plan first, then an executor runs each step. The planner and executor may be separate model calls or the same model in different roles.

- **Strengths:** More coherent long-horizon behavior; easier to validate the plan before execution; allows parallelizing independent steps.
- **Failure modes:** Plans degrade for tasks requiring information gathered mid-execution; the planning LLM’s optimism produces unrealistic plans; replanning on failure is often not implemented, so execution halts.
- **Best for:** Tasks where the full plan can be determined upfront, multi-step research or document processing pipelines.

## Reflection / Self-Critique

After completing a task (or step), the agent evaluates its own output against criteria and iterates. This may involve a separate “critic” model call.

- **Strengths:** Catches obvious errors before delivery; improves output quality for generation tasks.
- **Failure modes:** Self-critique loops that never converge; the model is often unable to identify the specific error type it just made; adds latency without always improving quality.
- **Best for:** Document generation, code with verifiable outputs, tasks with clear quality criteria.

## Pattern comparison

| Pattern | Steps handled well | Main failure mode | Typical latency multiplier |
|---|---|---|---|
| ReAct | 2–5 steps | Loops, context overflow | 1× |
| Plan-and-Execute | 5–15 planned steps | Plan–reality mismatch | 1.5–2× |
| Reflection | Quality-sensitive tasks | Non-convergent loops | 1.3–1.8× |
| Multi-agent delegation | Parallel specialized tasks | Coordination failures | 2–4× |

## What actually matters more than pattern selection

In our experience, agent reliability depends more on:

- **Tool API design** — poorly designed tools (ambiguous parameters, side effects, no error messages) cause more failures than architecture choices.
- **Context management** — trimming context intelligently rather than letting it fill with tool outputs.
- **Failure handling** — explicit retry logic with backoff, not just hoping the model will recover (see the sketch below).
- **Scope constraint** — narrowing the action space to only what’s needed for the task.

For multi-agent coordination specifically, *how multi-agent systems coordinate and where they break* covers the failure modes that emerge from agent-to-agent communication.
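To make the failure-handling and context-management points concrete, here is a minimal sketch of a tool-call wrapper that retries with exponential backoff and truncates long outputs before they reach the agent’s context. The `ToolError` exception, the `call_with_retries` helper, and the specific limits are illustrative assumptions, not the API of any particular framework.

```python
import random
import time


class ToolError(Exception):
    """Raised by a tool call that fails in a retryable way (hypothetical)."""


MAX_OUTPUT_CHARS = 4_000  # assumed cap; tune to the model's context budget


def truncate_output(text, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Trim long tool outputs so they do not flood the context window."""
    text = str(text)
    if len(text) <= limit:
        return text
    return text[:limit] + f"\n[... truncated {len(text) - limit} characters]"


def call_with_retries(tool_fn, *args, max_attempts: int = 3,
                      base_delay: float = 1.0, **kwargs) -> str:
    """Call a tool with explicit retry logic and exponential backoff,
    rather than hoping the model recovers from a bad observation."""
    for attempt in range(1, max_attempts + 1):
        try:
            return truncate_output(tool_fn(*args, **kwargs))
        except ToolError as exc:
            if attempt == max_attempts:
                # Return a clear error message as the observation
                # instead of failing silently.
                return f"TOOL_FAILED after {max_attempts} attempts: {exc}"
            # Exponential backoff with a little jitter before retrying.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))
```

Returning a structured error string as the observation gives the agent something it can reason about, which in practice tends to matter more than the scaffolding pattern wrapped around it.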
## How should you choose a pattern?

Start with the simplest pattern that satisfies the task requirements.

- **ReAct** for short tasks (3–8 steps) where each step depends on the previous result — search-and-answer, data lookup, simple API orchestration.
- **Plan-and-Execute** for complex tasks (8–20 steps) with known structure, where you need to validate a plan before committing to actions — we use it for multi-step data processing workflows, report generation, and code generation tasks.
- **Reflection** only when output quality is the primary constraint and latency is not — the evaluation step typically adds 30–50% to per-request LLM cost.

Our selection heuristic: start simple, escalate when the simpler pattern fails. ReAct fails when it makes locally optimal decisions without considering the full plan. Plan-and-Execute fails when the plan requires information gathered mid-execution. Reflection fails when the model cannot identify the specific error type it produced. Each failure mode signals escalation to the next pattern — or a different approach entirely. Add multi-agent patterns only when task parallelisation provides measurable value — the coordination overhead is real, typically a 2–4× latency multiplier.

## What monitoring does a production agent need?

Production agents need three monitoring layers: input monitoring (are requests within the expected distribution?), execution monitoring (is the agent completing tasks within expected step counts and latency?), and output monitoring (are results meeting quality thresholds?).

Input monitoring catches distribution shifts — if production requests look different from the requests the agent was tested against, reliability predictions no longer hold. We flag requests that contain unfamiliar tool names, unusually long inputs, or domain terminology not present in the evaluation dataset.

Execution monitoring tracks step count and token usage per request. A sudden increase in average step count indicates that the agent is encountering more difficult requests or that a tool has degraded (returning less useful results, forcing the agent to retry). We set alerts at 2× the baseline step count for investigation (see the sketch below).

Output monitoring evaluates a sample of agent outputs against quality criteria — either automatically (for structured outputs) or via human review (for free-text outputs). The review rate depends on the application’s risk profile: high-risk applications (financial advice, medical triage) require higher review rates than low-risk applications (content summarisation, data extraction).
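As a sketch of the execution-monitoring layer, the snippet below keeps a rolling baseline of per-request step counts and flags any request that exceeds 2× that baseline, mirroring the alert threshold described above. The `ExecutionMonitor` class and its in-memory storage are assumptions for illustration; a real deployment would emit these numbers to whatever metrics stack is already in place.

```python
from collections import deque
from statistics import mean


class ExecutionMonitor:
    """Tracks step count and token usage per request and flags requests
    that exceed a multiple of the rolling baseline (illustrative only)."""

    def __init__(self, window: int = 500, alert_multiplier: float = 2.0):
        self.step_history = deque(maxlen=window)  # rolling baseline window
        self.alert_multiplier = alert_multiplier
        self.alerts: list[dict] = []

    def record(self, request_id: str, steps: int, tokens: int) -> None:
        # Compute the baseline before adding the current request,
        # so an outlier does not inflate its own threshold.
        baseline = mean(self.step_history) if self.step_history else None
        self.step_history.append(steps)
        # Flag requests needing more than 2x the baseline step count,
        # which usually means harder requests or a degraded tool.
        if baseline is not None and steps > self.alert_multiplier * baseline:
            self.alerts.append({
                "request_id": request_id,
                "steps": steps,
                "baseline": round(baseline, 1),
                "tokens": tokens,
            })


# Example: record one finished request.
monitor = ExecutionMonitor()
monitor.record("req-001", steps=4, tokens=2300)
```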