## Start narrow, widen with observability

The most common agent-building mistake is starting with the full vision: multi-tool, multi-step, autonomous operation from day one. This produces agents that fail unpredictably on edge cases that only emerge at scale — and without the observability infrastructure to diagnose why. Production agent development follows a narrow-first pattern: start with a single tool, a single goal, and a deterministic fallback — then widen incrementally with observability at each step. Each expansion adds one capability, measures its reliability in production, and only proceeds to the next capability once the current level achieves target reliability.

### The ReAct pattern: simplest viable agent loop

The ReAct (Reason + Act) pattern provides the simplest viable agent loop — a cycle of:

1. **Reason** — The LLM analyses the current state and decides what to do next
2. **Act** — The agent calls a tool with specified parameters
3. **Observe** — The tool returns a result that becomes part of the agent's context
4. **Repeat** — Until the goal is achieved or a stopping condition is met

```
User goal: "Find the quarterly revenue for Company X"
→ Reason:  "I need to search for Company X's financial reports"
→ Act:     search_tool("Company X Q4 2025 revenue")
→ Observe: "Company X reported $4.2B revenue in Q4 2025 (source: earnings call)"
→ Reason:  "I have the answer"
→ Return:  "Company X's Q4 2025 revenue was $4.2B"
```

The ReAct pattern requires explicit failure detection that most tutorials omit. What happens when the search returns no results? When the result is ambiguous? When the tool times out? Each failure path needs a defined response — retry with a modified query, try an alternative tool, or report inability to complete.
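Below is a minimal sketch of this loop in plain Python, with the failure paths made explicit rather than left to chance. It assumes two hypothetical hooks, `call_llm` for the model and `search_tool` for the single tool, plus a simple `SEARCH:`/`ANSWER:` prefix convention for the model's decision; none of this is a specific framework's API, just the shape of the loop.

```python
from dataclasses import dataclass

MAX_STEPS = 5  # hard stopping condition: bounds cost even when reasoning stalls


@dataclass
class ToolResult:
    ok: bool            # False on no results, timeout, or unparseable output
    observation: str


def call_llm(prompt: str) -> str:
    """Hypothetical model call. Assumed to return either
    'SEARCH: <query>' or 'ANSWER: <text>'. Wire up your own client here."""
    raise NotImplementedError


def search_tool(query: str) -> ToolResult:
    """Hypothetical single tool. Returns ok=False instead of raising,
    so the loop can decide how to respond to the failure."""
    raise NotImplementedError


def run_agent(goal: str) -> str:
    context = [f"Goal: {goal}"]
    for step in range(MAX_STEPS):
        # Reason: the model sees the accumulated context and decides the next move
        decision = call_llm("\n".join(context))
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()

        # Act: call the tool with the model-chosen query
        query = decision.removeprefix("SEARCH:").strip()
        result = search_tool(query)

        # Observe, including the failure case: record what happened so the next
        # Reason step can retry with a modified query or give up explicitly
        if result.ok:
            context.append(f"Step {step}: result for '{query}': {result.observation}")
        else:
            context.append(
                f"Step {step}: search for '{query}' failed or returned nothing. "
                "Rephrase the query, or answer that the task cannot be completed."
            )

    # Graceful degradation: an honest failure beats a fabricated answer
    return f"I cannot complete this goal within {MAX_STEPS} steps."
```

The two details tutorials usually drop are here on purpose: the failed tool call is written back into the context so the next reasoning step can adapt, and the step cap guarantees the loop terminates with an explicit "cannot complete" rather than running up an unbounded bill.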
### The narrow-first build sequence

| Phase | Capabilities | Success criterion | Common failure to watch for |
|---|---|---|---|
| 1. Single tool, single goal | One tool, deterministic task, hardcoded stopping | >95% success rate on 100 representative inputs | Tool output parsing failures |
| 2. Single tool, flexible goal | Same tool, variable user goals, LLM-driven planning | >90% on diverse inputs | Goal misinterpretation, unbounded execution |
| 3. Multi-tool, single goal | 2–3 tools, routing decision required | >85% with correct tool selection | Tool selection errors, context overflow |
| 4. Multi-tool, multi-step | Full agent loop with 3–10 step tasks | >80% task completion | Compounding errors across steps, cost explosion |
| 5. Multi-agent | Multiple specialised agents coordinating | >75% on complex workflows | Coordination failures, conflicting actions |

Each phase roughly doubles the complexity of the one before it. Skipping phases means encountering all the failure modes simultaneously, without the instrumentation to isolate which phase introduced each one.

### What production agents require that tutorials skip

**Token budget management.** Multi-step agents accumulate context across steps. Without explicit budget management, a 10-step task can consume 50,000+ tokens — roughly 10× the cost of a single LLM call. Production agents need context summarisation between steps, maximum step limits, and cost-per-task tracking.

**Idempotency and reversibility.** Agents that take actions (write to databases, send emails, modify files) must handle being interrupted mid-execution and restarted. Are the actions idempotent? Can partially completed tasks be safely retried? This is not a model problem — it is a systems engineering problem.

**Structured output validation.** When an agent calls a tool, the tool expects specific input formats. The LLM generating those inputs will occasionally produce malformed output (wrong JSON structure, missing required fields, hallucinated parameter names). Structured output validation between the LLM and the tool call — rejecting and re-prompting on malformed output — is essential for reliability.

**Graceful degradation.** When the agent cannot complete a task (tool unavailable, task too complex, ambiguous requirements), it must fail informatively rather than hallucinate completion. Explicit "I cannot complete this because…" responses are a sign of a well-engineered agent; silent failures or fabricated results are a sign of missing guardrails.

### Framework selection

The framework landscape (LangChain, LangGraph, CrewAI, AutoGen, Semantic Kernel, custom) matters less than the engineering patterns applied within it. Understanding why generative AI projects fail before they launch — scope inflation, evaluation gaps, demo-to-production underestimation — applies directly to agent development. Teams that choose a framework based on feature count rather than debuggability and observability support encounter the same failure patterns. The framework should make it easy to trace each step, replay failures, inject test inputs at any point in the loop, and measure reliability per step rather than only end-to-end.
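Whichever framework you choose, the per-step record it needs to expose (or let you emit) is small. Here is a sketch of one possible shape, assuming a plain append-only JSONL log; the field names and the `agent_traces.jsonl` path are illustrative, not any framework's built-in schema.

```python
import json
import uuid
from dataclasses import dataclass, asdict


@dataclass
class StepTrace:
    """One record per loop iteration: enough to replay the step offline
    and to measure reliability per step, not only per task."""
    task_id: str
    step: int
    reasoning: str     # the model's stated plan for this step
    tool: str          # which tool was selected
    tool_input: dict   # exact arguments passed, so the step can be replayed
    tool_output: str   # raw observation returned
    ok: bool           # whether the step succeeded by its own criterion
    latency_s: float
    tokens_used: int   # feeds cost-per-task tracking


def log_step(trace: StepTrace) -> None:
    # Append-only JSONL keeps traces greppable and replayable;
    # swap this for your observability backend of choice.
    with open("agent_traces.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(trace)) + "\n")


# Illustrative usage from inside the agent loop:
log_step(StepTrace(
    task_id=str(uuid.uuid4()),
    step=0,
    reasoning="Need Company X's latest quarterly revenue figure",
    tool="search_tool",
    tool_input={"query": "Company X Q4 2025 revenue"},
    tool_output="Company X reported $4.2B revenue in Q4 2025",
    ok=True,
    latency_s=1.8,
    tokens_used=950,
))
```

Records like this are what make the reliability targets in the build-sequence table measurable: tool-selection accuracy, per-step success rates, and token cost per task all fall out of the same log.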