# Separating shipped from speculation in agentic AI

Agentic AI has attracted an unusual volume of announcement-driven coverage that conflates early research, private betas, and production deployments. For teams evaluating whether and how to adopt agentic systems, the gap between "demonstrated in a controlled setting" and "running reliably in production" is the critical distinction. This article tracks what is actually deployed, what is in constrained pilots, and what remains primarily research as of mid-2026.

## What is shipping in production?

| Category | Status | Examples | Caveat |
| --- | --- | --- | --- |
| Code generation assistants | Widely deployed | GitHub Copilot, Cursor, Codeium | Bounded scope: suggestions within editor context |
| Customer service automation | Deployed at scale | Airline/telco tier-1 support | Narrow domains, high human fallback rates |
| RAG-based knowledge workers | Deployed in enterprises | Document Q&A, internal search | Quality depends heavily on retrieval quality |
| Code review and test generation | Deployed in CI pipelines | PR summarization, test scaffolding | Reliability varies by language/framework |
| Workflow automation with tool use | Constrained deployment | CRM data entry, scheduling | Requires constrained action spaces |

The common thread in what is actually working: constrained scope, bounded action spaces, and reliable human fallback paths. Agents that can take any action in an open-ended environment are not in reliable production use at scale.

## What is in controlled pilots

Several categories appear frequently in announcements but remain in constrained enterprise pilots:

- **Multi-agent research pipelines**: systems where specialized agents conduct literature review, generate hypotheses, and draft sections. Working in some pharmaceutical and academic contexts with heavy human review. Not autonomous.
- **Software development agents**: agents that can file issues, write code, and submit PRs. Working in limited scope (bug fixes in well-tested codebases); the failure rate remains high for open-ended feature work.
- **Autonomous browsing and data extraction**: agents that navigate web interfaces to collect data or complete forms. Technically feasible but brittle against UI changes.

## What remains primarily research

- Long-horizon planning with reliable goal decomposition (more than roughly five steps)
- Self-improving agents that modify their own reasoning processes
- Multi-agent systems coordinating effectively on complex, open-ended tasks without human intervention
- Reliable tool-use chaining across diverse, untested APIs

Understanding how agentic AI differs from generative AI clarifies the architectural and capability distinctions that matter when evaluating specific products.

## How to evaluate agentic AI claims

When evaluating an agentic AI product or announcement, apply these questions:

- **What is the action space?** Narrow (fill a form, send an email) versus open-ended (do anything on the web) largely determines reliability; see the sketch after this list.
- **What is the human fallback rate?** Systems that advertise 95% automation often see 40% fallback under production conditions.
- **What happens on failure?** Does the agent halt and escalate, or does it take incorrect actions silently?
- **What is the evaluation benchmark?** Demos on cherry-picked tasks, internal benchmarks, and published academic benchmarks have very different reliability implications.
- **What does production look like?** A deployment at one enterprise with heavy configuration is not evidence of general deployability.
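To make the first two questions concrete, here is a minimal sketch of the "constrained action space with human fallback" pattern that the production deployments above share. Everything in it (the tool names, the request shape, the escalation behavior) is an illustrative assumption, not the API of any specific product.

```python
# Minimal sketch: an agent restricted to an allowlisted action space,
# with escalation to a human for anything outside it. Illustrative only.

from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentResult:
    status: str   # "completed" or "escalated"
    detail: str


class ConstrainedAgent:
    """Agent that may only invoke an explicit allowlist of tools."""

    def __init__(self, tools: dict[str, Callable[[dict], str]]):
        self.tools = tools  # the entire action space, enumerated up front

    def act(self, action: str, args: dict) -> AgentResult:
        if action not in self.tools:
            # Requests outside the allowlist are never attempted; they go to a person.
            return AgentResult("escalated", f"'{action}' is outside the action space")
        try:
            return AgentResult("completed", self.tools[action](args))
        except Exception as exc:
            # Tool failures also route to a human instead of silent retries.
            return AgentResult("escalated", f"'{action}' failed: {exc}")


# Example action space: two narrow, well-understood operations.
agent = ConstrainedAgent({
    "update_crm_field": lambda a: f"set {a['field']} = {a['value']}",
    "schedule_meeting": lambda a: f"meeting booked for {a['time']}",
})

print(agent.act("update_crm_field", {"field": "phone", "value": "555-0100"}))
print(agent.act("delete_all_records", {}))  # escalates instead of executing
```

The design point is that the agent cannot take an action that was not enumerated up front; anything outside the allowlist, and any tool failure, becomes a human's decision rather than a silent guess.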
The most reliable current deployments share a pattern: they work within a small, well-defined action space, have clear failure modes, and route to humans for anything outside that scope.

## What distinguishes production-ready agent systems from demos?

The gap between an impressive agent demo and a production-ready agent system is primarily about failure handling, cost control, and evaluation methodology, not about the core AI capability.

Production-ready agents need explicit failure modes. A demo can retry indefinitely or fail gracefully with an error message. A production agent handling customer requests must distinguish between retriable failures (API timeout, rate limit), non-retriable failures (impossible request, missing permissions), and partial successes (completed three of five requested actions). Each failure type requires a different response: retry with backoff, inform the user with an explanation, or report partial completion with options for the remaining actions.

Cost control separates production from prototype. A demo agent can call an LLM API dozens of times to reason through a complex request. A production agent processing thousands of requests per day must bound its cost per request. We implement token budgets per request (a maximum on combined input and output tokens across all LLM calls), step budgets (a maximum number of tool calls per request), and latency budgets (a maximum wall-clock time before the system returns a response, even if incomplete); a minimal sketch of these budgets appears at the end of this section.

Evaluation methodology is the third gap. Demos are evaluated on cherry-picked examples. Production agents need systematic evaluation: accuracy on a representative test set, latency distribution across request types, cost per request by complexity tier, and failure rate by failure category. We build evaluation datasets of 200–500 requests covering the full range of expected use cases, and run the agent against this dataset after every code change. This catches regressions that human testing misses, because the test dataset exercises edge cases that testers rarely think to try.

The actual AI model capability is usually not the bottleneck. GPT-4-class models are capable enough for most agent tasks. The engineering around the model (tool integration, error handling, cost management, and systematic evaluation) determines whether the system is production-ready.
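As an illustration of the budgets described above, here is a minimal sketch of per-request token, step, and latency limits around an agent loop. The budget values, the `RequestBudget` class, and the placeholder plan are assumptions made for the example, not the implementation of any particular production system.

```python
# Minimal sketch of per-request budgets for an agent loop. The numbers and
# the placeholder plan are illustrative assumptions, not a real system.

import time


class BudgetExceeded(Exception):
    """Raised when a request would exceed its token, step, or latency budget."""


class RequestBudget:
    def __init__(self, max_tokens: int = 8_000, max_steps: int = 10,
                 max_seconds: float = 30.0):
        self.max_tokens = max_tokens    # combined input + output across all LLM calls
        self.max_steps = max_steps      # total tool calls allowed for this request
        self.max_seconds = max_seconds  # wall-clock limit, even if the answer is incomplete
        self.tokens_used = 0
        self.steps_used = 0
        self.started = time.monotonic()

    def charge_tokens(self, n: int) -> None:
        self.tokens_used += n
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded ({self.tokens_used}/{self.max_tokens})")

    def charge_step(self) -> None:
        self.steps_used += 1
        if self.steps_used > self.max_steps:
            raise BudgetExceeded(f"step budget exceeded ({self.steps_used}/{self.max_steps})")

    def check_latency(self) -> None:
        elapsed = time.monotonic() - self.started
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"latency budget exceeded ({elapsed:.1f}s/{self.max_seconds}s)")


def handle_request(request: str, budget: RequestBudget | None = None) -> str:
    """Agent loop that returns a partial answer instead of running unbounded."""
    budget = budget or RequestBudget()
    steps_completed: list[str] = []
    try:
        # Placeholder plan standing in for LLM-generated tool calls.
        for planned_step in ["look_up_account", "update_address", "draft_confirmation"]:
            budget.check_latency()
            budget.charge_step()
            budget.charge_tokens(1_500)  # stand-in for real token accounting
            steps_completed.append(planned_step)
        return f"completed: {steps_completed}"
    except BudgetExceeded as exc:
        # Report partial completion rather than failing silently.
        return f"partial ({exc}): {steps_completed}"


print(handle_request("update the customer's address and confirm by email"))
print(handle_request("same request, tighter limits", RequestBudget(max_tokens=2_000)))
```

Note that exhausting a budget returns a partial result with an explanation, mirroring the "report partial completion" failure mode described above rather than failing silently.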