The failure rate is not surprising — the failure patterns are predictable
Generative AI projects fail at high rates. Industry estimates vary, but the consensus from consulting firms and technology analysts (Gartner, McKinsey, BCG) converges on a 60–80% failure rate for AI projects broadly, with GenAI projects trending toward the higher end due to the novelty of the technology and the gap between demo capability and production reliability. The failure rate itself is not informative; it is a symptom. The useful question is: why do these projects fail, and can the failure be predicted before the investment is committed?
The answer to both questions is yes. GenAI project failures cluster around a small number of predictable patterns. Identifying these patterns before development begins — or during the first weeks of a project, before the investment accumulates — is the difference between a controlled decision to proceed or pivot, and an expensive discovery that the project was never going to work.
Pattern 1: The demo-to-production gap
A GenAI demo is easy to build and impressive to present. A RAG chatbot powered by GPT-4, connected to a company knowledge base, running in a Jupyter notebook — this can be built in days and shown to stakeholders within a week. The demo answers questions. The stakeholders are impressed. The project gets funded.
The demo did not address: authentication (who is allowed to ask what?), hallucination management (what happens when the model generates a confident but incorrect answer?), latency requirements (the demo tolerated 5-second response times; production requires sub-second responses), cost at scale (the demo processed 50 queries; production will process 50,000 per day at £0.03 per query), integration with existing systems (the demo ran standalone; production must integrate with the CRM, the ticketing system, and the internal SSO), monitoring (how does the team know when the model is producing bad output?), and update management (the knowledge base changes daily; how does the RAG index stay current?).
Each of these is a solvable engineering problem. Collectively, they represent 80–90% of the project’s total effort and cost. The demo represents 10–20%. Projects that are funded based on demo capability, without scoping the production engineering, are systematically underestimated — and they fail when the budget allocated for the demo-equivalent effort runs out before the production engineering is complete.
Pattern 2: Evaluation without ground truth
A GenAI model generates text. Is the text good? For many GenAI use cases — creative writing, marketing copy, conversational responses — “good” is subjective. There is no ground truth to compare against, no objective metric that separates a good output from a bad one.
This creates an evaluation problem that cascades through the project lifecycle. Without objective evaluation metrics, the team cannot measure whether changes improve the system (did the new prompt template produce better responses?). Without measurable improvement, iteration is blind — each change might help, might hurt, or might be neutral, and the team cannot tell which. Without measurable progress, the project cannot demonstrate ROI to stakeholders — and projects that cannot demonstrate ROI get cancelled.
The fix is to define evaluation criteria before development begins, even if the criteria are imperfect. Human evaluation protocols (have domain experts rate outputs on defined rubrics), proxy metrics (factual accuracy against source documents, relevance scores from retrieval, response completeness checks), and A/B testing frameworks (does the new version perform better than the old version on a held-out set of queries?) provide measurable signals that enable iterative improvement. The criteria need not be perfect — they need to be consistent enough to distinguish improvement from regression.
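A minimal sketch of what "consistent enough to distinguish improvement from regression" means in practice: score both versions on the same held-out query set and compare the means. The scores and the delta threshold below are hypothetical illustrations, not a prescribed rubric.

```python
from statistics import mean

def compare_versions(old_scores, new_scores, min_delta=0.05):
    """Compare two prompt/model versions on the same held-out query set.

    Scores are per-query ratings on a fixed rubric (e.g. 0.0-1.0 averages
    of expert ratings, or proxy metrics such as factual accuracy against
    source documents). The rubric need not be perfect, only applied
    consistently to both versions.
    """
    if len(old_scores) != len(new_scores):
        raise ValueError("versions must be scored on the same query set")
    delta = mean(new_scores) - mean(old_scores)
    if delta >= min_delta:
        return "improvement", delta
    if delta <= -min_delta:
        return "regression", delta
    return "neutral", delta

# Hypothetical rubric scores for five held-out queries.
verdict, delta = compare_versions(
    old_scores=[0.60, 0.70, 0.55, 0.65, 0.60],
    new_scores=[0.75, 0.72, 0.70, 0.68, 0.70],
)
```

The point of the threshold is honesty about noise: a delta smaller than the rubric's own variability is "neutral", not a win to report to stakeholders.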
We see teams skip this step because “GenAI output is inherently subjective.” The subjectivity is real, but it does not make evaluation impossible — it makes evaluation more effortful. Skipping evaluation does not avoid the subjectivity; it just defers the discovery that the system does not meet expectations until after launch.
Pattern 3: Scope inflation driven by capability fascination
GenAI models are impressively capable across a broad range of tasks. This breadth creates a scope inflation pattern: the project starts with a focused use case (answer customer questions about product features), and the scope expands as stakeholders discover the model can do other things (also handle returns, also generate product descriptions, also summarise customer feedback, also draft internal memos). Each expansion is individually reasonable. Collectively, they transform a focused project with a clear success criterion into an unfocused platform initiative with no clear success criterion.
The scope inflation pattern is particularly dangerous with GenAI because the demo for each new capability is easy — the model already “knows” how to do it, so adding the capability looks cheap. The production engineering for each new capability is not cheap: each new capability needs its own evaluation criteria, its own data sources, its own integration points, its own failure modes, and its own monitoring. The gap between “the model can do this in a demo” and “the model can do this reliably in production” is a per-capability gap, not a one-time gap.
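The per-capability arithmetic can be made concrete with a small sketch. The effort figures below are illustrative assumptions (roughly one week of demo work and eight weeks of production engineering per capability), not measured data.

```python
def total_effort_weeks(capabilities, demo_weeks=1, production_weeks=8):
    """Each capability pays the full demo-to-production cost.

    Illustrative assumption: the demo takes ~1 week per capability, but
    evaluation criteria, data sources, integration points, failure
    handling, and monitoring add ~8 more weeks per capability.
    """
    return capabilities * (demo_weeks + production_weeks)

focused = total_effort_weeks(1)   # the originally scoped use case: 9 weeks
inflated = total_effort_weeks(4)  # after three "cheap" additions: 36 weeks
```

The demos alone would suggest the three additions cost three extra weeks; the per-capability production gap makes the real figure closer to 27.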
Our recommendation: define the v1 scope as the minimum viable capability that delivers measurable value, and resist scope expansion until v1 is deployed, measured, and validated. The feasibility assessment approach provides the framework for scoping v1 correctly.
Pattern 4: Integration underestimation
GenAI models operate on text (or images, or code) — they consume input and produce output. Making that input/output cycle useful in a business context requires integration: feeding the model the right context (from databases, documents, APIs), delivering the model’s output to the right destination (CRM records, tickets, emails, documents), and ensuring the entire cycle operates within the organisation’s security, compliance, and access control framework.
Integration is consistently the most underestimated component of GenAI projects. In our experience, integration work — connecting to data sources, building retrieval pipelines, implementing output routing, handling authentication, and building monitoring — accounts for 50–70% of the total project effort. The model itself (selection, prompt engineering, fine-tuning) accounts for 15–25%. The remaining effort is evaluation and testing.
Projects that allocate budget based on the model effort — “fine-tuning should take two weeks, so the project is three weeks” — underestimate the total effort by 3–5×. The integration effort is where the schedule slips accumulate, because integration depends on the state of external systems that the GenAI team does not control.
Pattern 5: Cost model surprise
GenAI API costs scale linearly with usage. A GPT-4 application that costs £50 per day during testing costs £5,000 per day when 100× more users adopt it. The per-query cost (£0.01–£0.10 depending on the model, context length, and output length) seems trivial in isolation. At scale, it becomes a material operating expense.
Self-hosted models (Llama, Mistral, Phi) eliminate the per-query API cost but introduce GPU infrastructure cost — and the infrastructure cost for running a 70B-parameter model is not trivial (£2,000–£5,000 per month for cloud GPU inference infrastructure capable of serving production load).
The cost model must be projected to scale before the project is committed. A GenAI application that delivers £100,000 in annual value at a cost of £150,000 in annual inference costs is not viable — and the cost projection should have been done during feasibility, not discovered after launch.
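The projection itself is simple arithmetic, which is why discovering it after launch is inexcusable. A sketch using the example figures from this section (50,000 queries per day at £0.03 per query, against £100,000 in annual value):

```python
def annual_inference_cost(queries_per_day, cost_per_query):
    """Project annual API inference cost at target scale."""
    return queries_per_day * cost_per_query * 365

def is_viable(annual_value, annual_cost):
    """Crude viability check: delivered value must exceed operating cost."""
    return annual_value > annual_cost

# Example figures from the text: 50,000 queries/day at £0.03 per query.
cost = annual_inference_cost(50_000, 0.03)   # ~£547,500 per year
viable = is_viable(100_000, cost)            # not viable at £100k annual value
```

Running the same projection against a self-hosted alternative (£2,000–£5,000 per month of GPU infrastructure, i.e. £24,000–£60,000 per year) is how the API-versus-self-hosted decision should be made during feasibility.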
What prevents these failures
Every pattern described above is preventable through structured project assessment at the start — before the demo, before the funding decision, before the development commitment. The assessment evaluates: scope definition and success criteria, evaluation methodology and metrics, integration requirements and effort, cost projection at target scale, and the demo-to-production gap for each capability.
If your organisation is considering or has started a GenAI project and the assessment described above has not been conducted, a GenAI Feasibility Assessment evaluates the project against these failure patterns and identifies the specific risks before the investment accumulates. Our generative AI practice focuses on preventing these predictable failures.