The failure rate is not surprising; the failure patterns are predictable

Generative AI projects fail at distinctly higher rates than conventional ML deployments, and for different reasons. The technology is newer, the gap between a working demo and a reliable production system is wider, and the failure modes are structurally distinct: hallucination, evaluation without ground truth, and uncontrolled scope inflation have no direct equivalent in classical AI projects. These GenAI-specific failure patterns push projects toward the upper end of the broader enterprise AI failure rate range because the production engineering challenges are harder to anticipate from a prototype than they are in traditional ML work.

The failure rate itself is not informative; it is a symptom. The useful question is: why do GenAI projects specifically fail, and can the failure be predicted before the investment is committed? The causes are identifiable, and the failure can be predicted. GenAI project failures cluster around a small number of predictable patterns. Identifying these patterns before development begins (or during the first weeks of a project, before the investment accumulates) is the difference between a controlled decision to proceed or pivot and an expensive discovery that the project was never going to work.

Pattern anatomy

Each failure pattern below follows the same structure: what the pattern is (the structural mistake teams make), how it manifests (the observable symptoms during the project), and what prevents it (the specific action or assessment that eliminates or mitigates the risk before development begins). This consistent structure makes the patterns usable as a diagnostic checklist: if the prevention condition is not met for any pattern, the project carries that specific risk.

Pattern 1: Why does the demo-to-production gap kill projects?

A GenAI demo is easy to build and impressive to present.
A RAG chatbot powered by GPT-4, connected to a company knowledge base, running in a Jupyter notebook: this can be built in days and shown to stakeholders within a week. The demo answers questions. The stakeholders are impressed. The project gets funded.

The demo did not address: authentication (who is allowed to ask what?), hallucination management (what happens when the model generates a confident but incorrect answer?), latency requirements (the demo tolerated 5-second response times; production requires responses in under 1 second), cost at scale (the demo processed 50 queries; production will process 50,000 per day at £0.03 per query, roughly £1,500 per day), integration with existing systems (the demo ran standalone; production must integrate with the CRM, the ticketing system, and the internal SSO), monitoring (how does the team know when the model is producing bad output?), and update management (the knowledge base changes daily; how does the RAG index stay current?).

Each of these is a solvable engineering problem. Collectively, in our experience across GenAI engagements, they represent 80–90% of the project’s total effort and cost. The demo represents 10–20% (an observed range, not a benchmarked industry rate). Projects that are funded based on demo capability, without scoping the production engineering, are systematically underestimated, and they fail when the budget allocated for the demo-equivalent effort runs out before the production engineering is complete.

Pattern 2: Evaluation without ground truth

A GenAI model generates text. Is the text good? For many GenAI use cases (creative writing, marketing copy, conversational responses), “good” is subjective. There is no ground truth to compare against, no objective metric that separates a good output from a bad one. This creates an evaluation problem that cascades through the project lifecycle.
Without objective evaluation metrics, the team cannot measure whether changes improve the system (did the new prompt template produce better responses?). Without measurable improvement, iteration is blind: each change might help, might hurt, or might be neutral, and the team cannot tell which. Without measurable progress, the project cannot demonstrate ROI to stakeholders, and projects that cannot demonstrate ROI get cancelled.

The fix is to define evaluation criteria before development begins, even if the criteria are imperfect. Human evaluation protocols (have domain experts rate outputs on defined rubrics), proxy metrics (factual accuracy against source documents, relevance scores from retrieval, response completeness checks), and A/B testing frameworks (does the new version perform better than the old version on a held-out set of queries?) provide measurable signals that enable iterative improvement. The criteria need not be perfect; they need to be consistent enough to distinguish improvement from regression.

We see teams skip this step because “GenAI output is inherently subjective.” The subjectivity is real, but it does not make evaluation impossible; it makes evaluation more effortful. Skipping evaluation does not avoid the subjectivity; it just defers the discovery that the system does not meet expectations until after launch.

Pattern 3: Scope inflation driven by capability fascination

GenAI models are impressively capable across a broad range of tasks. This breadth creates a scope inflation pattern: the project starts with a focused use case (answer customer questions about product features), and the scope expands as stakeholders discover the model can do other things (also handle returns, also generate product descriptions, also summarise customer feedback, also draft internal memos). Each expansion is individually reasonable.
Collectively, they transform a focused project with a clear success criterion into an unfocused platform initiative with no clear success criterion.

The scope inflation pattern is particularly dangerous with GenAI because the demo for each new capability is easy: the model already “knows” how to do it, so adding the capability looks cheap. The production engineering for each new capability is not cheap: each new capability needs its own evaluation criteria, its own data sources, its own integration points, its own failure modes, and its own monitoring. The gap between “the model can do this in a demo” and “the model can do this reliably in production” is a per-capability gap, not a one-time gap.

Our recommendation: define the v1 scope as the minimum viable capability that delivers measurable value, and resist scope expansion until v1 is deployed, measured, and validated. The feasibility assessment approach provides the framework for scoping v1 correctly.

Pattern 4: Integration underestimation

GenAI models operate on text (or images, or code): they consume input and produce output. Making that input/output cycle useful in a business context requires integration: feeding the model the right context (from databases, documents, APIs), delivering the model’s output to the right destination (CRM records, tickets, emails, documents), and ensuring the entire cycle operates within the organisation’s security, compliance, and access control framework.

Integration is consistently the most underestimated component of GenAI projects. In our experience, integration work (connecting to data sources, building retrieval pipelines, implementing output routing, handling authentication, and building monitoring) accounts for 50–70% of the total project effort. The model itself (selection, prompt engineering, fine-tuning) accounts for 15–25%. The remaining effort is evaluation and testing.
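
The effort split above implies a quick sanity check on any estimate built from model work alone. The sketch below is a minimal illustration, not a planning tool: it assumes the 15–25% model share quoted above holds for the project in question, and the function name `total_effort_range` and the two-week input are hypothetical.

```python
def total_effort_range(model_weeks, model_share_low=0.15, model_share_high=0.25):
    """Return (low, high) total project effort in weeks implied by the model effort.

    Assumption: model work is between model_share_low and model_share_high
    of the total effort (the 15-25% range quoted in the text).
    """
    if model_weeks <= 0:
        raise ValueError("model_weeks must be positive")
    if not 0 < model_share_low <= model_share_high <= 1:
        raise ValueError("shares must satisfy 0 < low <= high <= 1")
    # A larger model share implies a smaller total, so the bounds swap.
    return model_weeks / model_share_high, model_weeks / model_share_low


if __name__ == "__main__":
    # Hypothetical input: two weeks of model work (selection, prompts, fine-tuning).
    low, high = total_effort_range(2.0)
    print(f"Implied total effort: {low:.1f} to {high:.1f} weeks")
```

With a two-week model estimate, the implied total is roughly 8 to 13 weeks, which is why a plan sized around the model work alone tends to fall short.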
Projects that allocate budget based on the model effort (“fine-tuning should take two weeks, so the project is three weeks”) underestimate the total effort by 3–5× in our experience across GenAI engagements (an observed range, not a benchmarked industry rate). Integration is where schedule slips accumulate, because it depends on the state of external systems that the GenAI team does not control.

Pattern 5: Cost model surprise

GenAI API costs scale linearly with usage. As an illustrative example from our GenAI engagements (planning heuristic, not a benchmarked industry rate): a GPT-4 application that costs £50 per day during testing costs £5,000 per day when 100× more users adopt it. The per-query cost (£0.01–£0.10 depending on the model, context length, and output length) seems trivial in isolation. At scale, it becomes a material operating expense.

Self-hosted models (Llama, Mistral, Phi) eliminate the per-query API cost but introduce GPU infrastructure cost, and the infrastructure cost for running a 70B-parameter model is not trivial (£2,000–£5,000 per month for cloud GPU inference infrastructure capable of serving production load).

The cost model must be projected at target scale before the project is committed. A GenAI application that delivers £100,000 in annual value at a cost of £150,000 in annual inference costs is not viable, and the cost projection should have been done during feasibility, not discovered after launch.

GenAI project preflight checklist

Before committing budget and timeline to a GenAI project, the team should be able to confirm every item below. Any unchecked item represents a known failure risk.

Production requirements scoped beyond the demo. Authentication, latency targets, cost at scale, monitoring, and update management have been identified and estimated, not deferred as “we’ll figure it out later.”

Evaluation criteria defined before development begins.
Human evaluation rubrics, proxy metrics (factual accuracy, retrieval relevance, completeness), or A/B testing frameworks are in place to distinguish improvement from regression.

Ground truth or reference data available for evaluation. Domain experts have been identified to rate outputs, or source documents exist against which factual accuracy can be verified.

v1 scope locked to a single, minimum viable capability. The project delivers one focused use case with a clear success criterion; scope expansion is deferred until v1 is deployed and validated.

All integration points mapped with effort estimates. Data sources, retrieval pipelines, output destinations, SSO, and compliance requirements are documented, and integration work is estimated at 50–70% of total project effort (planning heuristic from our engagements, not a benchmarked industry rate).

Cost model projected at target user scale. Per-query API costs or self-hosted GPU infrastructure costs have been calculated at production volume, and the projected operating cost is justified by the projected business value.

Demo-to-production gap quantified per capability. Each capability the system will support has been assessed for the engineering effort required to move from demo to production, not assumed to be trivial because the demo works.

What prevents these failures

Every pattern described above is preventable through structured project assessment at the start: before the demo, before the funding decision, before the development commitment. The assessment evaluates: scope definition and success criteria, evaluation methodology and metrics, integration requirements and effort, cost projection at target scale, and the demo-to-production gap for each capability.

Organisations that skip this step and discover these failure patterns mid-project face the same choice: absorb the sunk cost and reset, or continue investing in a trajectory the data already shows will not deliver.
A GenAI Feasibility Assessment identifies the specific risks before the investment accumulates.
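
As one way to make the cost projection step concrete, the sketch below annualises inference cost and compares it with projected value. It is a minimal illustration under stated assumptions: the function names are hypothetical, the inputs reuse figures quoted earlier in this article (50,000 queries per day at £0.03, £100,000 in annual value, £2,000–£5,000 per month for self-hosted inference) combined purely for illustration, and it ignores staffing, integration, and operations costs.

```python
def annual_api_cost(queries_per_day, cost_per_query, days_per_year=365):
    """Annual API spend, assuming cost scales linearly with query volume."""
    return queries_per_day * cost_per_query * days_per_year


def annual_self_hosted_cost(monthly_infra_cost):
    """Annual GPU infrastructure spend for a self-hosted model (infra only)."""
    return monthly_infra_cost * 12


def is_viable(annual_value, annual_cost):
    """Viable on inference cost alone only if value exceeds cost."""
    return annual_value > annual_cost


if __name__ == "__main__":
    value = 100_000  # illustrative annual value from the text
    api = annual_api_cost(50_000, 0.03)
    hosted_high = annual_self_hosted_cost(5_000)  # upper end of quoted range
    print(f"API: £{api:,.0f}/year, viable: {is_viable(value, api)}")
    print(f"Self-hosted: £{hosted_high:,.0f}/year, viable: {is_viable(value, hosted_high)}")
```

Running the same comparison at several volumes (pilot, target, and a multiple of target) during feasibility shows where the operating cost overtakes the projected value before any budget is committed.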