Why Generative AI Projects Fail Before They Launch

The failure rate is not surprising — the failure patterns are predictable

Generative AI projects fail at distinctly higher rates than conventional ML deployments — and for different reasons. The technology is newer, the gap between a working demo and a reliable production system is wider, and the failure modes are structurally distinct: hallucination, evaluation without ground truth, and uncontrolled scope inflation have no direct equivalent in classical AI work. These GenAI-specific patterns push projects toward the upper end of the broader enterprise AI failure rate range because the production engineering challenges are harder to anticipate from a prototype than they are in traditional ML pipelines. The headline failure rate is a symptom, not a diagnosis. The useful question is narrower: why do GenAI projects specifically fail, and can the failure be predicted before the investment is committed?

The causes can be identified, and the failure can be predicted. GenAI project failures cluster around a small number of predictable patterns, and each is attributable: someone approved the scope, someone accepted the data, someone chose the architecture, someone signed off without success criteria. Identifying these patterns before development begins, or during the first weeks of a project before the investment accumulates, is the difference between a controlled decision to proceed or pivot and an expensive discovery that the project was never going to work.

Each pattern below follows the same shape: what the structural mistake is, how it manifests during the project, and what specific assessment prevents it before development begins. Treat the set as a diagnostic checklist — if the prevention condition is not met, the project carries that specific risk.

Pattern 1: The demo-to-production gap

A GenAI demo is easy to build and impressive to present. A RAG chatbot powered by GPT-4, connected to a company knowledge base, running in a notebook — this can be built in days and shown to stakeholders within a week. The demo answers questions. The stakeholders are impressed. The project gets funded.

The demo did not address: authentication (who is allowed to ask what?), hallucination management (what happens when the model generates a confident but incorrect answer?), latency requirements (the demo tolerated five-second response times; production requires sub-one-second), cost at scale (the demo processed fifty queries; production will process tens of thousands per day at a few pence each), integration with existing systems (the demo ran standalone; production must reach the CRM, the ticketing system, and the internal SSO), monitoring (how does the team know when the model is producing bad output?), and update management (the knowledge base changes daily; how does the RAG index stay current?).

Each of these is a solvable engineering problem. Collectively, in our experience across GenAI engagements, they represent roughly 80–90% of the project’s total effort and cost (a pattern observed in our engagements, not a benchmarked industry rate). The demo represents the remaining 10–20%. Projects funded on demo capability without scoping the production engineering are systematically underestimated, and they fail when a budget sized for demo-equivalent effort runs out before the production engineering is complete.

Pattern 2: Evaluation without ground truth

A GenAI model generates text. Is the text good? For many GenAI use cases — drafting, summarisation, conversational responses — “good” is partly subjective. There is no clean ground truth to compare against, and no single objective metric that separates a strong output from a weak one.

This creates an evaluation problem that cascades through the project lifecycle. Without objective evaluation criteria the team cannot measure whether a change improves the system; without measurable improvement, iteration is blind; without measurable progress, the project cannot demonstrate ROI to stakeholders — and projects that cannot demonstrate ROI get cancelled. The absence is upstream of every later failure.

The fix is to define evaluation criteria before development begins, even if the criteria are imperfect. Human-evaluation protocols where domain experts rate outputs on a defined rubric, proxy metrics such as factual accuracy against source documents, retrieval relevance, or completeness checks, and A/B testing frameworks that compare a candidate version against the incumbent on a held-out query set — all of these produce signals consistent enough to distinguish improvement from regression. The criteria do not need to be perfect. They need to be stable.
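
One way to keep such criteria stable is to score every version on the same held-out query set with the same rubric. The sketch below is a minimal illustration, assuming domain experts supply rubric scores on a 1-5 scale. The function name compare_versions, the min_gain threshold, and the sample scores are hypothetical and chosen only for the example.

```python
from statistics import mean

def compare_versions(incumbent_scores, candidate_scores, min_gain=0.2):
    """Compare rubric scores for two versions on the same held-out queries.

    Both arguments map query id -> expert rubric score (assumed 1-5 scale).
    min_gain is a hypothetical threshold below which a difference counts as noise.
    """
    if incumbent_scores.keys() != candidate_scores.keys():
        raise ValueError("both versions must be scored on the same query set")
    # Paired per-query differences keep the comparison on identical inputs.
    diffs = [candidate_scores[q] - incumbent_scores[q] for q in incumbent_scores]
    delta = mean(diffs)
    if delta >= min_gain:
        verdict = "improvement"
    elif delta <= -min_gain:
        verdict = "regression"
    else:
        verdict = "no clear change"
    return {"mean_delta": round(delta, 3), "verdict": verdict}

# Illustrative scores only, not real evaluation data.
incumbent = {"q1": 3, "q2": 4, "q3": 2, "q4": 3}
candidate = {"q1": 4, "q2": 4, "q3": 3, "q4": 3}
result = compare_versions(incumbent, candidate)
print(result)
```

The threshold matters less than its consistency: a rough rubric applied to the same queries every iteration gives a usable signal, whereas a refined rubric applied to a different sample each time does not.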

We see teams skip this step because “GenAI output is inherently subjective.” The subjectivity is real, but it does not make evaluation impossible — it makes evaluation more effortful. Skipping evaluation does not avoid the subjectivity. It defers the discovery that the system does not meet expectations until after launch, when the cost of finding out is highest.

Pattern 3: Scope inflation driven by capability fascination

GenAI models are impressively capable across a broad range of tasks, and that breadth creates a specific failure pattern. The project starts with a focused use case — answer customer questions about product features — and the scope expands as stakeholders discover the model can do other things too: handle returns, generate product descriptions, summarise customer feedback, draft internal memos. Each expansion is individually reasonable. Collectively, they transform a focused project with a clear success criterion into an unfocused platform initiative with no clear success criterion at all.

This is particularly dangerous with GenAI because the demo for each new capability is cheap — the model already “knows” how to do it, so adding the capability looks free. The production engineering for each new capability is not free: every capability needs its own evaluation criteria, its own data sources, its own integration points, its own failure modes, and its own monitoring. The gap between “the model can do this in a demo” and “the model can do this reliably in production” is a per-capability gap, not a one-time gap.

The corrective is unromantic. Define the v1 scope as the minimum viable capability that delivers measurable value, and resist scope expansion until v1 is deployed, measured, and validated. The feasibility assessment approach we recommend provides the framework for scoping v1 correctly, and it forces the scope conversation to happen before contracts are signed rather than after the third demo.

Pattern 4: Multi-agent over-engineering

A close cousin of scope inflation, but structurally different. Multi-agent architectures — planner agents handing off to retrieval agents handing off to tool-using agents — are a legitimate pattern when the problem genuinely requires decomposition, branching reasoning, and tool orchestration. They are also the most over-applied architecture in the current GenAI landscape, deployed to problems that a single prompt, a deterministic rule, or a small function call would have solved more reliably and at a fraction of the cost.

The symptom set is recognisable. Latency that no amount of caching brings under target because every user query triggers three to five model calls in sequence. Failure modes that defy debugging because the bug is an emergent interaction between agents rather than a defect in any one of them. Evaluation that becomes impossible because the team can no longer say what “correct behaviour” means at the system level. Cost projections that surprise everyone because each agent call multiplies the token bill.

The prevention is a complexity check before architecture is locked. State the problem in one sentence. If the sentence does not contain branching decisions, tool selection over a non-trivial tool set, or multi-step planning that a deterministic workflow cannot represent, the right architecture is probably a single well-scoped model call wrapped in conventional software. Multi-agent frameworks are a tool for problems that need them. When LangGraph or CrewAI get pulled in because the team wants to use them, the project has already taken on risk that the problem did not justify.
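
The complexity check can be written down as three yes/no questions. The sketch below is a hypothetical encoding of that check, not a rule from any framework: the class and function names are ours, and the 0.8 seconds per model call is an assumed figure used only to show how sequential calls add up against a sub-one-second target.

```python
from dataclasses import dataclass

@dataclass
class ProblemProfile:
    """Answers to the three complexity-check questions (hypothetical field names)."""
    has_branching_decisions: bool
    selects_from_nontrivial_tool_set: bool
    needs_planning_beyond_fixed_workflow: bool

def suggest_architecture(profile: ProblemProfile) -> str:
    """Return a starting-point recommendation, not a verdict."""
    needs_orchestration = (
        profile.has_branching_decisions
        or profile.selects_from_nontrivial_tool_set
        or profile.needs_planning_beyond_fixed_workflow
    )
    if needs_orchestration:
        return "consider multi-agent orchestration"
    return "single model call in conventional software"

def sequential_latency(calls: int, seconds_per_call: float) -> float:
    """Latency floor when model calls run strictly one after another."""
    return calls * seconds_per_call

# A product FAQ bot: no branching, no tool selection, no open-ended planning.
faq_bot = ProblemProfile(False, False, False)
print(suggest_architecture(faq_bot))

# Four sequential calls at an assumed 0.8 s each exceed a 1 s target.
print(sequential_latency(calls=4, seconds_per_call=0.8))
```

The latency function ignores network overhead and retries, so the real floor is higher. Its only purpose is to show that caching cannot remove a cost created by the number of sequential calls.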

Pattern 5: Integration underestimation

GenAI models operate on text — or images, or code — consuming input and producing output. Making that cycle useful in a business context requires integration: feeding the model the right context from databases, documents, and APIs; delivering the output to the right destination such as CRM records, tickets, emails, documents; and ensuring the cycle operates within the organisation’s security, compliance, and access-control framework.

Integration is consistently the most underestimated component of GenAI projects. In our experience, integration work (connecting to data sources, building retrieval pipelines, implementing output routing, handling authentication, and instrumenting monitoring) accounts for roughly 50–70% of total project effort (a pattern observed in our engagements, not a benchmarked industry rate). The model work itself (selection, prompt engineering, light fine-tuning) accounts for 15–25%. The remainder is evaluation and testing.

Projects that allocate budget based on model effort (“fine-tuning should take two weeks, so the project is three”) underestimate total effort by roughly 3–5× in our GenAI engagements (an observed pattern, not a benchmarked rate). The integration phase is where schedule slips accumulate, because integration depends on the state of external systems the GenAI team does not control: an SSO migration that arrives mid-project, a CRM schema that changes under the integration, a retrieval index that needs nightly rebuilds nobody scoped.
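
The arithmetic behind that underestimate can be checked directly. The sketch below scales the two-week model estimate from the example up to a total, using the 15–25% model-work share quoted above. The function name and the three-week planned figure come from the example; nothing here is a benchmarked rate.

```python
def total_effort_range(model_weeks, model_share_low=0.15, model_share_high=0.25):
    """Scale model effort up to a project total using the assumed model-work share."""
    # A larger share for model work implies a smaller total, so the bounds swap.
    return model_weeks / model_share_high, model_weeks / model_share_low

low, high = total_effort_range(2)
planned_weeks = 3
print(f"implied total: {low:.1f} to {high:.1f} weeks")
print(f"underestimate: {low / planned_weeks:.1f}x to {high / planned_weeks:.1f}x")
# implied total: 8.0 to 13.3 weeks
# underestimate: 2.7x to 4.4x
```

A three-week plan against an implied 8 to 13 weeks lands in the same region as the rough 3–5× range, which is what one would expect if the two observations describe the same projects.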

Pattern 6: Cost model surprise

GenAI API costs scale linearly with usage, which sounds benign and is not. As an illustrative planning example from our GenAI engagements (a planning heuristic, not a benchmarked industry rate): a GPT-4-class application that costs around £50 per day during internal testing can cost on the order of £5,000 per day once usage scales 100× through general rollout. The per-query cost, typically a few pence depending on the model, context length, and output length, looks trivial in isolation. At scale, it becomes a material operating expense that has to be funded out of the same business case that justified the project.

Self-hosted models such as Llama, Mistral, or Phi eliminate the per-query API charge but introduce GPU infrastructure cost. Running a 70B-parameter model for production load is not cheap: typically a four-figure monthly bill for cloud GPU inference capable of serving sustained traffic, before redundancy and burst capacity. Switching from API to self-hosted shifts the cost curve; it does not flatten it.
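
A rough break-even volume shows where the two cost curves cross. The sketch below assumes a flat monthly GPU bill and a fixed per-query API price. Both input figures are illustrative placeholders, and the function name is hypothetical.

```python
def break_even_queries_per_month(gpu_monthly_gbp: int, pence_per_query: int) -> float:
    """Monthly query volume at which a flat GPU bill equals per-query API spend."""
    return gpu_monthly_gbp * 100 / pence_per_query

# Assumed inputs: a £5,000 monthly GPU bill and 5p per API query.
volume = break_even_queries_per_month(gpu_monthly_gbp=5_000, pence_per_query=5)
print(f"break-even at about {volume:,.0f} queries per month")
```

The calculation deliberately omits redundancy, burst capacity, engineering time, and the throughput ceiling of a single deployment, all of which push the real break-even point higher.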

The cost model must be projected to scale before the project is committed. A GenAI application that delivers £100,000 in annual value at £150,000 in annual inference cost is not viable — and the projection should have been done during feasibility, not discovered after launch. This is one of the most common GenAI-specific failures because conventional ML projects do not have a per-query operating cost of this magnitude. The instinct to ignore the cost line is inherited from a world where it did not exist.
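
The projection itself is a few lines of arithmetic. The sketch below reproduces the £50 to £5,000 per day example under an assumed 5p per query and an assumed 1,000 queries per day in testing, then applies the viability comparison from the paragraph above. Function names are hypothetical, and costs are held in whole pence to avoid floating-point rounding.

```python
def daily_cost_gbp(queries_per_day: int, pence_per_query: int) -> float:
    """Daily API spend in pounds for a given volume and per-query price."""
    return queries_per_day * pence_per_query / 100

def is_viable(annual_value_gbp: float, annual_cost_gbp: float) -> bool:
    """Viable only if projected value exceeds projected inference cost."""
    return annual_value_gbp > annual_cost_gbp

testing = daily_cost_gbp(1_000, 5)      # assumed internal-testing volume
rollout = daily_cost_gbp(100_000, 5)    # the same price at 100x the volume
print(testing, rollout, rollout * 365)

# The example from the text: £100,000 of value against £150,000 of inference cost.
print(is_viable(100_000, 150_000))
```

Running the same function at the target user volume during feasibility, rather than at the testing volume, is the whole of the check.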

GenAI project preflight checklist

Before committing budget and timeline to a GenAI project, the team should be able to confirm every item below. Any unchecked item represents a known failure risk that an honest feasibility assessment would surface.

Production requirements scoped beyond the demo. Authentication, latency targets, cost at scale, monitoring, and update management have been identified and estimated — not deferred as “we’ll figure it out later.”
Evaluation criteria defined before development begins. Human-evaluation rubrics, proxy metrics (factual accuracy, retrieval relevance, completeness), or A/B testing frameworks are in place to distinguish improvement from regression.
Ground truth or reference data available for evaluation. Domain experts have been identified to rate outputs, or source documents exist against which factual accuracy can be verified.
v1 scope locked to a single, minimum viable capability. The project delivers one focused use case with a clear success criterion — scope expansion is deferred until v1 is deployed and validated.
Architecture matched to problem complexity. Multi-agent or multi-step orchestration is justified by branching decisions or tool selection the problem actually requires; otherwise the architecture is a single model call wrapped in conventional software.
All integration points mapped with effort estimates. Data sources, retrieval pipelines, output destinations, SSO, and compliance requirements are documented, and integration work is estimated at roughly 50–70% of total project effort (a planning heuristic observed in our engagements, not a benchmarked rate).
Cost model projected at target user scale. Per-query API costs or self-hosted GPU infrastructure costs have been calculated at production volume, and the projected operating cost is justified by the projected business value.
Demo-to-production gap quantified per capability. Each capability the system will support has been assessed for the engineering effort required to move it from demo to production — not assumed trivial because the demo works.
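
For teams that want the checklist in a reviewable form, it can be held as data and queried for open items. The sketch below is one hypothetical way to do that. The item names paraphrase the list above, and the sample answers are invented for illustration.

```python
# Item names paraphrase the checklist; the structure is a hypothetical convenience.
PREFLIGHT_ITEMS = [
    "production requirements scoped beyond the demo",
    "evaluation criteria defined before development",
    "ground truth or reference data available",
    "v1 scope locked to one capability",
    "architecture matched to problem complexity",
    "integration points mapped with effort estimates",
    "cost model projected at target scale",
    "demo-to-production gap quantified per capability",
]

def open_risks(confirmed):
    """Return checklist items that have not been confirmed, in checklist order."""
    return [item for item in PREFLIGHT_ITEMS if item not in confirmed]

# Invented example: a project that has confirmed three of the eight items.
confirmed = {
    "production requirements scoped beyond the demo",
    "v1 scope locked to one capability",
    "cost model projected at target scale",
}
risks = open_risks(confirmed)
print(f"{len(risks)} open risks")
for item in risks:
    print(" -", item)
```

Each item returned is a named risk the project is carrying, which is more useful in a funding conversation than an overall readiness score.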

Who owns the failure

A useful test, when looking at a stalled GenAI project, is to ask which patterns above are engineering failures and which are buyer-side scoping failures. Most of them are buyer-side. Infeasible scope (scoping a GenAI project to replace human judgement in a domain where the model cannot match human performance) is a decision made before any engineer touches the code. The absence of success criteria is the same: someone agreed to fund a project without defining what “done” would look like. Architecture over-engineering is usually a joint failure, but the budget holder signed it off.

The engineering team owns the demo-to-production gap once they have the requirements, and they own evaluation infrastructure once the criteria are defined. They do not own the upstream scoping decisions that shaped the project before they arrived. Naming this accurately matters, because the corrective for a scoping failure is not a better model — it is a feasibility conversation that should have happened earlier.

FAQ

What failure patterns are specific to generative AI projects, as opposed to AI projects in general?

The GenAI-specific patterns are the demo-to-production gap (wider than in classical ML because GenAI demos are unusually easy and production is unusually hard), evaluation without ground truth (text and image outputs lack the objective metrics classical ML benefits from), scope inflation driven by capability breadth (one model can do many things, which invites unfocused expansion), multi-agent over-engineering (a category of architecture mistake that did not exist in classical ML), and per-query operating cost at scale (classical ML rarely has a marginal cost per inference of meaningful size). General AI project failure patterns — data quality, organisational readiness, integration complexity — apply too, but the patterns above are the ones the team will not have encountered in earlier ML work.

Why does a GenAI prototype that works on curated data fail on production data?

The prototype was built and demonstrated on a hand-picked corpus that does not match the distribution, noise level, or edge cases of production data. Production documents are inconsistently formatted, contain conflicting information, reference systems the prototype was never tested against, and arrive at volumes that change retrieval behaviour. The prototype’s apparent quality came partly from the curation, not from the model — and once that curation is removed, the failure rate visible in production was always there, just hidden.

When does multi-agent over-engineering kill a GenAI project that simple automation would have solved?

When the problem can be stated in one sentence without branching decisions, without selection across a non-trivial tool set, and without multi-step planning that a deterministic workflow cannot represent — yet the team has built three or more cooperating agents anyway. The symptoms are latency that cannot be cached down to target, failure modes that defy debugging because they are emergent interactions, evaluation that becomes impossible at the system level, and cost projections that surprise everyone because each user query fans out into several model calls.

How do infeasible-scope failures show up before launch, and who is accountable when they do?

They show up as success criteria that read like “the model should match an experienced human’s judgement,” as use cases where the cost of a wrong answer is high enough that no current model’s error rate is acceptable, and as scopes that include tasks the model cannot do reliably even in a clean demo. Accountability sits with the buyer who approved the scope — the engineering team can build what is possible, but cannot make an infeasible scope feasible by working harder. This is why feasibility assessment belongs upstream of contracting.

Why do GenAI projects launch without measurable success criteria, and what should those look like?

They launch without criteria because the demo was persuasive and the assumption was that “GenAI output is inherently subjective.” It is partly subjective, but not unmeasurable. Workable criteria combine a human-evaluation rubric (domain experts rate outputs against a defined scale), proxy metrics tied to the use case (factual accuracy against source documents for RAG, task completion rate for agentic workflows, retrieval relevance scores), and a held-out evaluation set used consistently across iterations. The criteria need not be perfect; they need to be stable enough that the team can tell improvement from regression.

Which GenAI failure modes are attributable to the buyer’s scoping decision rather than the engineering team?

Infeasible scope, the absence of success criteria, scope inflation that was approved rather than resisted, and the cost-model surprise that comes from not modelling per-query cost at projected user volume. These are decisions made before engineering begins, or alongside it at a level engineering does not control. The demo-to-production gap is jointly owned once requirements are clear, and evaluation infrastructure is engineering’s once criteria are defined — but the upstream scoping failures are not absorbed by working the engineering team harder.

What prevents these failures

Every pattern described above is preventable through structured project assessment at the start — before the demo, before the funding decision, before the development commitment. The assessment evaluates scope definition and success criteria, evaluation methodology and metrics, architecture matching to problem complexity, integration requirements and effort, cost projection at target scale, and the demo-to-production gap for each capability the system will support.

Organisations that skip this step and discover the patterns mid-project face the same choice: absorb the sunk cost and reset, or continue investing in a trajectory the data already shows will not deliver. We have seen both decisions made; the first is uncomfortable and the second is more expensive. A GenAI Feasibility Assessment identifies the specific risks before the investment accumulates, which is the only point at which the cost of finding out is still small.