What should an AI POC actually prove?

A demo shows what an AI model can do. A proof of concept proves whether an AI model should be built for production. The distinction matters: demos convince stakeholders; POCs inform decisions. A POC that does not answer the question “should we invest in building this for production?” has failed, regardless of how impressive the demo looks.

The question “should we invest?” decomposes into four sub-questions. Is the technical approach feasible with our data? What does success look like, and can we measure it? What is the expected return on the production investment? And what value does the POC itself deliver, independent of the production decision? Those four questions define the four sections that every AI POC report must contain. The rest of this piece walks through each section, with a worked failure case at the end.

Demo, prototype, POC: which fails where

Teams use the three words interchangeably, and the conflation is where most pilots start to go wrong. A demo is built to convince: it runs on the data that makes the model look best, and it fails the moment a stakeholder asks about scale or edge cases. A prototype is built to explore the design space: it tests interactions, UX flow, or an architectural sketch, and it fails the moment you need to evaluate quantitative model behaviour against a baseline. A POC is built to inform a go/no-go decision: it fails the moment it cannot produce defensible numbers for the four sections below.

| Artifact | What it proves | Where it fails |
| --- | --- | --- |
| Demo | The model can produce a plausible output on chosen data | Stakeholder pressure to “just ship it” without integration or scale evidence |
| Prototype | The end-to-end shape of the system is coherent | Asked to substitute for a POC report; has no defended thresholds |
| POC | The production investment is justified (or not) | Treated as a demo with extra slides; no kill criterion defined |

The classification matters because each artifact has a different kill criterion.
A demo dies when the audience disengages. A prototype dies when the next iteration adds no information. A POC dies, or proceeds, when it answers the four sections honestly.

Section 1: POC structure (what was tested and how)

The POC structure section documents the technical approach, the data used, the evaluation methodology, and the scope boundary. It is the reproducibility section: anyone reading it should be able to understand exactly what was tested, what was not tested, and what assumptions underlie the results.

Technical approach. What model architecture was used, what training or configuration was applied, and what alternatives were considered and rejected. The architecture choice should include rationale: “We used a fine-tuned BERT classifier because the task is multi-label text classification with domain-specific terminology. We considered GPT-4 with few-shot prompting but the per-inference cost at the client’s volume (100,000 classifications per day) exceeded budget by roughly 5×.” (The 5× figure is illustrative, drawn from our consulting engagements, not a benchmarked industry rate.)

Data. What data was used for training and evaluation, how it was sourced, what preprocessing was applied, and what quality issues were identified. The data section should be honest about limitations: if the POC used a curated subset of the production data, the results may not generalise to the full production data distribution. Tooling such as PyTorch and Hugging Face makes it trivial to train on a clean subset and then forget to test on the messy one; name that risk explicitly.

Evaluation methodology. How the model’s output was evaluated, what metrics were used, and how the evaluation dataset was constructed. The evaluation section should distinguish between the POC evaluation (on a held-out subset of the curated data) and the expected production evaluation (on the full production data distribution, with its noise, edge cases, and drift).

Scope boundary.
What the POC did not test (integration, scale, latency, edge cases, adversarial inputs) and what the implications are for the production decision. A POC that tested the model on 500 curated examples cannot make claims about performance at 100,000 daily inferences with uncurated input. Naming the boundary is not a weakness in the report; omitting it is.

Section 2: Success criteria (what “good enough” means)

Success criteria must be defined before the POC begins, not after the results are available. Defining criteria after results creates the temptation to draw the target around the arrow: adjusting the criteria to match whatever the model achieved.

Metric definitions. What specific metrics will be used to evaluate success? For a classification task: accuracy, precision, recall, and F1 on each class, with particular attention to the metric that matters most for the business context: precision if false positives are expensive, recall if false negatives are dangerous.

Threshold values. What values of each metric constitute success? An illustrative example from our POC scoping engagements (planning heuristic, not a benchmarked industry rate): “The model must achieve at least 90% precision and 85% recall on the ‘urgent’ class, measured on a held-out test set of at least 500 examples, to justify production investment.” The thresholds should be derived from business requirements (what accuracy does the current process achieve, what accuracy does the business need) rather than from ML conventions. In our experience, 95% accuracy is not always necessary and 80% accuracy is not always sufficient.

Comparison baseline. What is the current performance: human accuracy, rule-based system accuracy, or the cost and time of the current manual process? The POC’s value is measured against this baseline, not against a theoretical ideal.
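The threshold example above can be written down as a small, checkable artifact before any results exist. The following is a minimal sketch under stated assumptions, not a prescribed implementation: `SuccessCriteria`, `precision_recall`, and `evaluate` are hypothetical names, the 90% / 85% / 500 values come from the illustrative threshold quoted above, and the `baseline_precision` value is a placeholder for whatever the current process actually achieves.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class SuccessCriteria:
    """Thresholds agreed before the POC starts (values below are illustrative)."""
    target_class: str
    min_precision: float
    min_recall: float
    min_examples: int
    baseline_precision: float  # placeholder: precision of the current process


def precision_recall(
    y_true: Sequence[str], y_pred: Sequence[str], target: str
) -> Tuple[float, float]:
    """Precision and recall for a single class, from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def evaluate(
    y_true: Sequence[str], y_pred: Sequence[str], criteria: SuccessCriteria
) -> Dict[str, object]:
    """Compare measured metrics with the predefined thresholds."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    precision, recall = precision_recall(y_true, y_pred, criteria.target_class)
    # A test set smaller than agreed cannot support a pass, whatever the metrics say.
    enough_data = len(y_true) >= criteria.min_examples
    return {
        "precision": precision,
        "recall": recall,
        "enough_data": enough_data,
        "lift_vs_baseline": precision - criteria.baseline_precision,
        "passed": enough_data
        and precision >= criteria.min_precision
        and recall >= criteria.min_recall,
    }


# Illustrative thresholds from the text; the baseline value is an assumption.
URGENT = SuccessCriteria(
    target_class="urgent",
    min_precision=0.90,
    min_recall=0.85,
    min_examples=500,
    baseline_precision=0.70,
)
```

Because the criteria object is frozen and created before evaluation, changing a threshold after seeing results requires an explicit, visible edit rather than a quiet reinterpretation.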
As an illustrative example from our consulting engagements of why the baseline matters: a model that achieves 88% accuracy is impressive against a 70% human baseline and unimpressive against a 92% rule-based baseline.

The kill criterion. A POC needs an explicit kill criterion, written down before the work starts and signed off by the budget owner. “If precision on the urgent class is below 80% on the unfiltered evaluation set, we stop and write up the data quality findings.” Without a kill criterion, sunk-cost reasoning takes over the moment the model produces anything at all. With one, the POC team has institutional cover to report a no-go honestly.

Section 3: ROI measurement (the production economics)

The ROI section translates the POC results into a production cost-benefit analysis. This is the section that determines whether the project proceeds to production, and it must be based on realistic cost estimates, not on the POC’s operating cost.

Production development cost. The engineering effort to move from POC to production: model hardening, integration development, infrastructure setup, testing, and deployment. Across our engagements this is typically on the order of 5–15× the POC effort (observed range, not a benchmarked industry rate), depending on integration complexity and infrastructure requirements. Our enterprise AI project failure analysis shows that integration is consistently the most underestimated component.

Operating cost. Infrastructure (compute, storage, networking), API costs (if using third-party model APIs), data pipeline maintenance, model monitoring, and periodic retraining. The operating cost should be projected at the expected production volume, not at the POC volume. The shape of the cost curve also changes: Triton or TensorRT serving on dedicated GPU instances behaves very differently from a batch script on a developer laptop.

Benefit quantification.
The financial value of the model’s output: cost savings (reduced labour, reduced errors, faster processing), revenue impact (improved customer experience, better targeting, higher conversion), or risk reduction (faster fraud detection, earlier equipment failure prediction). The benefit must be quantifiable: “improved customer experience” is not a benefit unless it is translated into a measurable financial outcome (reduced churn, NPS movement correlated with revenue retention).

Payback period. Time from production deployment to cumulative benefit exceeding cumulative cost. A project with a 3-month payback period is compelling; a project with a 36-month payback period requires stronger strategic justification.

Section 4: Packageable value (what the POC itself delivers)

The POC should deliver value independent of the production decision. Even if the project does not proceed to production (because the ROI is insufficient, the technical approach needs more research, or the organisation’s priorities shift), the POC should produce artifacts that the organisation can use.

Data inventory and quality assessment. The POC process typically reveals more about the organisation’s data landscape than any prior audit. The data findings (what data exists, where it lives, what quality issues it has, and what gaps exist) are valuable regardless of whether the AI project proceeds.

Baseline performance measurement. The POC establishes a measured baseline for the current process: how accurate it is, how long it takes, what it costs. This baseline informs all future improvement initiatives, AI or otherwise.

Technical feasibility determination. The POC answers, within its stated scope boundary, whether the technical approach works with the organisation’s data. A negative result (the model cannot achieve the required accuracy with the available data) is valuable, because it prevents a larger investment in a project that would have failed.

Reusable evaluation framework.
The success criteria, metrics, and evaluation methodology developed for the POC can be reused for future AI projects. The evaluation framework is an organisational capability, not a project-specific artifact. It is also the natural input to the production monitoring stack; see “MLOps for organisations that have never operationalised a model” for what happens to that framework once a project does proceed.

Anatomy of a failed POC

A European insurance company ran a 6-week POC for automated claims triage: classifying incoming claims into three urgency tiers to reduce adjuster workload. The demo was impressive: the model achieved 91% accuracy on a curated test set of 800 claims (operational measurement from that project), and stakeholders approved a £280K production build.

Production failed within three weeks of deployment. The POC’s test set had been manually cleaned by the data science team: duplicates removed, ambiguous cases excluded, and inconsistent labels corrected. The production feed contained raw submissions with missing fields, scanned handwriting, and multi-language attachments; accuracy dropped to 64% (operational measurement from that deployment). The POC had no latency benchmark; the model required 4.2 seconds per classification, but the claims platform needed sub-500ms responses for the real-time triage workflow (operational measurement from the deployment). Operating cost had been projected from the POC’s batch-processing setup at £900 per month; the production API serving architecture required GPU instances costing £7,200 per month at the actual volume of 12,000 daily claims.
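The gaps in this case are easy to make mechanical. The sketch below is one hedged way to express them as predefined gates, using only the figures reported for this project. `Gate`, `decide`, and `projection_error` are hypothetical names, and the gate limits are assumptions: the accuracy gate assumes the model must at least match the 74% rule-based triage reported for this project, and “sub-500ms” is modelled as a 0.5-second ceiling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Gate:
    """One predefined limit and the value actually measured (hypothetical helper)."""
    name: str
    measured: float
    limit: float
    higher_is_better: bool

    @property
    def passed(self) -> bool:
        if self.higher_is_better:
            return self.measured >= self.limit
        return self.measured <= self.limit


def decide(gates: List[Gate]) -> Tuple[str, List[str]]:
    """Return 'go' only if every gate passes, plus the names of failed gates."""
    failed = [g.name for g in gates if not g.passed]
    return ("no-go" if failed else "go", failed)


def projection_error(projected: float, actual: float) -> float:
    """How many times larger the actual cost is than the projected cost."""
    return actual / projected


# Measured values come from the claims-triage case; limits are assumed gates.
CLAIMS_TRIAGE_GATES = [
    # Assumption: the model must at least match the rule-based baseline (74%).
    Gate("accuracy vs rule-based baseline", measured=0.64, limit=0.74,
         higher_is_better=True),
    # Assumption: "sub-500ms" treated as a 0.5 s ceiling per classification.
    Gate("latency per classification (s)", measured=4.2, limit=0.5,
         higher_is_better=False),
]

decision, failed = decide(CLAIMS_TRIAGE_GATES)
print(decision, failed)
# Monthly operating cost in GBP: POC batch projection vs production serving.
print(projection_error(projected=900, actual=7200))
```

With the case figures, both gates fail and the production operating cost comes out at 8 times the projection, so the gate returns a no-go on the unfiltered numbers alone.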
Had the POC report contained what the four required sections call for (realistic evaluation on unfiltered data, predefined accuracy and latency thresholds, production-volume cost projection, and a baseline comparison against the existing rule-based triage, which achieved 74% accuracy at negligible cost according to an operational measurement from that project), the go/no-go decision would have been “iterate on data quality and latency” rather than “proceed to production.” The £280K production investment would have been avoided.

How POC methodology connects to production engineering

A clean POC stops one decision short of production. What it hands over is a packet: the evaluation harness, the data inventory, the baseline numbers, the predefined thresholds, and the kill criterion that was (or was not) hit. That packet is the input to the production engineering phase. The transition from POC to a hardened, monitored, retraining-capable service is its own discipline, covered in our GenAI prototype-to-production work, and it depends on the POC having produced honest numbers in the first place.

A POC that hides its scope boundary makes production engineering impossible to estimate. A POC that publishes its scope boundary makes the production phase plannable, even if the plan is “fix the data layer before you fix the model.”

FAQ

What should an AI proof of concept actually prove before an organisation commits to a full build?
It should prove that the production investment is justified, not that the model produces a plausible output on a clean demo. That means evidence on four axes: the technical approach is feasible against the real (not curated) data distribution, the success criteria were defined up front and met, the production economics work at expected volume, and the POC itself produced reusable artifacts regardless of the go/no-go outcome.

What is the difference between a demo, a prototype, and a POC, and why does each fail at a different stage?
A demo is built to convince and fails when stakeholders ask about scale. A prototype is built to explore the design space and fails when asked to substitute for quantitative evidence. A POC is built to inform a go/no-go decision and fails when it has no kill criterion, no defended thresholds, and no baseline comparison. Treating any of them as the others is the most common scoping error.

Which evaluation evidence must come out of a POC to be useful downstream?
At minimum: a documented data lineage (what was used, how it was filtered, what was excluded), a performance envelope measured on the unfiltered distribution rather than the curated subset, and an integration-risk assessment that names what the POC did not test (latency under load, edge cases, adversarial inputs). Without those three, the downstream production team cannot estimate the work honestly.

What is the realistic failure rate of AI POCs, and which scoping choices drive it?
POC failure rates published by industry analysts vary widely and are directional rather than operational; we treat them as market-direction signals, not benchmarks. In our engagements, the scoping choices that most reliably drive failure are: evaluating on curated rather than production-representative data, defining success criteria after results are available, and projecting operating cost from POC infrastructure rather than from production-volume serving.

When does a POC need a clean kill criterion, and how should it be defined up front?
Every POC needs one, and it should be written down and signed off by the budget owner before any code is written. A clean kill criterion names a specific metric, a specific threshold, and a specific evaluation set (“precision on the urgent class below 80% on the unfiltered evaluation set”). Without it, sunk-cost reasoning takes over the moment the model produces output at all.

How does an AI POC connect to the downstream production engineering work?
The POC’s outputs (evaluation harness, data inventory, baselines, thresholds) are the inputs to production engineering. Hardening, integration, monitoring, and retraining infrastructure all key off those artifacts. A POC that publishes its scope boundary honestly makes the production phase plannable; one that hides it makes the production estimate fictional.

The POC as a decision tool

The purpose of a POC is to produce a go/no-go decision with sufficient evidence. The four sections above provide the evidence structure: the technical approach was tested under defined conditions (Section 1), against predefined success criteria with a kill criterion (Section 2), with quantified production economics (Section 3), and with independently valuable deliverables (Section 4).

A POC report missing any of these sections is not a decision tool. It is a demo report dressed up as due diligence, and the £280K mistakes tend to follow.

If an AI POC needs to be structured to inform a production decision rather than to demonstrate model capability, an AI Project Risk Assessment includes POC scoping and evaluation framework design.