## Most AI POCs answer the wrong question

An AI proof of concept that ends with “the model achieved 87% accuracy on our test set” has not proven anything useful. Accuracy on a held-out test set does not tell you whether the AI will improve the business outcome it was built for, at the cost and latency the business requires, with the reliability the system needs in production. A well-designed POC ends with a clear go/no-go decision based on criteria that were defined before the project started.

## The four criteria that determine POC design

Before starting any AI POC, these four questions must have specific answers:

1. **What is the baseline?** What is the current performance of the process the AI will replace or augment? Without a baseline, there is no way to measure improvement. “We want to automate X” is not a baseline. “The current process takes 4 minutes per transaction with a 12% error rate” is a baseline.
2. **What does success look like in business terms?** Not model accuracy. Business outcome: cost per unit, time per transaction, error rate, conversion rate, false positive rate. Define the minimum threshold at which the AI creates enough value to justify deployment costs.
3. **What does failure look like?** Define the failure condition: if the AI achieves below X, we do not deploy. This is as important as the success condition. POCs without failure conditions almost always find a way to declare success.
4. **What is the path from POC to production?** If the POC succeeds, what are the next steps? Who owns deployment? What infrastructure is required? What is the budget? If there is no clear path to production at the start, a successful POC often leads to nothing.

## Typical 6-week AI POC structure

| Week | Activities | Deliverable |
|------|------------|-------------|
| 1 | Data audit, baseline measurement, success criteria finalization | Confirmed data availability, success/failure thresholds |
| 2–3 | Data preparation, initial model development | Baseline model performance |
| 4 | Model iteration, validation | Performance on evaluation set |
| 5 | Integration prototype, latency/cost measurement | End-to-end performance metrics |
| 6 | Decision review against pre-defined criteria | Go/no-go recommendation |

## What scope to avoid in a POC

A POC should not attempt to solve the full production problem. The scope should be the minimum that can demonstrate whether the core technical assumption is valid. Common scope mistakes:

- Trying to handle all edge cases during the POC
- Building full production infrastructure before validating the model
- Attempting to integrate with all downstream systems
- Optimizing model performance before validating business impact

For the broader framework of what a proof of concept should prove and how to structure it, “what an AI POC should actually prove” provides the full methodology.

## What separates a useful POC from a misleading one?

A useful AI POC answers a specific question: “Can an ML model achieve X performance on Y data under Z constraints?” A misleading POC answers a vague question: “Can AI help with our business problem?” The specificity of the question determines whether the POC’s results are actionable.

Defining success criteria before starting the POC is the most important step. Without predefined criteria, the POC becomes a demo rather than an experiment: results are interpreted favourably regardless of actual performance because there is no objective standard for comparison.
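As a concrete illustration of criteria 2 and 3, the thresholds can be written down as code before week 1 and checked mechanically at the week-6 review. The sketch below is a minimal example; the metric names and threshold values are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PocCriteria:
    """Go/no-go thresholds, frozen before week 1.

    All values here are hypothetical placeholders for illustration."""
    min_precision: float = 0.90    # below this, the AI creates too many errors
    min_recall: float = 0.80       # below this, too much work stays manual
    max_latency_ms: float = 200.0  # latency the business requires end to end
    max_cost_per_1k: float = 1.50  # inference cost ceiling, in dollars

def go_no_go(precision: float, recall: float, latency_ms: float,
             cost_per_1k: float, criteria: PocCriteria) -> str:
    """Compare measured POC results against the pre-defined criteria.

    Returns "go" only if every threshold is met. There is deliberately
    no "maybe": a POC without an explicit failure condition will find
    a way to declare success."""
    passed = (precision >= criteria.min_precision
              and recall >= criteria.min_recall
              and latency_ms <= criteria.max_latency_ms
              and cost_per_1k <= criteria.max_cost_per_1k)
    return "go" if passed else "no-go"

# Week-6 review: plug in the measured end-to-end numbers.
print(go_no_go(precision=0.92, recall=0.78, latency_ms=140.0,
               cost_per_1k=1.10, criteria=PocCriteria()))  # "no-go": recall misses 0.80
```

Encoding the criteria this way makes the week-6 review a comparison against a frozen artifact rather than a negotiation.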
We define success criteria collaboratively with business stakeholders, translating business requirements (e.g., “reduce manual review time by 50%”) into measurable model performance targets (e.g., “achieve 90% precision at 80% recall on the test dataset”).

Data representativeness is the second critical factor. A POC trained on curated, clean data demonstrates what the model can do under ideal conditions. A POC trained on representative production data demonstrates what the model will do under real conditions. The gap between these is often 10–20% in model performance, and POCs that use curated data consistently overestimate production performance.

We structure POCs in two phases:

- **Phase 1 (2 weeks):** establish a baseline using the simplest viable approach (often a non-ML heuristic) to confirm that the problem is well-defined and the data supports the task.
- **Phase 2 (2–4 weeks):** develop and evaluate the ML approach against the Phase 1 baseline. If the ML approach does not meaningfully outperform the baseline, the POC has produced a valuable negative result: the problem does not benefit from ML given the available data, and the organisation should not invest in a production ML deployment.

The two-phase structure prevents the most common POC failure: spending 6 weeks developing a complex model only to discover that the data does not support the task, or that a simple rule-based approach performs equivalently. (A sketch of this baseline-versus-ML comparison appears at the end of this section.)

Choosing the right evaluation metrics is the final critical design decision. Classification accuracy is rarely sufficient: precision and recall at specific operating points, calibration quality, and performance on minority classes matter more than aggregate accuracy for most production applications. We define metrics that align with the business cost function: if false positives are 10× more costly than false negatives, the evaluation metric should reflect this asymmetry.
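To make the cost asymmetry concrete, here is a minimal sketch of a cost-weighted error metric, assuming a binary classifier with 0/1 labels. The 10:1 cost ratio comes from the example above; the toy predictions are invented for illustration.

```python
import numpy as np

def expected_cost(y_true: np.ndarray, y_pred: np.ndarray,
                  fp_cost: float = 10.0, fn_cost: float = 1.0) -> float:
    """Average per-example cost of a classifier's errors.

    Unlike accuracy, this reflects the business cost function: with the
    defaults, a false positive hurts ten times as much as a false negative."""
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return float(fp * fp_cost + fn * fn_cost) / len(y_true)

# Two models with identical 80% accuracy, very different costs.
y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
model_a = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])  # 1 FP, 1 FN
model_b = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # 0 FP, 2 FN

print(expected_cost(y_true, model_a))  # (1*10 + 1*1) / 10 = 1.1
print(expected_cost(y_true, model_b))  # (0*10 + 2*1) / 10 = 0.2
```

Under plain accuracy the two models are tied; under the business cost function they are not, which is exactly the distinction the evaluation metric needs to capture.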
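Returning to the two-phase structure described earlier, here is a minimal sketch of the Phase 2 gate, assuming a hypothetical text-classification task. The keyword heuristic, sample data, and 5-point margin are all illustrative stand-ins.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labelled sample standing in for the POC evaluation set.
texts = [
    "please refund my order",             # 1
    "great service",                      # 0
    "cancel subscription",                # 1
    "product broke, want my money back",  # 1 -- no trigger keyword
    "where is my refund",                 # 1
    "no complaints, all good",            # 0 -- keyword fires spuriously
    "thanks a lot",                       # 0
    "I want to cancel",                   # 1
]
y_true = np.array([1, 0, 1, 1, 1, 0, 0, 1])

def phase1_heuristic(texts):
    """Phase 1 baseline: flag any text containing a trigger keyword.

    The keyword list stands in for whatever simple rule a domain
    expert could write in an afternoon."""
    keywords = ("refund", "cancel", "complaint")
    return np.array([int(any(k in t.lower() for k in keywords)) for t in texts])

baseline_f1 = f1_score(y_true, phase1_heuristic(texts))  # 0.80 on this sample

# Phase 2 evaluates the ML model on the same set with the same metric.
ml_f1 = 0.83  # placeholder for the measured Phase 2 result

MARGIN = 0.05  # must clear the baseline by 5 F1 points, fixed in advance
if ml_f1 >= baseline_f1 + MARGIN:
    print(f"ML justified: {ml_f1:.2f} vs baseline {baseline_f1:.2f}")
else:
    print(f"Negative result: heuristic F1 {baseline_f1:.2f} is good enough")
```

The margin is a judgment call; the point is that it is chosen before the Phase 2 results exist, so a negative printout is the POC doing its job, not a failure of the POC.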