The demo worked perfectly. In one of our CV deployment projects, the object detection model scored 94% mAP on the evaluation dataset. The integration test passed. The stakeholder demo was clean: bounding boxes appeared where they should, confidence scores were high, and the engineering team felt ready to deploy. Four weeks into production, the false-positive rate was three times higher than testing had predicted, the model missed an entire class of defect it had never encountered in training, and the operations team was spending more time managing the model’s errors than it had spent on the manual process the model replaced.

This is not an unusual outcome. It is the expected outcome when an off-the-shelf model (YOLO, Faster R-CNN, EfficientDet, or any other pre-trained detection architecture) is deployed into a production environment that differs from its training conditions in ways that benchmark evaluation does not measure. The failure is not in the model architecture. The failure is in the assumption that benchmark accuracy transfers to production reliability.

Where does the accuracy gap come from?

The gap between benchmark performance and production performance has specific, identifiable causes. Understanding these causes is the difference between diagnosing a deployment failure retroactively and preventing it structurally.

Lighting and environmental variation. Benchmark datasets are typically captured under controlled conditions: consistent lighting, stable backgrounds, uniform image quality. Production environments are not controlled in the same way. A warehouse camera operates under fluorescent lighting that shifts colour temperature across the day. An outdoor surveillance system contends with shadows, glare, weather, and seasonal lighting changes. A manufacturing inspection station has lighting that degrades as bulbs age.
Each of these variations introduces a distribution shift between the training data and the production data, and the model’s accuracy tends to degrade as that shift grows, often without any visible error signal until someone audits the results.

Class distribution mismatch. Benchmark datasets are typically class-balanced: roughly equal numbers of examples per category, or at least a distribution that is representative of the evaluation task. Production environments rarely are. In our manufacturing CV engagements, 97–99% of units in quality control are defect-free, so the positive class (defect present) is extremely rare (an observed range, not a benchmarked industry rate). A model trained on a balanced dataset will produce a different precision-recall trade-off in production than it showed during evaluation, because the base rate of the positive class has dropped by an order of magnitude or more. The practical consequence: a 1% false-positive rate that looked acceptable in evaluation becomes operationally problematic when it is applied to millions of units per month, because nearly every unit is good and every false alarm needs handling (an observed pattern across our CV engagements, not a benchmarked industry rate).

Domain-specific failure modes. Every deployment domain has failure classes that are specific to its operational context and that off-the-shelf models have never seen. A retail shelf monitoring system encounters products that partially occlude each other, promotional displays that change the visual context weekly, and product packaging redesigns that change the appearance of items the model was trained to recognise. A medical imaging system encounters imaging artifacts, patient positioning variations, and pathology presentations that differ from the training distribution. These are not edge cases. In our experience, they are the normal operating conditions of the domain, and they are invisible to a model that was trained on a generic or cross-domain dataset.
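The class distribution point above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `expected_alerts`, the 95% recall, the 2% defect rate, and the one-million-unit monthly volume are assumptions chosen for the example, not measurements from any deployment.

```python
def expected_alerts(units, defect_rate, recall, false_positive_rate):
    """Expected true alerts, false alerts, and precision for a binary defect detector."""
    defective = units * defect_rate
    good = units - defective
    true_alerts = defective * recall              # defects correctly flagged
    false_alerts = good * false_positive_rate     # good units incorrectly flagged
    precision = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, precision


# Balanced evaluation set (assumed): half of the units are defective.
_, _, eval_precision = expected_alerts(10_000, 0.50, 0.95, 0.01)

# Hypothetical production line: 2% defective, one million units per month.
true_alerts, false_alerts, prod_precision = expected_alerts(1_000_000, 0.02, 0.95, 0.01)

print(f"evaluation precision: {eval_precision:.3f}")
print(f"production precision: {prod_precision:.3f}")
print(f"false alerts per month: {false_alerts:,.0f}")
```

Under these assumed numbers, the same 1% false-positive rate and 95% recall give roughly 99% precision on the balanced set but about 66% on the production mix, with close to 9,800 good units flagged per month. The model has not changed; only the base rate has.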
Why testing on a held-out set does not catch these failures

The standard ML evaluation methodology (train on one portion of the dataset, evaluate on a held-out portion) measures the model’s ability to generalise within the training distribution. It does not measure the model’s ability to generalise to a different distribution, which is exactly what production deployment requires.

A held-out test set drawn from the same dataset as the training data shares the same lighting conditions, the same class distribution, the same domain characteristics, and the same failure modes. Evaluating on this set tells you how well the model has learned the dataset. It does not tell you how the model will behave when the camera angle changes, the lighting shifts, the product mix evolves, or a defect type appears that was not represented in the training data.

We encounter this pattern regularly: a team evaluates a model on a held-out set, reports strong metrics, deploys to production, and discovers that production accuracy is 10–20 percentage points below evaluation accuracy. The team’s first instinct is usually to retrain with more data or try a different architecture. In our experience, the more productive first step is to characterise the distribution gap between training data and production data, because the gap, once identified, often reveals specific correctable causes (lighting normalisation, class rebalancing, domain-specific augmentation) rather than requiring a wholesale model replacement.

What production-grade evaluation actually requires

Moving from benchmark evaluation to production evaluation requires testing against the actual conditions of deployment, not against a subset of the training distribution.

Environment-representative test data. The evaluation dataset must be captured from the production environment: same cameras, same lighting, same operating conditions, same class distribution.
If the production environment changes across shifts, seasons, or product cycles, the evaluation dataset must include samples from each variant. This is more expensive to construct than a curated benchmark dataset, but it is the only evaluation approach that predicts production performance.

Domain-specific metrics. Overall accuracy and mAP are useful for architecture comparison but insufficient for production decision-making. Production evaluation requires metrics that map to operational impact: false-positive rate at the operating threshold (how many good items will be incorrectly flagged?); false-negative rate per defect class (which defect types will be missed?); performance across data subsets (does the model degrade for specific product variants, lighting conditions, or time periods?); and latency under production load (can the model maintain throughput at line speed?). These metrics are not exotic. They are the questions that the operations team will ask after deployment, and answering them before deployment prevents the discovery phase from happening in production.

Out-of-distribution behaviour characterisation. What happens when the model encounters an input it was not trained on? Does it assign a low confidence score (desirable: the system can flag uncertain cases for human review) or a high confidence score on an incorrect class (dangerous: the system fails silently)? Characterising this behaviour before deployment requires deliberately testing with inputs that fall outside the training distribution: novel objects, adversarial lighting, corrupted images. The model’s behaviour on these inputs determines whether it fails safely or fails silently, which is the difference between a production system that degrades gracefully and one that produces undetected errors.

The quality control workflows that integrate AI and computer vision depend entirely on this production-grade evaluation.
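Several of the metrics named above can be computed from a plain table of evaluation records. The following is a minimal sketch, assuming each inspected item has already been reduced to one top prediction with a confidence score. The `Record` type, the label names, the `"ood"` marker for deliberate out-of-distribution probes, and the 0.5 threshold are all hypothetical choices for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    true_label: str    # "ok", a defect class name, or "ood" for a deliberate OOD probe
    pred_label: str    # model's top prediction: "ok" or a defect class name
    confidence: float  # model's confidence in pred_label


def flagged(r, threshold):
    """An item is flagged when a defect is predicted at or above the threshold."""
    return r.pred_label != "ok" and r.confidence >= threshold


def operational_metrics(records, threshold):
    good = [r for r in records if r.true_label == "ok"]
    ood = [r for r in records if r.true_label == "ood"]
    by_class = defaultdict(list)
    for r in records:
        if r.true_label not in ("ok", "ood"):
            by_class[r.true_label].append(r)

    # Share of good items that would be incorrectly flagged.
    fpr = sum(flagged(r, threshold) for r in good) / len(good) if good else None
    # Share of each defect class that is not flagged at all (missed).
    fnr = {c: sum(not flagged(r, threshold) for r in rs) / len(rs)
           for c, rs in by_class.items()}
    # Share of OOD probes that receive a confident prediction of any label.
    confident_ood = sum(r.confidence >= threshold for r in ood) / len(ood) if ood else None
    return {
        "false_positive_rate": fpr,
        "false_negative_rate_by_class": fnr,
        "confident_on_ood_rate": confident_ood,
    }


# Toy records, invented for the example.
records = [
    Record("ok", "ok", 0.98), Record("ok", "scratch", 0.91),
    Record("ok", "ok", 0.95), Record("ok", "dent", 0.40),
    Record("scratch", "scratch", 0.93), Record("scratch", "ok", 0.88),
    Record("dent", "dent", 0.85), Record("dent", "dent", 0.90),
    Record("ood", "scratch", 0.97), Record("ood", "ok", 0.30),
]
metrics = operational_metrics(records, threshold=0.5)
print(metrics)
```

The same grouping could be repeated per product variant, lighting condition, or time period to obtain the subset breakdown. Latency under load needs a separate measurement and is not covered by this sketch.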
A model that has not been evaluated against production conditions is a model whose production failure rate is unknown: not zero, unknown.

The production readiness question

The decision to deploy a computer vision model is not a binary pass/fail on a benchmark. It is an assessment of whether the model, the data pipeline, the deployment infrastructure, and the monitoring systems are collectively ready to operate reliably under production conditions, with known and documented performance characteristics, not aspirational ones.

Off-the-shelf models are useful starting points. Transfer learning from pre-trained architectures (ResNet, EfficientNet, Vision Transformers) reduces training time and data requirements. The failure is not in using these architectures. It is in deploying them without production-representative evaluation, without domain-specific fine-tuning, and without monitoring infrastructure that detects when production conditions drift away from training conditions.

Evidence and source notes

10–20 percentage point accuracy drop in production vs evaluation: observed consistently across our production CV deployments where training data did not match production conditions (lighting, class distribution, domain-specific variation).

97–99% of manufacturing units are defect-free, creating extreme class imbalance that shifts precision-recall trade-offs far from benchmark evaluation conditions: standard observation in manufacturing QC literature and our deployment experience.

False-positive rate acceptable at 1% in evaluation becomes operationally costly at production scale: consequence of applying balanced-dataset thresholds to imbalanced production distributions (an observed pattern across our CV engagements; millions of units/month).
Held-out test sets share the same distribution as training data: a fundamental limitation of standard ML evaluation methodology; production deployment requires evaluation against the deployment distribution, not the training distribution.

Silent high-confidence failures on out-of-distribution inputs: characterised through deliberate OOD testing in our production readiness assessments; models that fail silently (high confidence, wrong class) are more dangerous than models that fail visibly (low confidence).

The gap between evaluation-set accuracy and production reliability is where most CV deployment surprises originate. A Production CV Readiness Assessment quantifies that gap before it becomes an operational cost.