Validation vs Verification in Pharma: Why the Distinction Matters for AI Systems

Two questions that sound the same but are not

Verification asks: “Did we build the system correctly?” It confirms that the software meets its design specifications — that every documented requirement has a corresponding implementation, and that every implementation produces the expected output under defined test conditions.

Validation asks a different question: “Did we build the correct system?” It confirms that the system, as implemented, meets the user’s actual needs and performs its intended function in the production environment with real users and real data. A system can pass verification and fail validation. A pharmaceutical manufacturing control system may correctly implement every documented requirement (verification passes), yet fail to prevent a specific class of process deviation that the requirements document did not anticipate (validation fails).

The two activities are complementary, not interchangeable. Treating them as a single gate is one of the more common reasons GxP audits surface findings on AI-enabled systems, and it is one of the structural reasons GAMP 5 keeps them separate in the V-model lifecycle.

The practical difference

Dimension	Verification	Validation
Question	Does it meet specifications?	Does it meet user needs?
Timing	During development	After deployment to a production-representative environment
Evidence	Test results against specifications	Performance data in operational environment
Scope	Individual requirements, functions, modules	Entire system in production context
FDA reference	Design verification (21 CFR 820.30)	Process validation (21 CFR 211.100)
Failure mode	Specification gap or coding error	Requirements gap or environmental mismatch

In regulated pharmaceutical environments, both activities produce documented evidence — verification generates test protocols and execution records; validation generates qualification protocols (IQ/OQ/PQ) and summary reports. Regulatory inspectors review both, and they expect to see traceability from user requirements through verification artifacts into validation evidence.
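The traceability expectation can be illustrated with a small check that flags requirements lacking evidence on either side. This is a minimal sketch, not a prescribed tool: the Requirement schema, the function name, and the identifiers (URS-001, TP-014, PQ-003 and so on) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One user requirement and the evidence linked to it (hypothetical schema)."""
    req_id: str
    verification_refs: list[str] = field(default_factory=list)  # e.g. test protocol IDs
    validation_refs: list[str] = field(default_factory=list)    # e.g. OQ/PQ protocol IDs


def traceability_gaps(requirements: list[Requirement]) -> dict[str, list[str]]:
    """Return the requirement IDs that lack verification or validation evidence."""
    gaps = {"missing_verification": [], "missing_validation": []}
    for req in requirements:
        if not req.verification_refs:
            gaps["missing_verification"].append(req.req_id)
        if not req.validation_refs:
            gaps["missing_validation"].append(req.req_id)
    return gaps


# Illustrative matrix; all IDs are made up.
matrix = [
    Requirement("URS-001", ["TP-014"], ["PQ-003"]),
    Requirement("URS-002", ["TP-015"], []),   # verified but never validated
    Requirement("URS-003", [], ["PQ-004"]),   # validated without verification
]
gaps = traceability_gaps(matrix)
print(gaps)
```

A check like this only proves that links exist; whether the linked evidence actually demonstrates the requirement remains a review task.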

How AI systems change the boundary

For traditional deterministic software, verification is relatively straightforward: define inputs, document expected outputs, run tests, compare results. If outputs match expectations within tolerance, verification passes. The work is bounded and the evidence is reproducible.

Machine learning models break this pattern. An ML model for pharmaceutical tablet inspection does not have an input-output mapping that can be written into a specification case by case. Its outputs depend on the model weights (which change with training data), the input distribution (which varies with production conditions), and the inference environment (CUDA versions, TensorRT engine builds, ONNX Runtime configuration), which may differ between development and production. In our experience deploying vision models on the manufacturing floor, the environment-shift problem alone is enough to invalidate a naive verification regime: a model that passed every test on the development workstation can lose performance the moment it meets a different camera, a different lighting rig, or a different batch of product.

Verification of an ML model in a GxP setting must therefore address at least three layers:

Architecture verification — does the model architecture (layer types, parameter count, activation functions) match the design specification?
Training verification — was the model trained on the specified dataset version, with the specified hyperparameters, for the specified number of epochs, against the specified loss function?
Performance verification — does the model achieve the specified accuracy, precision, and recall on a frozen, held-out test set?
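The training-verification layer in particular reduces to comparing a recorded run against the design specification. The sketch below assumes the specification and the run record are both available as flat dictionaries and that the dataset version is pinned by a content hash; the key names and values are hypothetical, not taken from any particular MLOps tool.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Content hash used to pin a dataset version."""
    return hashlib.sha256(data).hexdigest()


def verify_training_run(spec: dict, run: dict) -> list[str]:
    """Compare a recorded training run against the design specification.

    Returns human-readable deviations; an empty list means the run matches.
    """
    deviations = []
    for key, expected in spec.items():
        actual = run.get(key)
        if actual != expected:
            deviations.append(f"{key}: expected {expected!r}, got {actual!r}")
    return deviations


# Stand-in bytes for the real dataset file (illustrative only).
dataset = b"image_001,ok\nimage_002,defect\n"
spec = {
    "dataset_sha256": sha256_of_bytes(dataset),
    "learning_rate": 1e-3,
    "epochs": 40,
    "loss": "cross_entropy",
}
run = dict(spec, epochs=35)  # a run that stopped five epochs early
deviations = verify_training_run(spec, run)
print(deviations)
```

Each reported deviation would normally be handled as a documented discrepancy rather than silently corrected.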

Validation for the same ML model addresses a different set of questions:

Operational performance — does the model maintain its test-set performance when deployed against real-time data in the production environment?
Robustness — does the model perform acceptably across the range of conditions it will encounter (lighting variation, product variation, equipment aging, seasonal raw-material changes)?
Drift detection — will the monitoring system detect, and alert on, the point at which model performance degrades below the documented acceptance threshold?
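One common proxy for drift on the input side is the Population Stability Index (PSI), which compares the binned distribution of a feature in live data against a reference sample. The text above does not name a specific method, so PSI is a swapped-in illustration; the bin edges, the synthetic data, and the 0.2 alert level (a widely quoted rule of thumb, not a regulatory value) are all assumptions.

```python
import math


def bin_proportions(values, edges, eps=1e-6):
    """Share of values in each bin; eps avoids log(0) for empty bins.

    edges must be sorted ascending; a value equal to an edge falls in the lower bin.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v > e for e in edges)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(reference, live, edges):
    """Population Stability Index between a reference sample and a live sample."""
    ref = bin_proportions(reference, edges)
    cur = bin_proportions(live, edges)
    return sum((c - r) * math.log(c / r) for c, r in zip(cur, ref))


# Synthetic feature values, e.g. mean image brightness per tablet (illustrative).
reference = list(range(100))
live = [v + 20 for v in reference]  # the whole distribution shifted upward
edges = [25, 50, 75]

PSI_ALERT = 0.2  # assumed alert level; set the real one during qualification
score = psi(reference, live, edges)
alert = score > PSI_ALERT
print(f"PSI={score:.3f}, alert={alert}")
```

Input drift does not by itself prove that performance has degraded, so a monitor of this kind complements, rather than replaces, scoring against ground truth.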

The distinction between Computer Software Assurance (CSA) and full Computerized System Validation (CSV) approaches determines how these activities are documented and how much effort is proportionate for a given system’s risk class. Under the GAMP 5 Second Edition framework, an AI/ML component does not automatically inherit Category 5 (custom) status across both verification and validation: the deterministic infrastructure around the model often classifies separately from the model itself.

Why is continuous validation needed for AI/ML?

Traditional CSV treats validation as a one-shot activity: validate at release, revalidate at change. That model assumes the system under test is static between change events. ML models violate the assumption directly. A model whose training data is refreshed monthly, or whose inference behaviour depends on a feature distribution that shifts with the production line, is not the same system from one quarter to the next, even when no code has changed.

Continuous validation is the structural answer. Rather than treating validation as an event, it treats validation as an ongoing measurement programme: representative samples are scored against ground truth on a defined cadence, performance metrics are tracked against the acceptance thresholds set during initial qualification, and revalidation triggers are tied to measurable drift rather than to calendar dates. This pattern has been emerging across MLOps practice for several years and is now being codified in ISPE’s AI maturity guidance.
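A drift-tied revalidation trigger can be sketched as a rolling accuracy monitor over ground-truth-scored samples. The class name, the window size, and the 0.95 threshold below are hypothetical; in practice the threshold would be the acceptance value fixed during initial qualification.

```python
from collections import deque


class PerformanceMonitor:
    """Rolling accuracy over ground-truth-scored samples (hypothetical design)."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # oldest outcomes fall off automatically

    def record(self, prediction, ground_truth) -> None:
        """Store whether one scored sample was classified correctly."""
        self.outcomes.append(prediction == ground_truth)

    def rolling_accuracy(self):
        """Accuracy over the window, or None until the window is full."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def revalidation_due(self) -> bool:
        """True once the full-window accuracy falls below the threshold."""
        acc = self.rolling_accuracy()
        return acc is not None and acc < self.threshold


monitor = PerformanceMonitor(threshold=0.95, window=20)
for _ in range(20):
    monitor.record("ok", "ok")          # 20 correct calls fill the window
due_before = monitor.revalidation_due()

for _ in range(2):
    monitor.record("ok", "defect")      # two missed defects enter the window
due_after = monitor.revalidation_due()
print(due_before, due_after)
```

Returning None until the window is full is a deliberate choice here: it avoids raising a trigger from a handful of early samples.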

The mechanical implication is that the verification/validation evidence trail does not close at go-live. It continues for the life of the model, and the monitoring infrastructure (drift detectors, performance dashboards, alert routing into the quality system) becomes a validated subsystem in its own right.

Getting the sequence right

Verification before validation is not just a best practice; it is a logical dependency. Validating a system that has not been verified means testing whether the system meets user needs without first confirming that it was built correctly. If validation fails, you cannot determine whether the failure is a requirements gap (a validation problem) or an implementation error (a verification problem) without going back to verify first. Inspectors notice this, and a tangled verification/validation record is one of the easier findings for an auditor to write.

For AI systems in pharmaceutical manufacturing, the sequence we run is: verify the training pipeline → verify model architecture and the trained-model performance on a frozen test set → validate in a production-representative environment → deploy with continuous monitoring → revalidate when drift, retraining, or scope change is detected. Each step produces documented evidence that feeds the regulatory submission package, and the continuous-monitoring layer produces the evidence that keeps the qualification live between formal revalidations.
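The ordering dependency can be made explicit as a chain of gates in which a failed stage blocks everything after it. This is a sketch under assumptions: the function name is hypothetical and the lambdas stand in for real checks that would each produce documented evidence.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]


def run_gates(stages: list[Stage]) -> list[tuple[str, str]]:
    """Run gates in order; once one fails, later gates are marked 'blocked'."""
    results = []
    blocked = False
    for name, check in stages:
        if blocked:
            results.append((name, "blocked"))  # never executed
            continue
        passed = check()
        results.append((name, "pass" if passed else "fail"))
        blocked = not passed
    return results


# Placeholder checks; the second stage is made to fail for illustration.
stages = [
    ("verify training pipeline", lambda: True),
    ("verify architecture and frozen-test-set performance", lambda: False),
    ("validate in production-representative environment", lambda: True),
    ("deploy with continuous monitoring", lambda: True),
]
results = run_gates(stages)
for name, status in results:
    print(f"{status:8s} {name}")
```

Because validation is never attempted after a verification failure, any later validation failure can be attributed to a requirements or environment gap rather than to an implementation error.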

How do verification and validation apply differently to AI systems?

For traditional software, verification confirms that each module performs its specified function (unit tests, integration tests, code reviews), and validation confirms that the complete system meets user needs (user acceptance testing, performance qualification). The distinction is clear because traditional software has deterministic specifications that can be verified component by component.

AI systems complicate this distinction in a specific way. The ML model does not have individual function specifications in the traditional sense — it has training objectives and aggregate performance metrics. Verifying that the model was trained correctly (correct data version, correct hyperparameters, correct evaluation procedure) is tractable and looks like classical configuration management. Verifying that the model produces correct outputs for specific inputs, however, requires a test dataset with known-correct labels — which is structurally closer to validation than to verification, even though it lives inside the verification phase of the V-model.

Our approach in GxP engagements is to separate the AI system into two qualification domains. The deterministic software infrastructure (data pipelines, API endpoints, user interfaces, audit-trail logging, role-based access control) undergoes traditional verification and validation against documented specifications. The ML model undergoes performance qualification: a structured evaluation on a representative test dataset that is strictly independent of the training data, governed by acceptance criteria defined before testing (not adjusted after seeing results), with statistical analysis of the relevant performance metrics and documented confidence intervals.
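A pre-registered acceptance criterion with a confidence interval might look like the following. The Wilson score interval is one standard choice for a binomial proportion and is used here as an illustration, not as the method the engagements necessarily use; the counts and the 0.95 lower-bound criterion are made-up numbers.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Acceptance criterion fixed BEFORE the test set is scored (assumed value).
MIN_RECALL_LOWER_BOUND = 0.95

# Illustrative result: 487 of 500 known-defective tablets were flagged.
detected, total_defects = 487, 500
lower, upper = wilson_interval(detected, total_defects)
passed = lower >= MIN_RECALL_LOWER_BOUND
print(f"recall={detected / total_defects:.3f}, 95% CI=({lower:.3f}, {upper:.3f}), pass={passed}")
```

Judging the lower bound rather than the point estimate means a small test set cannot pass on a lucky sample, which is one reason to fix the test-set size along with the criterion.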

This dual-domain approach satisfies regulatory expectations because inspectors see the familiar V-model applied to the software components, supplemented by a performance qualification study for the AI-specific component. The performance qualification record is structured to map cleanly onto the GAMP 5 evidence expectations: design specifications for the model (architecture, training data, acceptance criteria), execution records (training run logs, evaluation results), and a qualification summary that an inspector can read without needing to understand the loss function.

FAQ

How is AI/ML software classified under GAMP 5 — Category 3, 4, 5, or something new?

GAMP 5 Second Edition does not introduce a new category specifically for AI/ML. Instead, classification follows the same Category 3 / 4 / 5 logic applied to the deterministic infrastructure, while the ML model itself is treated as a configured or custom component depending on how it was built. A vendor model used as-is leans Category 3 with model-specific qualification; a fine-tuned model on proprietary data typically lands at Category 4 or 5 with performance qualification layered on top.

What does a GAMP 5 validation lifecycle look like for a continuously-retrained AI model?

The V-model is preserved but extended. Initial qualification (IQ/OQ/PQ) happens at first deployment, and the retraining pipeline itself becomes a validated subsystem. Each retraining event triggers a controlled requalification of the new model against frozen acceptance criteria, supported by continuous monitoring of operational performance against those same criteria between formal events.

Why is continuous validation needed for AI/ML, and how does it differ from one-shot validation?

One-shot validation assumes the system is static between formal changes. ML models are not static — their behaviour can shift as input distributions drift or as retraining introduces new weights. Continuous validation replaces the assumption of stasis with an ongoing measurement programme: drift detection, performance tracking against thresholds, and revalidation triggers tied to measured behaviour rather than calendar cadence.

What evidence is required at each GAMP 5 V-model phase when the system under test is a model?

User requirements name the intended use, the operating envelope, and the acceptance thresholds. Functional and design specifications describe the model architecture, training data lineage, and inference environment. Verification produces architecture, training, and performance evidence on a frozen test set. Validation (PQ) produces operational performance evidence in the production-representative environment, plus the monitoring configuration that keeps the qualification live.

How do GAMP 5’s risk-based controls map onto AI-specific risks like data drift, hallucination, and training-data quality?

Risk assessment is done against the patient-safety, product-quality, and data-integrity axes that GAMP 5 already uses; AI-specific failure modes are added as risk sources under those axes. Data drift maps onto data-integrity and product-quality risk; hallucination in generative components maps onto patient-safety risk where outputs feed clinical or release decisions; training-data quality maps onto a combination of data-integrity and process-validation risk. The control set (data lineage, drift monitoring, human-in-the-loop review, model versioning) is selected proportionally to the assessed risk.

Where does the ISPE GAMP AI guidance change the classic GAMP 5 categorisation for ML software?

ISPE’s AI guidance keeps the Category 3/4/5 frame but layers an AI maturity assessment on top, recognising that the validation burden is shaped as much by how the model evolves as by how it was originally built. The practical effect is that a static, vendor-supplied model can be qualified more lightly than a continuously-retrained custom model even when both look like Category 5 under a strict reading of the original GAMP 5 text.