How to Classify and Validate AI/ML Software Under GAMP 5 in GxP Environments

GAMP 5 was not designed for software that learns

The original GAMP 5 framework (2008) classifies software into categories based on complexity and configurability. Category 1 is infrastructure software (operating systems, database engines). Category 3 is non-configured products used as-is. Category 4 is configured products — ERP systems, LIMS, MES configured for the specific facility. Category 5 is custom-developed software built specifically for the intended use. Each category carries a prescribed validation approach: lower categories require less testing; higher categories require more.

This classification assumes a fundamental property of traditional software: deterministic behaviour. The same input produces the same output, the behaviour is fully defined by the code, and the validation evidence from version 1.0 remains valid until someone changes the code. An ML model violates all three assumptions. It learns from data rather than being explicitly programmed. Its behaviour is shaped by the training dataset, not just the source code. And that behaviour changes every time the model is retrained on new data — which is the expected operational mode, not an exception.

The regulatory landscape reflects this shift. The FDA reports that over 1,000 AI/ML-enabled medical devices have received regulatory authorisation as of 2025 — per the FDA’s Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices list, updated October 2024 (published-survey) — with the majority requiring validation approaches beyond traditional GAMP 5 categories. ISPE reports that the GAMP 5 Second Edition (2022) is now the de facto validation framework across 40+ countries, with a Community of Practice of over 10,000 members (published-survey, ISPE 2024).

Forcing an ML model into Category 4 or Category 5 without acknowledging these differences produces one of two failures. The first is a validation approach that tests the wrong properties, verifying deterministic input-output behaviour that the model was not designed to exhibit. The second is a revalidation burden so heavy that every model update triggers a months-long validation cycle, which makes the system unmaintainable in practice. We have seen both outcomes: across our pharma engagements, the two failure modes appear with roughly equal frequency (observed-pattern, not a benchmarked rate).

The Second Edition reframe

The GAMP 5 Second Edition (2022) and the accompanying ISPE GAMP guidance for AI/ML systems address this gap directly. The core change is a shift from category-based validation (which type of software is this?) to risk-based validation (what is the impact if this system fails?).

For AI/ML systems, the Second Edition establishes several principles that the original framework did not accommodate.

Critical thinking over prescriptive testing. The Second Edition explicitly advocates “critical thinking” in validation planning — assessing what needs to be tested based on risk, rather than following a prescribed set of test types based on software category. For an ML model in a GxP environment, this means the validation plan should focus on the failure modes that matter (model drift, data distribution shift, adversarial inputs, performance degradation over time) rather than on verifying input-output pairs that a deterministic system would produce.

Unscripted testing as a valid approach. Traditional CSV relies heavily on scripted test cases: pre-defined inputs with expected outputs, executed and documented in traceability matrices. The Second Edition recognises that unscripted testing — exploratory testing, error-based testing, and scenario-based testing — is valid for moderate- and lower-risk systems. For ML models, unscripted testing is often more informative than scripted testing: exploring model behaviour at class boundaries, testing with adversarial or out-of-distribution inputs, and evaluating performance across data subsets (sliced evaluation) reveals weaknesses that scripted pass/fail tests would miss.
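Sliced evaluation is easy to illustrate. The sketch below is a minimal example, not a prescribed method: the record layout (slice, label, prediction), the vial-type slices, and the 0.90 minimum are all assumptions made up for the illustration.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Return {slice_name: accuracy} for records with 'slice', 'label', 'prediction' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["slice"]] += 1
        if r["label"] == r["prediction"]:
            correct[r["slice"]] += 1
    return {name: correct[name] / total[name] for name in total}

def weak_slices(records, minimum):
    """Names of slices whose accuracy falls below the (assumed) acceptance minimum."""
    return sorted(name for name, acc in accuracy_by_slice(records).items() if acc < minimum)

# Hypothetical inspection results, sliced by vial type.
records = (
    [{"slice": "clear_vial", "label": "pass", "prediction": "pass"}] * 19
    + [{"slice": "clear_vial", "label": "fail", "prediction": "pass"}] * 1
    + [{"slice": "amber_vial", "label": "fail", "prediction": "fail"}] * 7
    + [{"slice": "amber_vial", "label": "fail", "prediction": "pass"}] * 3
)

print(accuracy_by_slice(records))
print(weak_slices(records, minimum=0.90))
```

In this made-up data the aggregate accuracy is 26 out of 30, which hides the fact that the amber-vial slice is correct only 7 times out of 10. That gap is the kind of weakness a single scripted pass/fail test on the whole dataset would not surface.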

Continuous validation. This is the most significant departure from the original framework. Traditional validation is a point-in-time event: validate once, maintain through change control. ML models that are retrained on new data (the normal operating mode for production ML systems) require continuous validation: ongoing performance monitoring against documented acceptance criteria, with automated alerts when performance degrades. A GxP validation framework that accommodates AI must therefore treat monitoring infrastructure as a validation component, not as a post-validation operational concern.
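As a rough sketch of what monitoring against documented acceptance criteria can look like, the example below checks one monitoring window of confusion-matrix counts against minimum precision and recall. The class name, function names, and the 0.95 and 0.98 thresholds are assumptions for illustration; real criteria come from the validation plan.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AcceptanceCriteria:
    min_precision: float
    min_recall: float

def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def check_window(tp, fp, fn, criteria):
    """Return alert messages for one monitoring window; an empty list means within criteria."""
    precision, recall = precision_recall(tp, fp, fn)
    alerts = []
    if precision < criteria.min_precision:
        alerts.append(f"precision {precision:.3f} below {criteria.min_precision}")
    if recall < criteria.min_recall:
        alerts.append(f"recall {recall:.3f} below {criteria.min_recall}")
    return alerts

# Assumed criteria and assumed counts for one review window.
criteria = AcceptanceCriteria(min_precision=0.95, min_recall=0.98)
alerts = check_window(tp=96, fp=2, fn=4, criteria=criteria)
print(alerts)
```

Here precision (96 of 98) is within the assumed criterion while recall (96 of 100) is not, so one alert is raised. In a validated system the alert would feed the documented response protocol rather than a print statement.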

How do you classify an AI/ML system under the current framework?

The practical classification of an AI/ML system under GAMP 5 Second Edition follows the risk-based approach rather than the category-based approach. The methodology has four steps, each producing a specific artifact that feeds the next.

Step 1: Define the intended use. What does the AI/ML system do in the GxP context? This must be specific: “The system classifies visual inspection images of sterile injectable products as pass or fail, with the classification used to support — but not replace — the human inspector’s release decision.” The intended use statement bounds the validation scope — the system is validated for what it is intended to do, not for everything it could theoretically do. A vague intended use statement (such as “the model assists quality decisions”) makes risk-proportionate validation impossible because the scope has no boundary.

Step 2: Assess the GxP impact. Using the three-dimension framework — product quality impact, patient safety impact, data integrity impact — classify the system’s GxP scope. This determines the overall risk tier and the proportionate validation intensity. A model that classifies images as a confirmatory check on a human inspector occupies a different risk tier than the same model deployed for autonomous release.

Step 3: Identify the ML-specific risks. Beyond the standard GxP risks that apply to any software system, ML systems introduce specific risk categories that must be assessed. Training data risk: is the training data representative of the production environment, is it labelled consistently, has it been audited for bias or gaps? Model drift risk: how quickly does the model’s performance degrade when the production data distribution changes, and what is the monitoring strategy for detecting drift? Retraining risk: when the model is retrained, how is the new version validated, and what acceptance criteria must the retrained model meet before it replaces the production version? Explainability risk: can the model’s decisions be understood well enough to investigate failures? For GxP-critical systems, the quality team must be able to determine why the model produced a specific output — not at the individual-weight level, but at the feature-importance or decision-boundary level.

Step 4: Design the validation approach proportionate to the risk. High-risk ML systems (direct GxP impact, autonomous decisions) receive comprehensive validation with documented acceptance criteria, scripted and unscripted testing, and mandatory continuous monitoring. Moderate-risk systems (supporting GxP decisions, with human oversight) receive risk-based testing focused on the ML-specific risks identified in Step 3. Low-risk systems (minimal GxP impact, fully mitigated by other controls) receive minimal validation — typically a documented risk assessment and performance verification against basic acceptance criteria.

Risk-to-validation mapping

GxP impact tier	Example	Validation intensity	Continuous monitoring
High — autonomous GxP decision	Model auto-releases batches without human review	Comprehensive scripted + unscripted, sliced evaluation, full traceability	Mandatory; drift + performance alerts with documented response protocol
Moderate — supports GxP decision with human oversight	Model flags inspection images for human reviewer	Risk-based testing on identified ML-specific failure modes	Mandatory; performance monitoring against acceptance criteria
Low — non-critical or fully mitigated by other controls	Model prioritises maintenance work orders	Documented risk assessment + basic acceptance criteria	Recommended; periodic review sufficient

The table is a starting point for classification conversations, not a substitute for the four-step methodology. The intended-use statement determines which row applies.
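The mapping above can be condensed into a small lookup, shown here only as a conversation aid. The three boolean inputs are a deliberate simplification and an assumption of this sketch; the four-step methodology, not this function, determines the tier.

```python
from enum import Enum

class Tier(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"

# Condensed from the mapping table: (validation intensity, continuous monitoring).
VALIDATION_PLAN = {
    Tier.HIGH: ("comprehensive scripted + unscripted, sliced evaluation, full traceability", "mandatory"),
    Tier.MODERATE: ("risk-based testing on identified ML-specific failure modes", "mandatory"),
    Tier.LOW: ("documented risk assessment + basic acceptance criteria", "recommended"),
}

def classify_tier(gxp_impact, autonomous, fully_mitigated=False):
    """Simplified tiering: no GxP impact or full mitigation -> LOW; autonomous -> HIGH."""
    if not gxp_impact or fully_mitigated:
        return Tier.LOW
    return Tier.HIGH if autonomous else Tier.MODERATE

# Example: a model that flags inspection images for a human reviewer.
tier = classify_tier(gxp_impact=True, autonomous=False)
intensity, monitoring = VALIDATION_PLAN[tier]
print(tier.value, "|", intensity, "|", monitoring)
```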

The ISPE AI maturity model

The ISPE GAMP guidance for AI/ML introduces a maturity model for pharmaceutical organisations adopting AI. The model is useful not as a prescriptive roadmap but as a diagnostic: it identifies where an organisation’s current practices have gaps relative to the regulatory expectations for AI in GxP environments.

The maturity levels relevant to validation are awareness, defined, and managed.

At awareness, the organisation recognises that AI/ML systems require different validation approaches than deterministic software, but has not yet developed policies or procedures. Most pharmaceutical companies that have deployed AI in non-GxP contexts (scheduling, supply chain) but not yet in GxP contexts are at this level. In our work with pharma organisations, this is the most common starting point (observed-pattern, across our engagements; not a benchmarked rate).

At defined, the organisation has developed policies for AI/ML validation — including risk assessment templates, acceptance criteria guidelines, and change control procedures for model retraining. The policies are documented but may not yet have been tested through a production GxP deployment.

At managed, the organisation has deployed AI/ML in GxP contexts using the defined policies, has validated at least one system through the full lifecycle, and has operational experience with continuous monitoring, drift detection, and model retraining under change control. This is the level at which the organisation has practical evidence — not just policy documents — that its AI validation approach works.

The practical value of the maturity model is in identifying the specific gaps between an organisation’s current state and the managed level. For organisations at the awareness level, the gap is policy development. For organisations at the defined level, the gap is operational experience — which is best acquired through a first deployment on a moderate-risk system where the validation effort is proportionate and the learning is transferable to higher-risk deployments later.

What a validated ML system looks like in practice

A production ML model operating in a GxP pharmaceutical environment with validated status carries a recognisable set of artifacts and controls. The documentation burden is proportionate to the risk, but the core elements are non-negotiable regardless of the risk tier.

The validation documentation includes the intended use statement, the risk assessment (incorporating ML-specific risks identified in Step 3), the validation plan specifying testing approach and acceptance criteria, test execution records covering both scripted and unscripted runs, and a validation summary report with documented pass/fail against the criteria.

The model artifacts sit under version control: the trained model itself (weights and architecture definition), the preprocessing pipeline (feature engineering, normalisation, augmentation logic), the training dataset or documented dataset provenance with reproducibility information, the hyperparameter configuration, and the evaluation metrics on the validation dataset. Tooling here is typically MLflow or DVC for experiment and artifact tracking, ONNX for portable model representation where the production runtime differs from the training environment, and standard Git-based change history for the surrounding code. All artifacts carry a traceable change history.
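The idea behind traceable artifacts can be shown without any particular tool. The sketch below uses only the Python standard library to record a content hash per artifact for one model version; it does not use or imitate the MLflow or DVC APIs, and the file names and version string are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path):
    """Stream a file through SHA-256 so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(model_version, artifact_paths):
    """Map each artifact file name to its content hash for one model version."""
    return {
        "model_version": model_version,
        "artifacts": {Path(p).name: sha256_of(p) for p in artifact_paths},
    }

# Demo with throwaway files standing in for real artifacts (names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "model.onnx"
    config = Path(tmp) / "hyperparameters.json"
    weights.write_bytes(b"placeholder weights")
    config.write_text(json.dumps({"learning_rate": 0.001}))
    manifest = build_manifest("1.4.0", [weights, config])

print(json.dumps(manifest, indent=2))
```

Committing a manifest like this alongside the code gives a reviewer a simple way to confirm that the weights, preprocessing pipeline, and configuration in production are the ones that were validated.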

The continuous monitoring infrastructure is what separates a validated AI system from a validated traditional system. Automated performance tracking compares live metrics against the documented acceptance criteria — accuracy, precision, recall, and domain-specific metrics that map to the failure modes identified in the risk assessment. Statistical drift detection compares production data distribution to the training distribution, typically using population stability index, KL divergence, or Kolmogorov-Smirnov tests on the relevant features. Alert mechanisms fire on performance degradation or drift detection, and a documented response protocol defines who acts on what, within what window.
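A population stability index can be computed in a few lines. The sketch below is a minimal pure-Python version for a single numeric feature, assuming equal-width bins derived from the training data and a small epsilon to avoid taking the logarithm of zero; both choices are assumptions of this example, and the alert threshold should come from the site's own risk assessment.

```python
import math

def proportions(values, edges):
    """Share of values per bin; values outside the edges fall into the first or last bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]

def psi(training, production, bins=10, eps=1e-6):
    """Population stability index: sum of (actual - expected) * ln(actual / expected)."""
    lo, hi = min(training), max(training)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    expected = proportions(training, edges)
    actual = proportions(production, edges)
    return sum(
        (max(a, eps) - max(e, eps)) * math.log(max(a, eps) / max(e, eps))
        for e, a in zip(expected, actual)
    )

# Synthetic feature values: production shifted upward relative to training.
training = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in training]

print(psi(training, training))  # identical distributions give zero
print(psi(training, shifted))   # a shifted distribution gives a clearly positive value
```

The same pattern applies per feature in production: compute the index on a rolling window of inputs, compare it to the documented threshold, and raise an alert when it is exceeded.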

Change control for retraining is its own discipline. Every model retrain triggers a documented process that includes the rationale (new data availability, drift detection, expanded intended use), the training dataset for the new version, performance comparison between new and current production versions, acceptance criteria evaluation, and an approval workflow before the new version enters production. The retrain is treated as a controlled change, not as a maintenance task.
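The performance comparison and acceptance criteria evaluation can be automated as a gate that runs before the approval workflow. The sketch below is one possible shape, with assumed metric names, assumed minimums, and an assumed tolerance for regression against the production version.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float

def retrain_gate(candidate, production, minimum, max_regression=0.0):
    """Return (eligible_for_approval, reasons). An empty reasons list means eligible."""
    reasons = []
    for name in ("precision", "recall"):
        cand = getattr(candidate, name)
        prod = getattr(production, name)
        floor = getattr(minimum, name)
        if cand < floor:
            reasons.append(f"{name} {cand:.3f} is below acceptance minimum {floor:.3f}")
        if prod - cand > max_regression:
            reasons.append(f"{name} regressed from {prod:.3f} to {cand:.3f}")
    return (not reasons, reasons)

# Assumed evaluation results on the same held-out validation dataset.
production = Metrics(precision=0.97, recall=0.96)
candidate = Metrics(precision=0.98, recall=0.95)
minimum = Metrics(precision=0.95, recall=0.95)

eligible, reasons = retrain_gate(candidate, production, minimum, max_regression=0.005)
print(eligible, reasons)
```

In this example the candidate meets both minimums but loses a full point of recall against production, which exceeds the assumed tolerance, so it is not eligible. Passing the gate only makes a candidate eligible for the approval workflow; it does not replace the documented human approval.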

Finally, the audit trail: every model inference in the GxP context is logged with timestamp, model version, input data reference, output (prediction or classification), confidence score, and whether the output was accepted or overridden by a human operator. This is what regulatory auditors expect to find when they walk into the room.
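One way to represent such a record is a small immutable structure serialised as one JSON line per inference. The field names, the allowed human_decision values, and the example input path below are assumptions for illustration; tamper-evident, append-only storage is a separate concern that this sketch does not cover.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class InferenceRecord:
    timestamp: str
    model_version: str
    input_ref: str
    output: str
    confidence: float
    human_decision: Optional[str]  # "accepted", "overridden", or None if not yet reviewed

def make_record(model_version, input_ref, output, confidence, human_decision=None):
    """Build one audit record stamped with the current UTC time."""
    if human_decision not in (None, "accepted", "overridden"):
        raise ValueError("human_decision must be 'accepted', 'overridden', or None")
    return InferenceRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        model_version=model_version,
        input_ref=input_ref,
        output=output,
        confidence=confidence,
        human_decision=human_decision,
    )

def to_log_line(record):
    """Serialise a record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)

# Hypothetical inference on one inspection image, overridden by the inspector.
record = make_record("1.4.0", "images/batch-0421/vial-0093.png", "fail", 0.87, "overridden")
line = to_log_line(record)
print(line)
```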

30-day GAMP 5 AI/ML validation fast-start

A moderate-risk first deployment can move from policy gap to validated operational state in 30 days when the effort is structured around the risk-based methodology described above. The schedule below is what we recommend as a starting template — not a guarantee, since training data quality and existing change-control maturity can stretch or compress any of the weeks.

Week 1 — Risk classification and intended use definition. Write the intended use statement for the target AI/ML system, bounding the validation scope to what the system is intended to do. Complete the three-dimension GxP impact assessment (product quality, patient safety, data integrity). Identify the ML-specific risks: training data representativeness, model drift exposure, retraining frequency, and explainability requirements.
Week 2 — Validation planning and acceptance criteria. Design the risk-proportionate validation approach: define scripted test cases for high-risk failure modes and unscripted testing protocols for boundary exploration, adversarial inputs, and sliced evaluation across data subsets. Document acceptance criteria for accuracy, precision, recall, and domain-specific metrics. Draft the validation plan linking each test to the risks identified in Week 1.
Week 3 — Test execution and monitoring infrastructure. Execute the scripted and unscripted test protocols against the model. Deploy continuous monitoring: automated performance tracking against the documented acceptance criteria, statistical drift detection comparing production data distribution to training data distribution, and alert mechanisms for degradation. Configure the audit trail to log every inference with model version, input reference, output, confidence score, and human override status.
Week 4 — Change control, documentation, and operational handoff. Implement the change control procedure for model retraining: documented rationale, dataset provenance, performance comparison, acceptance criteria evaluation, and approval workflow. Compile the validation summary report with pass/fail results. Place all model artifacts (weights, preprocessing pipeline, hyperparameter configuration, training dataset provenance) under version control with traceable change history.

The methodology for getting from no ML validation experience to this operational state is best learned on a moderate-risk first deployment — one where the GxP impact is real but bounded, the validation effort produces transferable templates, and the continuous monitoring infrastructure becomes reusable across subsequent deployments. If your pharma AI use cases are identified but the validation pathway for the first GxP deployment is not yet defined, a GxP Regulatory Scope Analysis produces the classification and validation approach per system.

FAQ

How is AI/ML software classified under GAMP 5 — Category 3, 4, 5, or something new?

Under GAMP 5 Second Edition, AI/ML software is not forced into a single legacy category. The Second Edition shifts from category-based to risk-based classification: the system’s intended use, GxP impact (product quality, patient safety, data integrity), and ML-specific risks determine the validation intensity. A retrained ML model can carry traits of Category 5 (custom-developed) but requires controls that the original Category 5 prescription did not anticipate, particularly continuous monitoring and retraining change control.

What does a GAMP 5 validation lifecycle look like for a continuously retrained AI model?

It extends the traditional V-model with continuous validation activities. Initial validation establishes intended use, risk assessment, acceptance criteria, and a baseline performance record under both scripted and unscripted testing. After release, the model is under continuous performance monitoring against the documented criteria, with statistical drift detection on the input distribution. Every retrain is a controlled change: documented rationale, new training dataset, performance comparison against the current production version, acceptance criteria evaluation, and approval before deployment.

Why is continuous validation needed for AI/ML, and how does it differ from one-shot validation?

Traditional validation is a point-in-time event: evidence collected once remains valid until the code changes. ML model behaviour changes whenever the training data changes or the production data distribution drifts away from the training distribution, and neither requires a code change. Continuous validation makes the monitoring infrastructure part of the validated system: documented acceptance criteria, automated tracking against those criteria, drift detection, and a defined response protocol when alerts fire.

What evidence is required at each GAMP 5 V-model phase when the system under test is a model?

The intended use statement and risk assessment replace the user requirements specification as the scope-bounding artifacts. Functional and design specifications cover the model architecture, preprocessing pipeline, training dataset provenance, and hyperparameter configuration. Test execution evidence includes scripted test results plus unscripted exploration of class boundaries, adversarial inputs, and sliced evaluation across data subsets. Operational evidence is the continuous monitoring record, the audit trail of inferences, and the change control history for retrains.

How do GAMP 5’s risk-based controls map onto AI-specific risks (data drift, hallucination, training-data quality)?

Each ML-specific risk maps to a control: training data quality maps to dataset provenance documentation, labelling consistency audits, and bias review; data drift maps to statistical monitoring of the production input distribution against the training distribution; hallucination or out-of-distribution failure maps to confidence-score thresholds, human-in-the-loop review for low-confidence outputs, and unscripted testing of edge cases. Each control links back to the risk in the validation plan, so an auditor can trace from a risk to the test and to the operational monitoring.
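The confidence-score threshold control is the simplest of these to illustrate. The sketch below routes each output either onward or to human review; the 0.90 threshold, the route names, and the vial identifiers are assumptions, and the real threshold would be justified in the risk assessment.

```python
def route_output(confidence, threshold):
    """Return 'human_review' for low-confidence outputs, otherwise 'proceed'."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return "proceed" if confidence >= threshold else "human_review"

THRESHOLD = 0.90  # assumed value for the example

# Hypothetical (input reference, confidence score) pairs from one inspection run.
outputs = [("vial-0091", 0.99), ("vial-0092", 0.62), ("vial-0093", 0.90)]
routed = {ref: route_output(conf, THRESHOLD) for ref, conf in outputs}
print(routed)
```

Logging the route taken with each inference lets an auditor trace from the out-of-distribution risk to the control that handled it.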

Where does the ISPE GAMP AI guidance change the classic GAMP 5 categorisation for ML software?

The ISPE GAMP guidance for AI/ML keeps the underlying risk-based principle of the Second Edition but adds AI-specific layers: an explicit maturity model (awareness, defined, managed), explicit recognition of training data as a validated artifact, and explicit treatment of continuous validation as a lifecycle requirement rather than a post-release operational concern. It does not replace the GAMP 5 categories; it reframes how those categories apply when the software learns.