GAMP Software Categories Explained: What Each Category Means for Pharma Validation

Four categories, four levels of validation effort

The GAMP 5 framework classifies pharmaceutical software into categories that determine the appropriate level of validation effort. Classification turns on two factors: configurability (how much the user can customise behaviour) and the amount of custom development involved. Higher categories require more documentation and more thorough testing — not because the software is more important, but because the burden of demonstrating control shifts from the vendor to the user.

Category 2 (firmware) was retired when GAMP 5 superseded GAMP 4, and it has not returned in the current edition. Firmware is now classified under Category 1 (infrastructure) or Category 3 (non-configured products), depending on its role.

Category	What it covers	Validation effort	Typical AI/ML mapping
1	Infrastructure (OS, DB, runtime, drivers)	IQ only; rely on vendor evidence	CUDA driver, container runtime, ML framework binaries
3	Non-configured commercial products	Verify intended use; test critical functions	Pre-trained foundation model used as delivered
4	Configured commercial products	Document configuration; test configured paths	LIMS / MES with GxP modules; configured AutoML platform
5	Custom-developed software	Full lifecycle validation with full traceability	Training pipeline, inference service, facility-specific model weights

Category 1 — Infrastructure software

Infrastructure software provides the computing environment on which GxP applications run: operating systems, database management systems, middleware, virtualisation platforms, network firmware. Category 1 systems are not GxP applications themselves; they support GxP applications.

Validation approach. Installation qualification (IQ) — verify correct installation and configuration. No functional testing of the infrastructure software itself; the vendor is responsible for that. Document the version, patch level, and configuration settings, then move on.
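
An IQ check of this kind reduces to comparing what is installed against an approved baseline. The following is a minimal sketch, not a prescribed method. The item names and version strings are hypothetical, and in practice the observed values would come from the package manager or vendor tooling rather than a hand-written dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IQFinding:
    item: str
    expected: str
    observed: str


def run_iq(baseline: dict[str, str], observed: dict[str, str]) -> list[IQFinding]:
    """Return one finding per baseline item whose observed value differs or is missing."""
    findings = []
    for item, expected in baseline.items():
        actual = observed.get(item, "<missing>")
        if actual != expected:
            findings.append(IQFinding(item, expected, actual))
    return findings


# Hypothetical values, for illustration only.
baseline = {"os_release": "9.4", "cuda_driver": "550.54", "container_runtime": "26.1"}
observed = {"os_release": "9.4", "cuda_driver": "550.90", "container_runtime": "26.1"}

iq_findings = run_iq(baseline, observed)
for f in iq_findings:
    print(f"DEVIATION {f.item}: expected {f.expected}, observed {f.observed}")
```

An empty findings list is the evidence that the installation matches the documented baseline; anything else is a deviation to resolve before moving on.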

Examples. Windows Server, Red Hat Enterprise Linux, Oracle Database, VMware ESXi, Docker runtime, NVIDIA CUDA drivers.

Category 3 — Non-configured products

Software used exactly as delivered by the vendor, without user configuration of business logic or workflow. The user installs it and uses it within its intended purpose.

Validation approach. Verify the software is used within its declared scope. Review vendor documentation. Test the critical functions relevant to your intended use. Do not test generic vendor functionality that is not part of your application — that is wasted effort and creates evidence the auditor will not ask for.

Examples. Scientific calculators, standard analytical instruments with embedded firmware, simple data loggers, reference databases.

Category 4 — Configured products

Commercial software configured by the user to meet specific business requirements. The software functionality exists in the product; the user enables, configures, or customises features to match the process.

Validation approach. Document the configuration. Test the configured features against requirements. Verify that configuration changes produce the intended behaviour. Leverage vendor testing for core functionality so user testing can focus on the configured aspects — that proportionality is the whole point of having a Category 4 in the first place.

Examples. SAP GxP modules, LabWare LIMS, Emerson DeltaV, Siemens SIMATIC, Veeva Vault Quality.

Category 5 — Custom applications

Software developed specifically for the intended use. This covers both fully custom-built applications and significant customisations to commercial platforms that involve writing new code.

Validation approach. Full lifecycle validation — requirements specification, design documentation, code review, unit testing, integration testing, system testing, user acceptance testing. Complete traceability from requirements through to test evidence. Any code path that affects a GxP outcome must be reachable from a traced test.
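
The traceability requirement can be checked mechanically. The sketch below assumes a simplified traceability matrix in which each test case lists the requirement IDs it covers. The ID formats (`URS-…`, `TC-…`) and the record layout are hypothetical, not taken from any particular tool.

```python
def untraced_requirements(requirements, tests):
    """Return requirement IDs that no passing test covers.

    requirements: iterable of requirement IDs
    tests: list of dicts with keys 'id', 'covers' (list of requirement IDs), 'passed' (bool)
    """
    covered = set()
    for test in tests:
        if test["passed"]:
            covered.update(test["covers"])
    return sorted(set(requirements) - covered)


# Hypothetical matrix, for illustration only.
requirements = ["URS-001", "URS-002", "URS-003"]
tests = [
    {"id": "TC-01", "covers": ["URS-001"], "passed": True},
    {"id": "TC-02", "covers": ["URS-002"], "passed": False},  # a failed test gives no evidence
]

gaps = untraced_requirements(requirements, tests)
print("Requirements without passing test evidence:", gaps)
```

Here URS-002 counts as a gap because its only test failed, and URS-003 because nothing covers it at all.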

Examples. Custom manufacturing control systems, bespoke analytical data processing tools, ML models trained on facility-specific data, custom integration middleware.

Why classification is a system problem, not a product problem

A common error is to ask “what category is this software?” as if the answer were a property of the box on the shelf. It is not. Classification is a property of the system the software participates in — and most GxP-relevant AI/ML systems are multi-component, with parts that legitimately sit in different categories at the same time.

A practical example. A computer-vision system that inspects vials on a fill line might involve: NVIDIA CUDA drivers and a container runtime (Category 1); a pre-trained convolutional backbone used as delivered, with no weight changes (Category 3); a commercial MLOps platform configured with your data sources, scheduling policies, and approval gates (Category 4); and a head model fine-tuned on facility-specific images plus the inference service that decides accept/reject (Category 5). Calling the whole thing “a Category 5 system” is technically defensible but operationally wasteful — every component then inherits full custom-software validation effort that the lower-risk components did not need.

In our experience supporting GxP audits, inspectors are comfortable with component-level classification provided the boundaries are documented and the interfaces between components are themselves treated as testable surfaces. The classification artefact becomes a map: each box names its category, its vendor evidence, and its validation deliverables.
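
One way to hold that map is as structured data rather than a diagram. The sketch below encodes the vial-inspection example. The component names follow the example above, while the vendor-evidence labels and deliverable names are placeholders chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import IntEnum


class GampCategory(IntEnum):
    INFRASTRUCTURE = 1
    NON_CONFIGURED = 3
    CONFIGURED = 4
    CUSTOM = 5


@dataclass(frozen=True)
class Component:
    name: str
    category: GampCategory
    vendor_evidence: str            # placeholder label for where vendor evidence is filed
    deliverables: tuple[str, ...]   # placeholder names for validation deliverables


vial_inspection = [
    Component("CUDA driver", GampCategory.INFRASTRUCTURE, "vendor release notes", ("IQ record",)),
    Component("container runtime", GampCategory.INFRASTRUCTURE, "vendor release notes", ("IQ record",)),
    Component("pre-trained backbone", GampCategory.NON_CONFIGURED, "vendor model documentation",
              ("intended-use verification", "critical-function tests")),
    Component("MLOps platform", GampCategory.CONFIGURED, "vendor test summary",
              ("configuration specification", "configured-path tests")),
    Component("fine-tuned head model", GampCategory.CUSTOM, "none (built in-house)",
              ("requirements", "traceability matrix", "performance qualification")),
    Component("inference service", GampCategory.CUSTOM, "none (built in-house)",
              ("requirements", "design", "code review", "traceability matrix")),
]


def group_by_category(components):
    """Group component names by GAMP category, preserving input order."""
    grouped = defaultdict(list)
    for component in components:
        grouped[component.category].append(component.name)
    return dict(grouped)


classification_map = group_by_category(vial_inspection)
for category in sorted(classification_map):
    print(f"Category {category.value}: {', '.join(classification_map[category])}")
```

Keeping the map as data makes it easy to diff between releases and to check that every component has a category and a deliverables list.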

What does each GAMP category mean for ML and AI software?

ML models trained on company data are Category 5 custom applications, even if they are built on a commercial framework like PyTorch or TensorFlow. The framework binaries are Category 1 infrastructure. A pre-trained model used as delivered (no further training, no weight changes) can sit in Category 3. The training pipeline, the training data handling, and the resulting model weights are Category 5 because they exist only because you built them.

The deeper challenge is that traditional GAMP categorisation assumes deterministic, fully specified software — Category 5 validation is built around testing every declared requirement, and Category 4 testing assumes the configuration determines the output. ML systems produce outputs that depend on training data and may shift with retraining. Two strategies follow from this. First, draw the box around the deterministic surfaces: the data pipeline, the inference service, the decision logic, and the model registry are conventional Category 5 software and validate cleanly with standard methods. Second, validate the model itself through performance qualification — demonstrate against predefined acceptance criteria (accuracy, sensitivity, specificity, calibration) on a representative test dataset, and freeze that evidence to a specific model version.
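
The second strategy can be sketched as a small performance-qualification check. This is an illustrative sketch for a binary accept/reject model only: the threshold values and version string are hypothetical, calibration is omitted for brevity, and the test set is assumed to contain both classes.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = reject, 0 = accept)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def run_pq(model_version, y_true, y_pred, criteria):
    """Compare metrics against predefined minimums and bind the result to one model version."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("test set must contain both classes")
    metrics = {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
    failures = sorted(name for name, minimum in criteria.items() if metrics[name] < minimum)
    return {"model_version": model_version, "metrics": metrics,
            "passed": not failures, "failures": failures}


# Hypothetical acceptance criteria and labels, for illustration only.
criteria = {"accuracy": 0.75, "sensitivity": 0.90, "specificity": 0.80}
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

report = run_pq("vial-head-1.3.0", y_true, y_pred, criteria)
print(report["passed"], report["failures"])
```

The criteria are passed in rather than computed, which reflects the point that they must be fixed before the evaluation runs. The returned record carries the model version so the evidence cannot drift away from the weights it describes.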

The GAMP 5 classification guide for AI/ML software treats these decisions in more depth. It covers how the GAMP 5 Second Edition and the emerging ISPE AI guidance reframe them, including multi-category classification and continuous validation for non-deterministic systems.

How does change control work when the model retrains?

Change control is where the category answer stops being academic. For a Category 4 LIMS, a configuration change triggers a documented configuration test. For a Category 5 custom service, a code change triggers regression testing tied to the requirements traceability matrix. For an ML model that retrains, neither pattern fits cleanly — the “change” is a new set of weights produced by a pipeline, not a human edit.

The pattern that has held up under audit is to treat retraining as a controlled release process. The training pipeline is Category 5 software with its own validation. Each model version is registered with its training data manifest, hyperparameters, and validation metrics. Promotion of a new version to production is gated on a predefined acceptance test — the model is deployed only if it meets the criteria on a held-out validation set the production team controls. The gate itself is auditable: who approved the release, against which criteria, with which evidence.
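
A release gate along these lines might be sketched as follows. The field names, version string, thresholds, and approver ID are all hypothetical, and a real registry would persist these records rather than hold them in memory.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelVersion:
    version: str
    data_manifest_sha256: str   # digest of the training data manifest
    hyperparameters: dict
    metrics: dict               # measured on the held-out validation set


def manifest_digest(file_hashes: dict[str, str]) -> str:
    """Hash a {filename: file_hash} manifest in a key-order-independent way."""
    canonical = json.dumps(file_hashes, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def promotion_decision(candidate: ModelVersion, criteria: dict, approver: str) -> dict:
    """Approve only if every criterion is met and a named approver is recorded."""
    unmet = sorted(
        name for name, minimum in criteria.items()
        if candidate.metrics.get(name, float("-inf")) < minimum
    )
    return {
        "version": candidate.version,
        "approved": (not unmet) and bool(approver),
        "unmet_criteria": unmet,
        "criteria": dict(criteria),
        "approver": approver,
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical candidate and criteria, for illustration only.
candidate = ModelVersion(
    version="vial-head-1.4.0",
    data_manifest_sha256=manifest_digest({"batch_a.parquet": "abc123", "batch_b.parquet": "def456"}),
    hyperparameters={"learning_rate": 0.001, "epochs": 20},
    metrics={"sensitivity": 0.97, "specificity": 0.99},
)
criteria = {"sensitivity": 0.95, "specificity": 0.98}

decision = promotion_decision(candidate, criteria, approver="qa.reviewer")
print(decision["approved"], decision["unmet_criteria"])
```

The decision record answers the three audit questions directly: who approved, against which criteria, and with which evidence. A missing metric is treated as a failure, not skipped.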

This is the part where the difference between Category 4 and Category 5 stops being a documentation argument and becomes an operational one. Category 4 thinking encourages teams to view the MLOps platform as the system of record — “the platform handles retraining”. Category 5 thinking forces the question: what did the platform actually do, against which criteria, with what human approval? In GxP environments the second framing is the one the auditor will reach for.

The practical rule

Classify accurately. Validate proportionately. A Category 4 system validated as Category 5 wastes engineering resources on evidence no one will read. A Category 5 system validated as Category 4 creates regulatory exposure the configuration record cannot cover. The purpose of classification is to determine the right level of effort — not the maximum level of effort, and not the minimum that fits the budget.

We use two heuristics when reviewing classification decisions with clients. First, if the answer to “who wrote this code?” is “we did, for this facility”, it is Category 5, regardless of which commercial platform it runs on. Second, if a software component’s behaviour can be changed without writing code (by editing configuration, uploading new reference data, or toggling features), its validation evidence must cover the configuration surface, not just one configured state. That is the test that separates a clean Category 4 from a Category 4 that should have been split into Category 4 + Category 5 components.
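
The second heuristic lends itself to a simple coverage check: list the values each setting is permitted to take and compare them with the values that have test evidence. The sketch below assumes the configuration surface can be enumerated as discrete values; the setting names are hypothetical.

```python
def uncovered_settings(config_surface: dict[str, set], tested: dict[str, set]) -> dict[str, set]:
    """For each configurable setting, return the permitted values that lack test evidence."""
    gaps = {}
    for setting, permitted in config_surface.items():
        missing = permitted - tested.get(setting, set())
        if missing:
            gaps[setting] = missing
    return gaps


# Hypothetical configuration surface, for illustration only.
config_surface = {
    "approval_gate": {"on", "off"},
    "retrain_schedule": {"weekly", "monthly"},
}
tested = {
    "approval_gate": {"on"},                      # only the currently configured state was tested
    "retrain_schedule": {"weekly", "monthly"},
}

config_gaps = uncovered_settings(config_surface, tested)
print(config_gaps)
```

A non-empty result means the evidence covers a configured state, not the configuration surface. Either the untested values need tests, or they need to be locked out so that they are no longer part of the surface.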