A radiologist asks a fair question of every imaging-AI vendor demo: “How do I know this holds on my scanners, my patients, my edge cases?” The honest answer is that a polished accuracy number on a slide tells you almost nothing about whether the model holds in a clinic. A clinical-grade validation engagement is the structured work that turns “the model scored 0.94 on a test set” into a defensible claim about a specific intended use, on a specific population, under specific conditions, with an audit trail that survives a regulator’s reading.

That is the gap most teams underestimate. The engineering to train a competent segmentation or classification model is often the smaller half of the problem. The larger half is establishing that the model does what you say it does, only what you say it does, and that you can prove it. Validation is not a phase you bolt onto the end. It is the spine the whole project hangs from.

Why “It Works in the Notebook” Is Not Validation

The most common failure class we see is silent population drift between the development set and the deployment reality. A model trained and evaluated on curated, well-labelled images from two or three sites can collapse on a fourth site with a different scanner vendor, a different reconstruction kernel, a different patient demographic, or a different acquisition protocol. Nothing in the notebook metrics warns you. The accuracy number was real; it was just answering a question nobody in the clinic was asking.

The reframe is that a validation claim is always scoped. It is never “this model is accurate.” It is “this model achieves sensitivity within these bounds, for this finding, on images acquired under these conditions, for this population, with this human-in-the-loop arrangement.” Everything outside that scope is undefined behaviour. A clinical-grade engagement spends much of its effort defining and defending that scope, because the scope is the product claim.
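
One way to make the idea of a scoped claim concrete is to write the scope down as data and check incoming cases against it. The sketch below is a minimal illustration, not a prescribed schema: the `ValidationScope` class, its field names, the vendor names, and the age bounds are all hypothetical, and a real intended-use statement would carry far more conditions (acquisition protocol, reconstruction kernel, and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationScope:
    """Hypothetical, simplified encoding of a scoped validation claim."""
    finding: str
    modality: str
    scanner_vendors: frozenset
    min_age_years: int
    max_age_years: int
    human_role: str

    def out_of_scope_reasons(self, case: dict) -> list:
        """Return the fields that put a case outside the validated scope (empty if in scope)."""
        reasons = []
        if case.get("modality") != self.modality:
            reasons.append("modality")
        if case.get("scanner_vendor") not in self.scanner_vendors:
            reasons.append("scanner_vendor")
        age = case.get("age_years")
        # A missing age is treated as out of scope: the claim says nothing about it.
        if age is None or not (self.min_age_years <= age <= self.max_age_years):
            reasons.append("age_years")
        return reasons


# Illustrative values only; none of these come from a real study.
scope = ValidationScope(
    finding="suspected intracranial haemorrhage",
    modality="CT",
    scanner_vendors=frozenset({"VendorA", "VendorB"}),
    min_age_years=18,
    max_age_years=90,
    human_role="triage aid, radiologist reads every study",
)

in_scope_case = {"modality": "CT", "scanner_vendor": "VendorA", "age_years": 64}
unseen_vendor_case = {"modality": "CT", "scanner_vendor": "VendorC", "age_years": 64}

print(scope.out_of_scope_reasons(in_scope_case))
print(scope.out_of_scope_reasons(unseen_vendor_case))
```

The design choice worth noting is that the check returns reasons rather than a bare boolean, so a case from an unvalidated scanner vendor is recorded as such instead of being silently scored.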
This is also where imaging AI diverges sharply from general software validation. A deterministic computerised system either computes the right value or it does not, and you can test boundary conditions exhaustively. A learned model produces a probability over an effectively infinite input space. You cannot enumerate the inputs, so you validate against representative, stratified evidence and you bound the conditions under which the evidence holds. The discipline that governs this for regulated software is closer to how AI/ML software is classified and validated under GAMP 5 than to classic unit testing.

The Five Pillars of a Clinical-Grade Imaging Validation Engagement

A validation engagement that holds under scrutiny rests on five interlocking pillars. Skip any one and the others lose their footing: an immaculate performance study built on undocumented ground truth proves nothing, and a perfect audit trail over an unscoped intended use documents the wrong thing precisely.

| Pillar | Core question it answers | What “done” looks like | Failure if skipped |
| --- | --- | --- | --- |
| Intended use definition | What clinical task, what finding, what population, what human role? | A written, bounded statement of use that the performance study is designed to test | Validation tests the wrong thing; scope creep in the field |
| Data lineage & representativeness | Where did every image come from, and does the set reflect deployment reality? | Documented provenance, scanner/site/demographic stratification, declared exclusions | Population drift; the curated-set illusion |
| Ground truth establishment | How was each label decided, and how reliable is it? | Defined adjudication protocol, inter-reader agreement, reference standard rationale | You measure agreement with noise, not truth |
| Performance characterisation | How does the model perform, with what uncertainty, across which strata? | Stratified metrics with confidence intervals, failure-mode analysis, operating point rationale | A single headline number that hides where it breaks |
| Audit trail & change control | Can you reconstruct every decision and detect drift later? | Versioned data, model, and protocol; documented decisions; monitoring plan | Unprovable claims; no defence under inspection |

Each pillar is a body of work, not a checkbox. The sections below take the three that teams most often underestimate.

How Do You Establish Ground Truth You Can Defend?

Ground truth is where many imaging projects quietly lose their footing. The seductive shortcut is to treat existing labels (a single reading, a discharge code, a prior report) as truth. In practice those labels carry the variance of the human or process that produced them, and a model trained to match noisy labels learns the noise.

A defensible ground truth process names its reference standard and justifies it. For some findings the reference is histopathology or a follow-up outcome; for others it is a panel of readers with a defined adjudication rule when they disagree. In our experience, the single most informative artefact a team can produce is the inter-reader agreement statistic on the same cases, because it sets the ceiling on what the model can credibly claim (observed pattern across imaging engagements; not a published benchmark). If three radiologists agree only moderately on a finding, a model reporting near-perfect agreement with one reader is almost certainly fitting that reader’s idiosyncrasies, not the disease.

The practical consequence: the ground truth protocol must be written and locked before the performance study, not reverse-engineered to make the numbers look good. That ordering is the difference between evidence and storytelling.

What Does Honest Performance Characterisation Require?

A single accuracy or AUC figure is a marketing artefact, not a validation result.
Clinical-grade characterisation does four things a slide deck rarely does. First, it reports performance with uncertainty: confidence intervals scaled to the actual sample size, not a point estimate from a few hundred cases dressed up as certainty. Second, it stratifies: performance is broken out by scanner vendor, site, disease subtype, patient demographic, and any axis where the population is heterogeneous, because aggregate numbers routinely hide a subgroup where the model fails. Third, it analyses failure modes directly (not just how often the model is wrong but when and why), because a model that fails on rare-but-critical findings is more dangerous than its overall error rate suggests. Fourth, it justifies the operating point: the chosen threshold trades sensitivity against specificity, and that trade-off has clinical consequences that belong in the validation record, not in a config file.

This is also where the boundary between engineering validation and regulatory clearance becomes concrete. Establishing that a model performs within bounds is engineering work; establishing that it is safe and effective for marketing as a medical device is a regulatory determination. We draw that line carefully: see where validation ends and clearance begins and, for the imaging-specific version, where engineering validation stops under FDA medical device regulation. A validation engagement produces the evidence; it does not, by itself, make a device claim.

Why Data Lineage Decides Whether the Whole Study Means Anything

Representativeness is the quiet determinant of whether a beautiful performance study generalises. If the validation set was drawn from the same three sites as training, the study measures memorisation of those sites’ characteristics as much as it measures clinical capability.
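
The site-overlap concern can be turned into a simple, checkable number. The sketch below assumes each case is a dictionary with hypothetical `case_id` and `site` keys; it separates validation cases whose site also appears in training from those that come from sites the model has never seen. The site names and counts are made up for illustration.

```python
def site_overlap(train_manifest, validation_manifest, key="site"):
    """Split validation cases by whether their site (or other key) also appears in training."""
    train_values = {row[key] for row in train_manifest}
    shared = [row for row in validation_manifest if row[key] in train_values]
    external = [row for row in validation_manifest if row[key] not in train_values]
    return shared, external


# Illustrative manifests; real ones would carry scanner, protocol, and date fields too.
train = [
    {"case_id": "t1", "site": "site_a"},
    {"case_id": "t2", "site": "site_b"},
]
validation = [
    {"case_id": "v1", "site": "site_a"},
    {"case_id": "v2", "site": "site_c"},
    {"case_id": "v3", "site": "site_c"},
]

shared, external = site_overlap(train, validation)
external_fraction = len(external) / len(validation)
print(f"{len(external)}/{len(validation)} validation cases come from sites unseen in training")
```

The same function can be run with `key` set to a scanner-vendor or protocol field, since any axis shared between training and validation raises the same question.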
A clinical-grade engagement documents provenance for every image (acquisition site, scanner make and model, reconstruction parameters, date range) and stratifies the validation set so that deployment conditions are actually represented.

The tooling here is unglamorous but decisive: dataset versioning so that “the validation set” is an immutable, hash-identified artefact; a manifest linking each case to its provenance and its ground-truth decision; and a clear record of exclusions, because what you removed from the set shapes the claim as much as what you kept. Frameworks like PyTorch or MONAI handle the modelling; the lineage discipline lives in the data layer and the documentation around it, and it is where DICOM metadata, de-identification, and protocol normalisation all have to be reconciled before a single metric is computed.

A Worked Scoping Example

Consider a model intended to flag suspected intracranial haemorrhage on non-contrast head CT, as a prioritisation aid with a radiologist always in the loop. A clinical-grade engagement would, for example, fix the intended use to that finding, that modality, and that triage role, explicitly not diagnosis and not other findings. It would assemble a validation set stratified across at least the major scanner vendors and a demographic spread reflecting the target sites, with provenance recorded per case. Ground truth would be set by a reading panel with a defined adjudication rule and reported inter-reader agreement. Performance would be reported as stratified sensitivity and specificity with confidence intervals at a justified operating point tuned for high sensitivity given the triage role, plus a failure-mode analysis of the missed cases. Every artefact (data manifest, model version, protocol) would be versioned and traceable.
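
To show what "stratified sensitivity with confidence intervals" can look like in practice, here is a minimal sketch using the Wilson score interval, one common choice for binomial proportions and not necessarily the method a given study would adopt. The case fields (`vendor`, `reference_positive`, `model_flagged`) and all counts are synthetic assumptions chosen only to illustrate how interval width tracks stratum size.

```python
import math
from collections import defaultdict


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("no cases in stratum")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def stratified_sensitivity(cases, stratum_key):
    """Per-stratum sensitivity with Wilson intervals, computed over reference-positive cases."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [true positives, reference positives]
    for case in cases:
        if case["reference_positive"]:
            counts[case[stratum_key]][1] += 1
            if case["model_flagged"]:
                counts[case[stratum_key]][0] += 1
    return {s: (tp / n, wilson_interval(tp, n), n) for s, (tp, n) in counts.items()}


def make_cases(vendor, positive, flagged, count):
    """Build synthetic cases; purely illustrative data."""
    return [
        {"vendor": vendor, "reference_positive": positive, "model_flagged": flagged}
        for _ in range(count)
    ]


cases = (
    make_cases("VendorA", True, True, 45)
    + make_cases("VendorA", True, False, 5)
    + make_cases("VendorB", True, True, 8)
    + make_cases("VendorB", True, False, 2)
    + make_cases("VendorA", False, False, 20)  # reference-negative cases do not affect sensitivity
)

results = stratified_sensitivity(cases, "vendor")
for stratum, (sens, (low, high), n) in sorted(results.items()):
    print(f"{stratum}: sensitivity {sens:.2f} (95% CI {low:.2f} to {high:.2f}, n={n})")
```

With these synthetic counts the smaller stratum has a visibly wider interval than the larger one, which is the point of reporting intervals per stratum rather than a pooled figure. Specificity would follow the same pattern over reference-negative cases.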
The example above is illustrative framing for the shape of the work, not results from a specific study, but the shape is the one our clinical-grade life-sciences AI validation engagements are built around. The point is the structure: each design choice traces back to the intended-use statement, and the intended-use statement is what the regulator and the clinic both read first. This same evidentiary discipline underpins how computer-aided diagnosis software holds or fails on the validation question, which is the adjacent territory where a flagging aid becomes a diagnostic claim.

FAQ

How is validating a medical imaging AI model different from validating ordinary software?

Deterministic software can be tested exhaustively at its boundaries because it computes the same output for the same input. A learned imaging model produces probabilities over an effectively infinite input space, so you cannot enumerate cases. Instead you validate against representative, stratified evidence and explicitly bound the conditions (scanner, population, acquisition protocol, human role) under which the claim holds.

Why isn’t a high accuracy or AUC score enough to call a model validated?

A single headline number reports aggregate performance on one dataset and hides where the model fails. Clinical-grade characterisation requires uncertainty bounds scaled to the sample size, stratification by scanner, site, and subgroup, direct failure-mode analysis, and a justified operating point, because a model that misses rare-but-critical findings can be more dangerous than its overall error rate implies.

What makes ground truth defensible in an imaging validation study?

A defensible process names and justifies its reference standard (histopathology, follow-up outcome, or an adjudicated reader panel with a defined disagreement rule), rather than treating existing single readings as truth.
It reports inter-reader agreement, which sets the ceiling on what the model can credibly claim, and it locks the ground-truth protocol before the performance study rather than reverse-engineering it.

Where does engineering validation end and regulatory clearance begin?

Engineering validation establishes that a model performs within stated bounds on representative evidence with a complete audit trail. Regulatory clearance is the separate determination that the device is safe and effective to market for a stated intended use. A validation engagement produces the evidence regulators read; it does not by itself constitute a device claim.

The harder question, once the five pillars are in place, is the one that outlives the engagement: how will you know when the model has drifted away from the population it was validated against? A validation study is a snapshot, and clinical populations, scanners, and protocols all move. The teams that hold up under inspection are the ones that treat the validation record not as a finish line but as the baseline against which every later month of deployment is measured.
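
As one hedged illustration of measuring deployment against the validation baseline, the sketch below compares the scanner-vendor mix of a deployment month with the mix in the validation set, using total variation distance. The vendor names, counts, and alert threshold are hypothetical, and this checks only input-population drift on one axis; it says nothing by itself about whether model performance has changed.

```python
from collections import Counter


def category_shares(values):
    """Convert a list of category labels into a dict of proportions."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(baseline, current):
    """Half the L1 distance between two categorical mixes: 0 means identical, 1 means disjoint."""
    p = category_shares(baseline)
    q = category_shares(current)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Synthetic example: a vendor absent from the validation baseline appears in deployment.
baseline_vendors = ["VendorA"] * 60 + ["VendorB"] * 40
month_vendors = ["VendorA"] * 30 + ["VendorB"] * 40 + ["VendorC"] * 30

tvd = total_variation_distance(baseline_vendors, month_vendors)

ALERT_THRESHOLD = 0.2  # hypothetical; a real threshold belongs in the monitoring plan
if tvd > ALERT_THRESHOLD:
    print(f"Scanner mix has shifted from the validation baseline (TVD = {tvd:.2f})")
```

The same comparison could be repeated for site, protocol, or demographic fields, with the thresholds and the response to an alert written into the change-control record rather than left in code.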