A release reviewer opens your perception validation package, scans for the answer to one question (“on which conditions does this model degrade, and by how much?”), and finds a backlog of test tickets instead. That gap is the difference between a single sign-off round and weeks of back-and-forth. The package that gets signed is engineered to the reviewer’s questions. The package that stalls is engineered to the team’s test-tracking conventions, and the two structures rarely line up.

This is the core mistake a perception team makes when a safety-relevant release comes up for review: it treats its internal test backlog as the evidence package. The backlog is organised the way the team works: by sprint, by component, by the order bugs were found. The reviewer, whether internal QA, an OEM customer, or eventually a regulator, reads in a completely different order. They are not auditing your process. They are deciding whether to put their name behind a claim about how the system behaves in the world. A validation package that answers that decision, section by section, is a different artefact from a tidy export of your test management tool.

What a Perception Validation Package Contains That an Internal Test Report Does Not

An internal test report tells you what was tested and whether it passed. A perception validation package tells a reviewer what they are being asked to sign against, what evidence stands behind each claim, and where the boundaries of the claim sit. The structural difference is that the package is organised around assertions about the deployed system, and each assertion carries its supporting evidence inline rather than pointing back into a ticket tracker.

Concretely, a package built for review carries the operational design domain it claims to cover (the conditions, scenarios, and edge cases the perception stack is asserted to handle) and, just as importantly, the conditions it explicitly does not.
It carries the degradation profile: where accuracy falls off, by how much, and against which perturbations. A robustness audit feeds this directly; “What a perception robustness audit tests before you stake a release on your model” covers the input side of that evidence. The package carries the dataset provenance, the metric definitions used (because “accuracy” without a defined detection threshold and IoU criterion is not a reviewable claim), and the residual-risk statement that names what is left unhandled and why that residual is acceptable.

The internal test report has most of this data somewhere. The package is the act of arranging it so the reviewer never has to reconstruct the argument themselves. That arrangement is the work, and it is the work that a backlog export skips.

How Package Sections Are Engineered to a Reviewer’s Questions Rather Than to the Test Backlog

The reviewer arrives with a fixed set of questions, and they are remarkably stable across OEM programmes. Does the system do what it claims? Under which conditions does it stop doing that? How do you know? What happens when it fails? Who is accountable for the residual risk? A package engineered to those questions assigns one section to each, and orders the sections the way the reviewer thinks: claim first, evidence second, boundary third.

A backlog, by contrast, is engineered to the team’s question: what is left to do. Those are orthogonal structures. When you hand a reviewer a backlog and ask them to extract a sign-off decision from it, you are asking them to do the engineering work you skipped: to read every ticket, infer the claim, and assemble the evidence chain in their own head. In our experience across reliability engagements, this is where sign-off cycles balloon (an observed pattern across packages we have reviewed, not a benchmarked figure): the reviewer’s first round is almost entirely “where is the evidence for X,” and every one of those questions is a round-trip.
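One way to see the mismatch between the two structures is to re-index a backlog export by reviewer question and look at what is left empty. The following is a minimal sketch under stated assumptions: the ticket IDs, the `sprint` field, and the `answers` tag are hypothetical and not taken from any particular test-management tool. Only the five questions come from the text above.

```python
from collections import defaultdict

# The five reviewer questions named in the text, in reading order.
REVIEWER_QUESTIONS = [
    "Does the system do what it claims?",
    "Under which conditions does it stop doing that?",
    "How do you know?",
    "What happens when it fails?",
    "Who is accountable for the residual risk?",
]


def regroup_by_question(tickets):
    """Re-index backlog tickets by the reviewer question they support."""
    grouped = defaultdict(list)
    for ticket in tickets:
        for question in ticket.get("answers", []):
            grouped[question].append(ticket["id"])
    # Keep reviewer order, and keep unanswered questions visible as empty lists.
    return {q: grouped.get(q, []) for q in REVIEWER_QUESTIONS}


def unanswered(grouped):
    """Questions with no supporting ticket: likely first-round review queries."""
    return [q for q, ids in grouped.items() if not ids]


# Hypothetical backlog export, organised by sprint as the team works.
tickets = [
    {"id": "PERC-101", "sprint": 14,
     "answers": [REVIEWER_QUESTIONS[0], REVIEWER_QUESTIONS[2]]},
    {"id": "PERC-117", "sprint": 15, "answers": [REVIEWER_QUESTIONS[1]]},
    {"id": "PERC-130", "sprint": 15, "answers": []},
]

grouped = regroup_by_question(tickets)
for question in unanswered(grouped):
    print("No evidence yet for:", question)
```

In this illustrative data, the failure-behaviour and accountability questions come back empty. Each empty entry is a round-trip the reviewer would otherwise have to start.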
The reframe is simple to state and hard to internalise: the validation package is a product whose user is the reviewer, and its specification is the reviewer’s question list. Build it to that spec and the first review round can close in one pass. Build it to the test tracker’s schema and you re-pay the same justification work every release.

What Evidence Does Each Section Need Behind It for an OEM Reviewer?

An OEM reviewer does not accept a claim because the package asserts it. Each section needs evidence that is traceable to a method and reproducible from a stated dataset. The table below is the minimum evidence backing we would expect each section of a perception package to carry before it goes to an OEM.

| Package section | Reviewer question it answers | Evidence required behind it | Evidence class |
|---|---|---|---|
| Operational design domain | What is this system claimed to handle? | Scenario catalogue + the conditions explicitly excluded | observed-pattern |
| Performance claims | Does it do what it claims? | Metric definitions + measured results on a named, versioned test set | benchmark |
| Robustness / degradation | Where does it stop working? | Perturbation suite results, degradation curve per condition | benchmark |
| Failure behaviour | What happens when it fails? | Failure-mode catalogue + fallback/handover behaviour evidence | observed-pattern |
| Residual risk | What is left unhandled, and is that acceptable? | Named residual list + acceptability rationale + accountable owner | observed-pattern |
| Traceability | How do I verify any of the above? | Dataset provenance, version pins, reproduction instructions | benchmark |

Two of these are non-negotiable for an OEM and are exactly the two a backlog export tends to lack: the explicit exclusions in the operational design domain, and the named residual-risk owner. A reviewer cannot sign against a claim whose boundary is implicit. Naming what the system does not cover is not an admission of weakness; it is what makes the positive claim signable.
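The evidence table can also be held as data, so that gaps are found before the reviewer finds them. The sketch below is illustrative only: the `Section` class, the `review_gaps` helper, and the short evidence labels are assumptions introduced here, paraphrasing the table rows rather than reproducing any established schema.

```python
from dataclasses import dataclass, field

# Required evidence per section, paraphrased from the table above.
REQUIRED = {
    "Operational design domain": ("scenario catalogue", "explicit exclusions"),
    "Performance claims": ("metric definitions",
                           "results on named versioned test set"),
    "Robustness / degradation": ("perturbation suite results",
                                 "degradation curve per condition"),
    "Failure behaviour": ("failure-mode catalogue",
                          "fallback/handover evidence"),
    "Residual risk": ("named residual list", "acceptability rationale",
                      "accountable owner"),
    "Traceability": ("dataset provenance", "version pins",
                     "reproduction instructions"),
}


@dataclass
class Section:
    """One package section and the evidence actually attached to it."""
    name: str
    provided: set = field(default_factory=set)

    def missing(self):
        return [item for item in REQUIRED[self.name]
                if item not in self.provided]


def review_gaps(sections):
    """Map each incomplete section to the evidence it still lacks."""
    gaps = {}
    for section in sections:
        lacking = section.missing()
        if lacking:
            gaps[section.name] = lacking
    return gaps


# Hypothetical package: complete except for the two items the text says
# a backlog export tends to lack.
package = [Section(name, set(items)) for name, items in REQUIRED.items()]
by_name = {section.name: section for section in package}
by_name["Operational design domain"].provided.discard("explicit exclusions")
by_name["Residual risk"].provided.discard("accountable owner")

gaps = review_gaps(package)
for name, lacking in gaps.items():
    print(f"{name}: missing {', '.join(lacking)}")
```

A check like this only confirms that an evidence item is present. It says nothing about whether the evidence is good enough; that judgement stays with the reviewer.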
How the Package Travels Across Model Updates Without Being Rewritten

The reason a reviewer-shaped package pays for itself is that its structure is invariant while the model is not. The section list (operational design domain, performance, robustness, failure behaviour, residual risk, traceability) does not change when you retrain. What changes is the evidence inside each section. If the package is engineered to the reviewer’s questions, a model update is a re-population of evidence into a stable frame, and the next review is a delta review: what moved, in which section, and by how much.

A backlog-shaped package has no such invariant. Every release reorganises around whatever was worked on that cycle, so the reviewer cannot diff it against the last version; they re-read it from scratch. That is the same justification work, re-paid every release, and it is the single largest hidden cost of getting the package structure wrong. The discipline here is the same one that governs any reliability artefact: “What a production AI monitoring harness actually contains” makes the same argument for the runtime side, namely that the artefact’s value comes from its structure being stable enough to compare against itself over time.

This is also why the package and the engineering that produces it should not be conflated. The evidence-package build process (how you assemble the dataset provenance, define the metrics, and run the perturbation suite) is covered in “How to build a perception validation evidence package that reviewers trust”. This article is about the artefact’s shape and audience; that one is about producing the contents.

Where Does the Perception Package End and a Regulatory Safety Case Begin?

The perception validation package is an engineering artefact about one component: the perception stack. It bounds and substantiates claims about detection, classification, and degradation.
A regulatory safety case is a system-level argument: it integrates perception with planning, control, redundancy, and the operational concept, and it argues that the whole vehicle function is acceptably safe. The package feeds the safety case; it does not constitute it.

Getting this boundary wrong cuts both ways. A perception team that tries to write the safety case oversteps its component scope. A team that hands over a backlog and expects the OEM’s safety engineers to fill the perception gap underdelivers, and the safety case stalls waiting on perception evidence that should have arrived structured. When the customer is an OEM whose review is governance-grade, the package needs to slot cleanly into that larger argument; the principles for that are in “Approval-grade evidence: engineering AI for audit, procurement, and regulated review”. The clean handoff is the package presenting perception claims, boundaries, and residual risk in a form the safety engineer can lift directly into the system argument.

How Does the Package Map onto the ADAS Automation Level It Is Released For?

Reviewer expectations are not constant across automation levels, and a package built for one level does not automatically satisfy the reviewer for the next. At Level 2, the human driver is the fallback, so the reviewer is largely concerned with whether the perception claims and their boundaries are honestly stated; the residual risk lands on a supervising driver. At Level 3, the system itself must handle the dynamic driving task within its operational design domain and manage the handover, so the reviewer scrutinises the failure-behaviour and handover sections far more heavily, and the residual-risk section must be defensible without a human backstop inside the domain.

The practical consequence: the section list is invariant across levels, but the evidence bar per section rises sharply between L2 and L3, concentrated in failure behaviour and residual risk.
A package that was signable at L2 will come back with questions at L3 precisely in those sections. Stating which level the package targets, up front, is part of engineering it to the reviewer’s question, because the reviewer’s question itself changes with the level.

FAQ

What does an automotive perception validation package contain that an internal test report does not?
An internal test report records what was tested and whether it passed. The package adds the structure a reviewer needs to make a sign-off decision: the operational design domain with its explicit exclusions, a degradation profile, defined metrics, dataset provenance, and a residual-risk statement with a named owner. The data may exist in the report; the package is the act of arranging it around the reviewer’s decision.

How are package sections engineered to a reviewer’s questions rather than to the test backlog?
A backlog is organised by how the team works: sprint, component, bug order. A reviewer reads in a fixed order: claim, evidence, boundary, failure behaviour, accountability. The package assigns one section to each reviewer question and orders them the reviewer’s way, so the reviewer never has to reconstruct the argument from tickets. Treating the package as a product whose user is the reviewer is the reframe.

What evidence does each section need behind it for an OEM reviewer?
Each section needs evidence traceable to a method and reproducible from a named, versioned dataset: metric definitions for performance claims, a perturbation suite for robustness, a failure-mode catalogue for failure behaviour. Two backings an OEM treats as non-negotiable, and that backlog exports tend to lack, are the explicit exclusions in the operational design domain and a named residual-risk owner.

How does the package travel across model updates without being rewritten?
The section list is invariant; only the evidence inside each section changes when you retrain.
That makes a model update a re-population into a stable frame and the next review a delta review: what moved, in which section, by how much. A backlog-shaped package has no invariant, so every release reorganises and the reviewer re-reads from scratch.

Where does the perception package end and a regulatory safety case begin?
The package is a component-level artefact substantiating claims about the perception stack. A safety case is a system-level argument integrating perception with planning, control, redundancy, and the operational concept to argue the whole vehicle function is acceptably safe. The package feeds the safety case but does not constitute it; the clean handoff is perception claims, boundaries, and residual risk in a form the safety engineer can lift directly.

How does a perception validation package map onto the ADAS automation level the system is being released for?
The section list stays the same across levels, but the evidence bar per section rises between L2 and L3, concentrated in failure behaviour and residual risk. At L2 the supervising driver is the fallback, so the reviewer mostly checks that claims and boundaries are honestly stated; at L3 the system handles the dynamic driving task and handover, so those sections face far heavier scrutiny without a human backstop. Stating the target level up front is part of engineering the package to the reviewer’s question.

A perception team that wants its release to clear review in one round should build the package the way reliability artefacts are built everywhere on the production AI reliability line: structured around the question the reader is trying to answer, not the process that produced the data. The failure class to watch for is the backlog-as-package substitution: if your evidence is organised by sprint instead of by reviewer question, you have not built the artefact yet, only the raw material for it.
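The delta review discussed earlier (what moved, in which section, by how much) can be sketched as a comparison of two evidence snapshots that share the same section list. This is a rough illustration, not a prescribed tool: the `delta_review` function, the metric names, and every number below are hypothetical placeholders, not measured results.

```python
def delta_review(previous, current, tolerance=0.0):
    """List what moved between two releases, section by section.

    Both arguments map section name -> {metric name: value}. The section
    list is expected to be identical; that is the invariant frame.
    """
    if previous.keys() != current.keys():
        raise ValueError("section list changed; the frame should be invariant")
    moved = []
    for section in previous:
        old, new = previous[section], current[section]
        for metric in sorted(old.keys() | new.keys()):
            before, after = old.get(metric), new.get(metric)
            if before is None or after is None:
                # Evidence was added or dropped: always worth a reviewer's eye.
                moved.append((section, metric, before, after, None))
            elif abs(after - before) > tolerance:
                moved.append(
                    (section, metric, before, after, round(after - before, 4))
                )
    return moved


# Hypothetical evidence snapshots for two model versions (illustrative values).
release_a = {
    "Performance claims": {"recall_at_iou_0.5": 0.91},
    "Robustness / degradation": {"recall_drop_heavy_rain": 0.12},
    "Residual risk": {"open_residuals": 3},
}
release_b = {
    "Performance claims": {"recall_at_iou_0.5": 0.93},
    "Robustness / degradation": {"recall_drop_heavy_rain": 0.12,
                                 "recall_drop_low_sun": 0.07},
    "Residual risk": {"open_residuals": 3},
}

for section, metric, before, after, delta in delta_review(release_a, release_b):
    print(section, metric, before, "->", after, "delta:", delta)
```

The function refuses to compare packages whose section lists differ. That is a deliberate choice for this sketch: a changed section list means the frame itself moved, so a delta is no longer meaningful.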