Software Audit, in Practice: What It Tests and Where AI Systems Differ

A software audit is supposed to answer one question: is this system sound enough to keep building on, ship, or acquire? For deterministic software that question has a clean answer. You scan the repository, inventory the dependencies, flag the known CVEs, check test coverage against the critical paths, and produce a ranked list of what to fix. The behaviour you tested is the behaviour you ship — so confirming the code is sound is, broadly, confirming the product is sound.

That equivalence quietly breaks the moment an AI or ML feature enters the system. A clean code audit can pass — no critical CVEs, healthy coverage, no obvious defects — while the deployed model silently drifts, hallucinates, or regresses on inputs no static check ever exercised. The audit confirmed the code. It said nothing about whether the behaviour the model produces in production is reliable. That gap is the subject of this article.

What a Software Audit Actually Tests

Most teams use “software audit” loosely, but the engagement has a recognisable shape regardless of who runs it. It examines four layers, and it is worth being precise about what each one proves.

Code quality and defect density. Static analysis, complexity hotspots, dead code, error-handling gaps. This tells you how maintainable the system is and where bugs cluster.
Dependency exposure. A software bill of materials, known-vulnerability scanning (CVE matching against your locked versions), license compliance, and the staleness of transitive dependencies. This tells you your inherited risk surface.
Test posture. Coverage against critical paths, the ratio of unit to integration tests, whether the suite actually exercises failure modes or just happy paths. This tells you how much of the system’s behaviour is pinned down.
Operational surface. How the system behaves once deployed: who owns failures, how regressions are detected, what the rollback path is, whether monitoring exists for the things that actually break.

The first three layers are where most audits stop, and for deterministic software that is reasonable. The fourth layer is where the expert version of the engagement lives — and it is the layer that decides whether the audit means anything for an AI-bearing system.

Here is the distinction that matters: a standard audit validates the artifact (the code, its dependencies, its tests). A complete audit validates the operational surface (how the running system behaves and how you find out when it stops behaving). For ordinary CRUD software those two converge. For AI systems they diverge sharply, because the model’s runtime behaviour is not pinned by the source — it is a function of inputs the audit never saw and data distributions that move after you sign off.

How Does a Software Audit Work in Practice?

A scoped audit runs in stages, and naming them up front keeps the engagement from sprawling. The typical arc is: scope and access, automated sweep, manual review, operational examination, and a ranked remediation roadmap.

Scope and access fixes the boundary — which repositories, which services, which deployment environments, and what the buyer actually needs to decide (ship gate, acquisition due diligence, post-incident review). The automated sweep runs static analysis, dependency scanning, and coverage measurement. Manual review interprets the sweep: a CVE in a transitive dependency that is never reachable from your call graph is noise, while a medium-severity issue on your authentication path is not. The operational examination — often skipped — checks ownership, detection, and rollback. The output is a ranked, costed list of findings, not a wall of scanner output.
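The manual-review step can be sketched as a small triage pass over scanner findings. This is an illustration only: the `Finding` shape, the severity scale, and the two boolean flags are hypothetical and do not correspond to any particular scanner's output format. It assumes a reviewer has already judged reachability and path criticality by hand.

```python
from dataclasses import dataclass

# Hypothetical severity scale; real scanners use their own schemas.
SEVERITY_SCORE = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass(frozen=True)
class Finding:
    identifier: str
    severity: str            # one of the SEVERITY_SCORE keys
    reachable: bool          # reachable from our call graph (reviewer's judgement)
    on_critical_path: bool   # e.g. the authentication path


def triage(findings):
    """Split findings into (actionable, noise); actionable is sorted by priority."""
    actionable = [f for f in findings if f.reachable]
    noise = [f for f in findings if not f.reachable]
    # Critical-path findings outrank the rest; severity breaks ties.
    actionable.sort(
        key=lambda f: (f.on_critical_path, SEVERITY_SCORE[f.severity]),
        reverse=True,
    )
    return actionable, noise


findings = [
    Finding("transitive-cve", "critical", reachable=False, on_critical_path=False),
    Finding("auth-session-issue", "medium", reachable=True, on_critical_path=True),
    Finding("logging-helper-issue", "high", reachable=True, on_critical_path=False),
]
actionable, noise = triage(findings)
print([f.identifier for f in actionable])
print([f.identifier for f in noise])
```

In this sketch the medium-severity finding on the authentication path ranks above the high-severity one elsewhere, and the unreachable critical CVE is set aside, which mirrors the interpretation described above.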

That last point is the entire value. We see teams treat the scanner report as the audit. A raw scanner dump is unranked, uncosted, and context-free; it tells you a thousand things are wrong without telling you which three matter this quarter. The audit’s job is to convert that noise into a remediation roadmap with effort estimates — defect density, dependency exposure, and coverage gaps each carrying a cost-to-fix and a cost-of-ignoring. Without that conversion, you have a scan, not an audit.
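One way to picture the conversion from scan to roadmap is a ranking by how much ignoring a finding costs relative to fixing it. The sketch below is a simplification under stated assumptions: the `RoadmapItem` fields, the ratio used for ranking, and every number in the example are hypothetical, and a real engagement would weigh more than a single ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoadmapItem:
    title: str
    cost_to_fix_days: float       # estimated engineering effort
    cost_of_ignoring_days: float  # estimated loss if left alone, in the same unit

    @property
    def leverage(self) -> float:
        # Higher means more avoided cost per unit of effort spent.
        return self.cost_of_ignoring_days / self.cost_to_fix_days


def build_roadmap(items, top_n=3):
    """Return the top_n items with the highest leverage, best first."""
    return sorted(items, key=lambda item: item.leverage, reverse=True)[:top_n]


# Illustrative numbers only.
items = [
    RoadmapItem("Untested payment retry path", 5, 40),
    RoadmapItem("Stale transitive dependency", 2, 3),
    RoadmapItem("Dead code in reporting module", 1, 0.5),
    RoadmapItem("Missing error handling on auth", 3, 30),
]
for item in build_roadmap(items):
    print(f"{item.title}: leverage {item.leverage:.1f}")
```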

What Does a Software Audit Cost, and What Drives the Price?

Cost tracks scope, and scope tracks four variables more than anything else: codebase size, the number of distinct deployment surfaces, the depth of operational examination requested, and whether an AI feature is in play. A read-only audit of a single deterministic service is a different engagement from an acquisition-grade due-diligence pass across a dozen services with an embedded model.

The honest framing is that price is a function of what you need to decide, not of lines of code. The table below sketches how the variables move scope. The cost figures are deliberately omitted: they are engagement-specific, and any single number quoted in an article would be misleading. The drivers, however, are stable across the work we have done in this area.

Scope driver	Low end	High end	Why it moves cost
Codebase size	Single service, one language	Polyglot monorepo, many services	Manual review time scales with surface, not just LOC
Deployment surfaces	One environment	Multi-region, multi-tenant	Each surface has its own operational risk
Operational depth	Code + dependencies only	Ownership, detection, rollback examined	Operational review is manual and judgement-heavy
AI feature present	None	Model in the critical path	Triggers eval-coverage, drift, and rollback auditing that a standard audit never opens

(Scope-driver pattern observed across TechnoLynx audit engagements; not a published price benchmark.)

The single largest cost discontinuity is the last row. Adding “and audit the AI feature properly” is not a 10% uplift on the same checklist — it opens an entirely separate examination, because the questions are different in kind.

Where a Standard Software Audit Falls Short for AI Systems

This is the divergence point. A general software audit confirms the code is sound. It does not — and cannot, with its standard toolkit — confirm that the deployed AI behaviour is reliable. Three things slip straight through a clean audit:

Eval coverage. Static analysis and unit tests check that functions return correct types and handle errors. They do not check whether the model’s outputs are correct across the input distribution it will actually see. A model can pass every code-level test while producing wrong, biased, or unsafe predictions on inputs no test case represented. Auditing this means examining the evaluation set — its coverage, its representativeness, the gap between what was evaluated and what production sends.
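A rough way to surface the gap between what was evaluated and what production sends is to compare category shares in the two sets. The sketch below assumes each input can be given a coarse category label; the function name, the `min_ratio` threshold, and the example labels are all hypothetical.

```python
from collections import Counter


def coverage_gaps(eval_labels, production_labels, min_ratio=0.5):
    """Return production categories that the eval set under-represents.

    A category is flagged when its share of the eval set is below
    min_ratio times its share of the production sample. The result maps
    category -> (eval_share, production_share), rounded to 3 decimals.
    """
    if not production_labels:
        return {}
    eval_counts = Counter(eval_labels)
    prod_counts = Counter(production_labels)
    eval_total = len(eval_labels)
    prod_total = len(production_labels)
    gaps = {}
    for category, count in prod_counts.items():
        prod_share = count / prod_total
        eval_share = eval_counts.get(category, 0) / eval_total if eval_total else 0.0
        if eval_share < min_ratio * prod_share:
            gaps[category] = (round(eval_share, 3), round(prod_share, 3))
    return gaps


# Illustrative data: the eval set never saw the "legal" category.
eval_labels = ["billing"] * 80 + ["refund"] * 20
production_labels = ["billing"] * 50 + ["refund"] * 30 + ["legal"] * 20
print(coverage_gaps(eval_labels, production_labels))
```

Category shares are a blunt proxy for representativeness. They catch whole classes of input the eval set never saw, not subtler shifts within a category.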

Drift posture. Deterministic code does not change behaviour after deployment unless you redeploy it. A model’s effective behaviour changes when the input distribution moves, even though the weights are frozen. A standard audit has no concept of this; there is no CVE for “the world your model was trained on no longer exists.” The distinction between data drift and model drift, and how each changes your reliability response, is exactly the axis a code audit never touches.
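As an illustration of what a data-drift check on one numeric input feature might look like, the sketch below computes a Population Stability Index (PSI) between a reference sample and a recent one. PSI is a swapped-in example technique, not something the audit process above prescribes. The bin count, the epsilon floor, and the sample data are assumptions, and the 0.25 figure in the comment is a commonly quoted rule of thumb rather than a standard.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against `expected`.

    Bins are equal-width over the range of `expected`; values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]      # feature values at sign-off
shifted = [0.5 + i / 200 for i in range(100)]  # recent production values
print(psi(reference, reference))  # identical samples give 0.0
print(psi(reference, shifted))    # well above the 0.25 rule-of-thumb level
```

A check like this only watches the inputs. Whether the model's outputs have degraded still has to be measured against labelled or proxy outcomes.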

Rollout ownership. When a deterministic service breaks, the failure is usually loud — an exception, a 500, a failed health check. When a model regresses, the failure is often silent: the system keeps returning confident answers that are quietly worse. Detecting that requires monitoring on behavioural metrics, a defined time-to-detect, and a clean time-to-rollback. A standard audit checks whether a rollback mechanism exists; it does not check whether anyone would know to use it.
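Time-to-detect and time-to-rollback are simple to compute once the three timestamps of a regression are recorded; the hard part is recording them. The sketch below assumes a hypothetical `RegressionIncident` record whose start time is reconstructed after the fact, and the dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RegressionIncident:
    started: datetime      # when quality degraded (usually reconstructed later)
    detected: datetime     # when someone, or some monitor, noticed
    rolled_back: datetime  # when the previous behaviour was restored

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_rollback(self) -> timedelta:
        return self.rolled_back - self.detected


# Illustrative timestamps only.
incident = RegressionIncident(
    started=datetime(2024, 1, 1, 9, 0),
    detected=datetime(2024, 1, 3, 15, 0),
    rolled_back=datetime(2024, 1, 3, 15, 40),
)
print(incident.time_to_detect)    # 2 days, 6:00:00
print(incident.time_to_rollback)  # 0:40:00
```

In this example the rollback itself took forty minutes, but the system served degraded answers for more than two days before anyone knew to start it.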

For AI-bearing systems, the audit has to extend its ROI story accordingly. Instead of “improved code quality,” the buyer gets operational metrics (eval-coverage delta, time-to-detect, time-to-rollback) so they can quantify the avoided cost of an undetected production regression. That is a concrete number, not a vague quality claim. Testing evals, drift, rollout, and ownership in full is the scope of a production AI reliability audit, which is the engagement a software audit should escalate into when a model sits in the critical path.
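To show how those metrics can turn into a number, here is a deliberately crude model: exposure is the time a regression stays live multiplied by an hourly cost. The linear cost assumption, the function name, and every figure are hypothetical, and real regressions rarely cost a flat rate per hour.

```python
def regression_exposure(time_to_detect_hours, time_to_rollback_hours, cost_per_hour):
    """Cost of one regression under a flat hourly-cost assumption."""
    return (time_to_detect_hours + time_to_rollback_hours) * cost_per_hour


# Illustrative figures: before and after adding behavioural monitoring.
before = regression_exposure(54.0, 2.0, cost_per_hour=100.0)
after = regression_exposure(4.0, 0.5, cost_per_hour=100.0)
avoided = before - after
print(before, after, avoided)  # 5600.0 450.0 5150.0
```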

When to Replace or Extend a Standard Audit With a Reliability Audit

The decision is not subtle once you frame it correctly. Run the diagnostic below.

Is there a model in the critical path of a user-facing or revenue-bearing flow? If no, a standard software audit is sufficient. If yes, continue.
Does a silent quality regression cost more than a loud outage? For most AI features, yes — silent wrongness erodes trust before anyone files a ticket.
Can you currently state your time-to-detect for a model regression? If you cannot answer in a number, your audit must extend to drift posture.
Do you have eval coverage that resembles production input distribution? If unknown, that gap is the highest-value finding the audit will produce.

If the model is in the critical path and any of the last three answers is uncomfortable, the engagement should extend from code-soundness to behavioural reliability. The engineering deliverable of that extended work is not a code-review report — it is a production AI monitoring harness that operationalises eval coverage, drift detection, and rollback as standing infrastructure rather than a one-time finding.
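The diagnostic can be written down as a small decision function. This is a sketch of the rule as stated above, with hypothetical parameter names; `None` stands for "we cannot answer", which the rule treats as an uncomfortable answer.

```python
from typing import Optional


def audit_recommendation(
    model_in_critical_path: bool,
    silent_regression_costlier: bool,
    time_to_detect_hours: Optional[float],
    eval_matches_production: Optional[bool],
) -> str:
    """Return "standard" or "extend" according to the four diagnostic questions."""
    if not model_in_critical_path:
        return "standard"
    uncomfortable = (
        silent_regression_costlier              # question 2
        or time_to_detect_hours is None         # question 3: no number to state
        or eval_matches_production is not True  # question 4: unknown or no
    )
    return "extend" if uncomfortable else "standard"


print(audit_recommendation(False, True, None, None))  # standard
print(audit_recommendation(True, False, None, True))  # extend
```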

A software audit also feeds a downstream gate. For systems carrying an AI feature, the audit’s findings flow into the release-readiness decision framework that decides when an AI feature is ready to ship — the audit supplies the evidence, the framework makes the call. The two are complementary: one examines, one gates.

FAQ

How does a software audit work, and what does it mean in practice?

A software audit examines a system to answer whether it is sound enough to ship, build on, or acquire. In practice it runs in stages — scope and access, an automated sweep (static analysis, dependency scanning, coverage), manual interpretation of that sweep, an operational examination, and a ranked remediation roadmap. The deliverable is a costed list of prioritised findings, not a raw scanner dump.

What does a software audit actually test — code, dependencies, tests, or operational behaviour?

All four, in layers. It tests code quality and defect density, dependency exposure (CVEs, licenses, staleness), test posture (coverage and whether failure modes are exercised), and the operational surface (failure ownership, regression detection, rollback). Most audits stop at the first three; the operational layer is where the expert version of the engagement lives.

Where does a standard software audit fall short for systems with an AI or ML feature?

A clean code audit can pass while the deployed model drifts, hallucinates, or regresses on inputs no static check ever exercised. Standard audits validate the code artifact, but a model’s runtime behaviour is a function of input distributions that move after sign-off. Eval coverage, drift posture, and rollout ownership slip straight through a standard audit’s toolkit.

How much does a software audit cost, and what factors drive the scope and price?

Cost tracks scope, and scope tracks codebase size, the number of deployment surfaces, the depth of operational examination requested, and — most significantly — whether an AI feature is in the critical path. Adding proper AI-feature examination is a discontinuity, not a small uplift, because it opens a separate behavioural examination. Price is a function of what you need to decide, not lines of code.

When should a general software audit be replaced or extended by a production AI reliability audit?

When a model sits in the critical path of a user-facing or revenue-bearing flow, a silent quality regression costs more than a loud outage, and you cannot currently state your time-to-detect or your eval coverage. In that case the engagement should extend from code-soundness to behavioural reliability, with a monitoring harness as the engineering deliverable rather than a code-review report.

A Closing Note on Scope

The cleanest software audit you ever receive may still leave your most important risk unexamined. That is not because the auditor was careless, but because the toolkit ends where the model’s behaviour begins. The discipline is knowing which question you actually need answered before you scope the work. If a model decides anything that matters, the question is not “is the code sound?” but “would we know, in time, if the behaviour stopped being sound?” That second question is what reliability auditing exists to answer, and a general software and engineering audit through our services is the right place to draw the boundary between the two.