What an AI Security Assessment Tests on Your RAG, Chatbot, or Agent

Your AI vendor’s SOC 2 report does not answer the question your enterprise security reviewer is actually asking. When a procurement reviewer probes a RAG, chatbot, or agent feature, the question is not whether the model provider has good controls — it is whether your deployment leaks data through retrieval, executes unsafe tool calls, or can be steered off-task by text the user (or a document) supplies. That is a property of the system you built, not the model you called. No vendor security marketing can vouch for it, because no vendor can see how you wired retrieval to your data, what actions you gave an agent, or which guardrails you actually enabled.

This is where most teams get caught. They treat the AI surface as a generic web application, point at the LLM provider’s compliance page, and assume the security review will move on. It does not. The first time the reviewer asks “what happens if a user pastes instructions that tell your assistant to ignore its system prompt and dump the retrieved context,” the deployment that leaned on vendor claims has no answer — and a procurement block lands after the feature is already built.

An AI security assessment is the defensive exercise that produces the answer in advance. It runs against your own deployed RAG, chatbot, or agent, exercises the attack surfaces that are specific to language-model systems, and outputs an evidence pack tailored to the AI layer. It is not penetration testing of your network, and it is not offensive red-teaming of someone else’s product. It is the security-side counterpart to the release-readiness gate that decides when an AI feature is ready to ship.

What Does an AI Security Assessment Cover Beyond Model Accuracy?

Accuracy and security are orthogonal. A RAG assistant can answer questions correctly 95% of the time and still hand a motivated user the contents of documents they were never authorised to see. The accuracy eval will not catch that, because the failure is not a wrong answer — it is a right answer to a question the system should have refused.

The assessment covers four exposure classes that a generic application security review tends to miss because they are properties of the language model’s behaviour, not of the surrounding code:

Prompt-injection — text in the user input, or in a retrieved document, that overrides the system instructions. Direct injection comes from the user; indirect injection arrives through content the model ingests (a web page, a PDF in the knowledge base, an email an agent reads). Indirect injection is the harder one, because the attacker never touches your front end.
Unsafe tool use — for agents, the action set matters more than the prose. If an agent can call a function that sends email, writes to a database, or hits an internal API, the assessment tests whether crafted input can make it call that function with attacker-chosen arguments.
Data leakage — whether the retrieval layer surfaces content across tenant or permission boundaries, whether the model regurgitates system prompts or embedded secrets, and whether conversation history bleeds between sessions.
Abuse-case exposure — the deployment-specific ways the feature can be turned to a purpose it was never scoped for: generating disallowed content, exhausting a paid downstream API, or extracting the proprietary instructions that encode your business logic.

The throughline is that the security of an AI feature lives in the integration, not in the model weights. That is the single claim worth carrying out of this article. The model is a component; the attack surface is the system you assembled around it.

How Do We Test Our Own RAG Deployment for Prompt-Injection and Unsafe Tool Use?

You test it the way an attacker would, against the running system, with the real retrieval index and the real tool bindings — not against a sanitised demo. A meaningful assessment instruments the deployment at the points where language crosses a trust boundary.

For prompt-injection, that means seeding the knowledge base with documents that carry adversarial instructions and observing whether retrieval-then-generation honours them. It means submitting inputs that attempt to escape the system prompt and measuring how often the guardrail holds across paraphrases — a single canned jailbreak string is not a test; a campaign of semantic variants is. Frameworks like Microsoft’s open-source PyRIT and the OWASP guidance on LLM application risks give a structured starting catalogue of attack patterns, but the value is in running them against your index and your prompts, because injection that fails on a generic chatbot can succeed against a system whose retrieved context is unusually instruction-like. A published static prompt set like AdvBench measures something narrower than this campaign — useful as a release-checklist input, not a substitute for testing against your own deployment.

For unsafe tool use, the test exercises the agent’s action set directly. If the agent uses a framework like LangChain or the OpenAI function-calling interface to invoke tools, the assessment enumerates every reachable function, then attempts to drive each one through adversarial input — checking both whether the call fires and whether the arguments can be attacker-controlled. The dangerous pattern we see regularly is an agent with a broad tool set and a narrow set of input validations, where the model is trusted to be the access-control layer. It should never be. (Observed across our engagements with agent deployments; not a published benchmark.)

A Minimal Assessment Checklist for a RAG or Agent Feature

Before a security reviewer asks, walk this list against the deployed surface:

Trust boundaries mapped — every place where user, retrieved, or tool-returned text enters the model context is identified and labelled by trust level.
Direct injection tested — system-prompt-override attempts run as a campaign of paraphrases, with a measured hold rate, not a single passing example.
Indirect injection tested — adversarial instructions planted in indexable content, with the retrieval-to-generation path observed end to end.
Tool authorisation checked — every agent action gated by a control outside the model; the model is not the access-control layer.
Data-boundary checks — retrieval cannot cross tenant or permission scopes; system prompt and secrets are not extractable.
Abuse cases enumerated — deployment-specific misuse scoped, with a documented decision for each (mitigated, accepted, or out of scope).
Logging and detection — the attack vectors above are observable in production telemetry, not silent.

The checklist is self-contained on purpose. A reviewer can read it, and so can your own team, before the review meeting rather than during it.

How Does the Assessment Differ From Generic Penetration Testing?

A traditional penetration test targets infrastructure and application code: open ports, injection into SQL, broken authentication, misconfigured storage. Those still matter, and a competent assessment assumes they are being handled by your existing security program. What a generic pen test does not cover is the model’s behaviour as an attack surface — because that surface did not exist in the threat models most pen-test methodologies were built around.

The distinction is worth stating in a table, because the two activities are complementary, not interchangeable.

Dimension	Generic penetration test	AI security assessment
Primary target	Network, infrastructure, application code	The language-model integration: prompts, retrieval, tool bindings
Core failure class	Injection into code/queries, auth bypass, misconfig	Prompt-injection, unsafe tool use, data leakage via retrieval
What “input” means	Structured requests, payloads	Natural-language text, including text inside retrieved documents
Access-control assumption	Enforced in code	Often (wrongly) delegated to the model
Evidence produced	Vuln findings against CVE-style classes	AI-surface findings the security reviewer cannot get elsewhere
Reproducibility	Deterministic for a given payload	Probabilistic — must measure hold rates across variants

The probabilistic row is the one that trips teams up. A code injection either works or it does not. A jailbreak might succeed one time in twenty, which means the assessment has to report a rate, and the deployment has to decide what rate is acceptable. That framing — measuring behaviour across a campaign rather than asserting a binary pass — is the same discipline that makes a task-specific LLM evaluation survive a procurement review. Security evals and quality evals share the same statistical backbone.

What Abuse-Case Eval Harness Do We Keep Running Post-Deployment?

A one-time assessment is a snapshot. Models get swapped, retrieval indices grow, tool sets expand, and prompts get edited under deadline pressure — every one of those changes can reopen a vector that was closed at launch. The deployments that stay defensible keep the abuse-case suite running as a standing harness, not a launch-day artifact.

In practice that means the adversarial test campaign from the initial assessment becomes a regression suite that runs on every change to the prompt, the index schema, the tool set, or the underlying model version. When you upgrade from one model generation to the next, the harness re-measures the injection hold rate before the change reaches production — because a model that is better at following instructions is, by the same property, often better at following injected instructions. This is the security slice of the broader production AI monitoring harness that watches an AI feature in operation. The methodology defines what to test; the harness applies it continuously to your deployed surface, and our AI engineering services are organised around standing it up against your specific RAG, chatbot, or agent.

Most AI features that reach a security review have never had their attack surface measured at all — security posture is part of the operational track record the buyer is then asked to defend, and an unmeasured surface is an indefensible one. Keeping the harness live converts “we believe it’s secure” into “here is the measured hold rate, and here is the regression suite that protects it.”

What Evidence Does an Enterprise Security Reviewer Expect From the AI Side of the Stack?

A reviewer evaluating an AI feature is building a case they can sign. They want artifacts, not assurances. The evidence pack from a defensive assessment maps directly onto what they will ask for:

A trust-boundary diagram of the AI surface, showing where untrusted text enters the model context.
A threat enumeration covering prompt-injection (direct and indirect), unsafe tool use, data leakage, and abuse cases, with each one marked mitigated, accepted, or out of scope.
Measured hold rates for the adversarial campaign, framed as rates across variants rather than single examples — an operational measurement against the deployed system, named as such.
The mitigations in place and the control that enforces each one, with explicit confirmation that access control sits outside the model.
The standing regression harness and the change-triggers that re-run it.

That pack is what closes a security review on the first pass instead of the third. It is also the AI-security complement to a generative-AI model-risk review that earns governance approval without theatre — the risk review owns model behaviour and governance; the security assessment owns the adversarial attack surface. Together they answer the two distinct questions an enterprise asks before it lets an AI feature touch its data.

FAQ

What does an AI security assessment cover beyond model accuracy?

It covers the attack surfaces specific to language-model systems: prompt-injection (direct and indirect), unsafe tool use by agents, data leakage through retrieval, and deployment-specific abuse cases. These are orthogonal to accuracy — a feature can answer correctly and still leak unauthorised data or execute attacker-chosen actions. The security lives in the integration, not in the model weights.

How do we test our own RAG deployment for prompt-injection and unsafe tool use?

You test the running system with its real retrieval index and tool bindings. For injection, seed the knowledge base with adversarial instructions and run a campaign of paraphrased system-prompt-override attempts, measuring the hold rate rather than checking a single string. For tool use, enumerate every reachable agent function and try to drive each one through adversarial input, verifying that access control sits outside the model.

What abuse-case eval harness do we keep running post-deployment?

The adversarial campaign from the initial assessment becomes a standing regression suite that re-runs on every change to the prompt, index, tool set, or model version. Because a model that follows instructions better also follows injected instructions better, the harness should re-measure injection hold rates before any model upgrade reaches production.

How does the assessment differ from generic penetration testing?

A penetration test targets infrastructure and code — ports, SQL injection, auth bypass, misconfiguration. An AI security assessment targets the model’s behaviour as an attack surface: prompt-injection, unsafe tool use, and retrieval-based data leakage. A key difference is reproducibility — code injections are deterministic, while jailbreaks are probabilistic and must be reported as measured rates across variants. The two are complementary, not interchangeable.

What evidence does an enterprise security reviewer expect from the AI side of the stack?

A trust-boundary diagram of the AI surface, a threat enumeration covering injection, tool use, leakage, and abuse cases (each marked mitigated, accepted, or out of scope), measured adversarial hold rates, the mitigations and the control enforcing each, and the standing regression harness with its change-triggers. That artifact pack closes a security review on the first pass instead of the third.

When the procurement reviewer asks the prompt-injection question — and on enterprise deals they now do — the only durable answer is a measured one: here is the rate at which our guardrail holds, against this index, with this tool set, and here is the harness that re-measures it the next time anything changes. A deployment that can produce that is defensible; one that cites the vendor’s compliance page is waiting to be blocked.