How to prevent prompt injection: defense patterns that actually work
Prompt injection isn't going away. A practical look at the defence patterns that hold up in production B2B systems — instruction firewalls, retrieval isolation, output validation, and the limits of each.
Prompt injection is the SQL injection of the LLM era — except worse, because there's no equivalent of parameterised queries that solves it cleanly. Anything that mixes trusted instructions with untrusted text in the same prompt is a target, and the threat surface keeps expanding as agents get more tool access.
Here are the defence patterns we actually use in production B2B systems, what each one buys you, and what it doesn't.
Pattern 1 — separate trusted from untrusted explicitly
Most prompt injection works because the model can't tell where its instructions end and the user's content begins. The minimum hardening is to separate them with explicit boundaries:
- Wrap untrusted content in tagged delimiters that can't appear in the user input.
- State explicitly in the system prompt that anything inside those tags is data, not instructions.
- Tell the model to ignore any instructions found inside the data block.
This stops the simplest direct injection attacks. It does not stop sophisticated ones. The model is still ultimately a probabilistic system; a clever enough attacker can still flip its behaviour. Treat this as a baseline, not a solution.
Pattern 2 — least privilege on tool calls
If the model can call tools, the model can be tricked into calling tools it shouldn't. The defence is identical to defence-in-depth in any system: assume the caller is adversarial and gate every tool by what the actual user is authorised to do.
- Tool calls inherit the user's auth context, never the system's.
- High-impact tools (write, delete, send) are gated behind explicit user confirmation.
- Irreversible actions are never auto-executed — they always go through a human.
- Every tool call is logged with the prompt that triggered it for forensic review.
Pattern 3 — retrieval isolation
Retrieval-augmented systems are a major injection vector because the retrieved documents may contain adversarial content. A user can't usually inject directly — but a document somewhere in your corpus can.
Defences that hold up:
- Retrieved chunks are wrapped in the same trusted/untrusted delimiters as user input.
- Retrieved content is sanitised — strip URL-encoded payloads, base64 blobs, and obviously adversarial markup before embedding.
- If documents come from user-uploaded sources, treat them as fully untrusted and consider a separate scoring layer that rejects suspicious chunks.
- Per-user retrieval namespaces — a user can only retrieve from documents they're authorised to see, even if the embedding model would surface others.
Pattern 4 — output validation as a safety net
Even if the prompt is compromised, you can often catch the consequence in the output. This is where output validation pays for itself a second time.
- Schema validation — if the output isn't valid JSON conforming to the expected schema, reject it. Most jailbreaks produce output that doesn't conform.
- Citation enforcement — if claims must reference retrieved sources by ID and the output cites a source that wasn't in the retrieval result, reject it.
- PII / sensitive data leak detection — scan the output for content that wasn't in the user's authorised context.
- Topic / scope classifier — outputs that drift outside the intended workflow (e.g. an HR assistant suddenly answering legal questions) are rejected.
Pattern 5 — adversarial evals as a continuous test
Build a corpus of known prompt injection patterns — direct ("ignore previous instructions"), indirect (poisoned retrieved content), encoding-based (base64, leet-speak), role-play ("pretend you are"), instruction smuggling (instructions hidden in URLs, code comments, or images), and run them against every change. Treat any failure as a launch blocker.
There are open-source corpora you can start from, but the durable ones are the ones you write from your own production logs. Every successful injection in the wild becomes a permanent test case.
What doesn't work — or works less than you think
- Telling the model very firmly to ignore injection attempts. This works for the first wave of attacks. It does not generalise.
- A second LLM that classifies whether the user input contains injection attempts. The classifier is also an LLM and is also injectable. Useful as defence-in-depth, not as a sole defence.
- Trusting that future model versions will be immune. They won't.
- Hiding the system prompt. It will leak. Plan for that.
The honest summary
There is no single fix for prompt injection — only a stack of imperfect defences that, together, push the cost of attack high enough that an attacker moves on. The system is secure when the value of breaking it is less than the effort required.
The systems that fail in production are the ones that bet on a single defence — usually "the model will figure it out." The systems that hold up are the ones that assume every layer can be breached and put a control in front of every action that matters.
We run prompt-injection adversarial reviews as a one-week engagement against existing AI systems before they ship to customers. If you want a hardened second opinion before launch, get in touch.
