How to deploy AI in production safely: a B2B leader's checklist
The production-readiness checklist for AI systems that handle real customer impact. Covers evals, output validation, observability, security review, and the failure modes nobody warns you about.
Most production AI failures aren't dramatic. They're a slow leak — a slightly worse customer experience here, a hallucinated number in a report there, a confidently wrong answer to a regulator's auditor six months from now. By the time you notice, you've quietly eroded trust in the system.
Production-grade AI looks different from prototype AI in ways that aren't visible from the outside. The model is the smallest part. Everything around it is the work. Here's the checklist we walk through with clients before any AI system gets pointed at real customers.
Evals — the part nobody wants to do
If you do nothing else from this list, do this. An eval is a fixed set of representative inputs with known-good outputs that you can run on every prompt change, model change, or version bump. Without evals, you have no answer to the question "is this version better than the last one?" — you just have vibes.
The eval set doesn't need to be huge. Fifty cases curated by the people who actually understand the workflow beats five thousand cases generated automatically. Add a case every time something goes wrong in production. Over a year you'll have an eval suite that captures every failure mode you've ever seen, and any future change has to clear that bar before it ships.
Output validation — the cheap layer that catches expensive mistakes
An LLM will, with some non-zero probability, produce output that violates your contract with the user. Maybe it makes up a citation. Maybe it answers a question outside its scope. Maybe it leaks PII from a retrieved document. Maybe it gives medical advice when it shouldn't.
You catch this with a layer that runs after the model returns and before the output reaches the user. The layer is cheap, deterministic, and stupid — exactly the qualities you want in a guard. Examples:
- Schema validation — output is JSON conforming to a schema, or it's rejected.
- Citation enforcement — every claim must reference a retrieved source by ID, or it's rejected.
- PII detector — output is scanned for names, numbers, emails not present in the user's authorised context, and rejected if found.
- Topic classifier — output is classified as on- or off-topic, off-topic outputs replaced with a refusal.
- Toxicity / safety classifier — output passes a content filter before display.
Run as many of these as your latency budget allows. They're cheap and they catch real problems.
Observability — log everything, redact what matters
Every model call should be logged with: timestamp, user, input, output, latency, token counts, cost, model version, prompt version, and the result of every validation check. You will need all of this. The first time something goes wrong in production, you will spend a day reconstructing what happened — unless you logged it.
Redact PII at the logging layer. Don't store unredacted user inputs in your observability stack. This is a cheap mistake to make and an expensive one to clean up.
Rate limits, circuit breakers, and graceful degradation
AI systems fail in three modes that traditional services don't. They get expensive (a runaway agent loop). They get slow (model provider degradation). They get wrong (a model update that subtly changes outputs). For each, you need a defence:
- Rate limits per user, per key, per workflow — protects against runaway loops and abuse.
- Cost circuit breakers — daily and hourly spend caps that page on-call when breached.
- Latency-based circuit breakers — if model latency exceeds threshold, fail over to a smaller model or a non-AI fallback.
- Eval gates on model changes — no new model version reaches production until it passes the eval suite.
Security review — assume the prompt is hostile
Anything in your prompt that came from a user, a document, or a retrieved web page is hostile by default. Treat it like SQL — escape, validate, isolate. The detail of how to do this is a whole separate post (we wrote one — see the prompt injection piece).
The minimum viable security review for production AI:
- Threat model the system: who's the adversary, what are they trying to do, where do they get input?
- Document every place untrusted text enters a prompt. For each, document the mitigation.
- Document every tool the model can call. For each, document who's authorised, what's reversible, and what damage a malicious call could do.
- Run an adversarial eval suite — known prompt injection patterns, jailbreaks, scope-evasion attempts. The system should refuse all of them.
- Get a written sign-off from your security team. If you don't have one, hire someone for a one-week review.
Human override and rollback
Every production AI workflow should have a working off-switch — a config flag that turns the AI behaviour off and routes everything to the previous human-only path. Test the off-switch regularly. The first time you need it will not be a good time to discover that it's broken.
For agentic workflows specifically: every action the agent takes should be reversible, or gated behind a human review for the irreversible ones. Sending an email is reversible (you can apologise). Wiring money is not. Refunding a customer is not. Posting publicly is barely reversible. Build the gates accordingly.
The launch criteria
Before pointing the system at real customers, you should be able to answer yes to all of the following:
- Eval suite exists, passes, and runs on every change.
- Output validation layer is in place and tested.
- All calls are logged with PII redaction.
- Rate limits and cost circuit breakers are active.
- Latency-based fallback is tested.
- Threat model is written and signed off.
- Adversarial evals pass.
- Off-switch is tested and on the runbook.
- Irreversible actions are gated.
- On-call rotation knows the system exists and how to triage it.
This is more work than most teams budget for. It's also the difference between AI that quietly compounds value and AI that quietly compounds risk.
We run this checklist as a one-week production-readiness review against your specific system before it ships. Get in touch if you'd like to find the gaps before your customers do.
