A working audit of the controls that actually hold — and a way to spot the ones that only look like controls. Print it, run it against your own AI system, and check the boxes you can honestly defend.
The model is never a security control. No amount of prompting, instruction, or training can stop an LLM from being influenced by its inputs in a way that's harmful. That means every real control must live outside the model. Prompting and fine-tuning improve the quality of output under normal operation — they are not, and can never be, security or safety controls.
Security is anything that impacts the technical integrity of the system itself. Safety is everything else. Rule of thumb: if it requires human interpretation to be dangerous, it's a safety issue. This checklist focuses on security — though most of these controls help with safety too.
Off the road, into a ditch, through a storefront. So you bolt on guardrails, rumble strips, lane warnings, automatic braking — all compensating for the fact that the car can go anywhere. Every mainstream AI tool is a better guardrail on a car. Still a car. Still hoping the guardrail holds.
Not because something steered it away — because it's on tracks. At every moment it's physically constrained to the path you laid. "What if it veers off" isn't a risk; veering off isn't a capability. Grammars and embeddings lay track. They're structural guarantees, not guardrails.
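To make the "on tracks" idea concrete, here is a small added illustration (not code from the page, the whitepaper, or Tantalus) of grammar-constrained decoding: a hand-written DFA defines which tool calls are even expressible, and the mask is applied at the generation layer, before any token is chosen. The vocabulary, the grammar, and the "adversarial model" that always prefers exfiltration tokens are all constructed for this example; a real deployment applies the same mask to the model's logits at every step rather than to a toy score table.

```python
"""Toy constrained decoding: an allowlist grammar enforced outside the model.

The 'model' below is a hand-made scorer whose preferences are adversarial.
The grammar decides which tokens are sampleable at all, so the model's
preferences never matter for tokens outside the track.
"""

# Constructed toy vocabulary (not from the page). The last three tokens
# stand for actions a compromised model would like to emit.
VOCAB = [
    "<eos>", "call", "(", ")", ",",
    "read_file", "send_email",                          # allowed tools
    "report.txt", "bob@corp.example",                   # allowed arguments
    "curl", "/etc/passwd", "http://attacker.example",   # never in the grammar
]

# Hand-written DFA for the language:  call ( TOOL , ARG ) <eos>
# ARG depends on TOOL. Anything outside these transitions is unreachable,
# which is the structural guarantee: there is no "veering off" move.
TRANSITIONS = {
    ("start", "call"): "open",
    ("open", "("): "tool",
    ("tool", "read_file"): "comma_read",
    ("tool", "send_email"): "comma_mail",
    ("comma_read", ","): "arg_read",
    ("comma_mail", ","): "arg_mail",
    ("arg_read", "report.txt"): "close",
    ("arg_mail", "bob@corp.example"): "close",
    ("close", ")"): "end",
    ("end", "<eos>"): "accept",
}


def allowed_tokens(state):
    """The mask: tokens with an outgoing transition from this state."""
    return {tok for (s, tok) in TRANSITIONS if s == state}


def accepts(tokens):
    """True if the token sequence is a complete sentence of the grammar."""
    state = "start"
    for tok in tokens:
        state = TRANSITIONS.get((state, tok))
        if state is None:
            return False
    return state == "accept"


def adversarial_model(history):
    """Stand-in for a compromised LLM (constructed, not a real model).

    Returns a score per vocab token and always prefers exfiltration tokens,
    regardless of what it has emitted so far.
    """
    scores = {tok: 0.0 for tok in VOCAB}
    scores["curl"] = 5.0
    scores["http://attacker.example"] = 4.0
    scores["/etc/passwd"] = 3.0
    scores["send_email"] = 2.0
    scores["<eos>"] = 1.0 if history else -1.0
    return scores


def decode(model, constrained, max_steps=12):
    """Greedy decoding. With constrained=True the argmax is taken only over
    the grammar's allowed set; with constrained=False the model's preference
    wins outright (this is what a prompt-only 'guardrail' leaves you with)."""
    state = "start"
    out = []
    for _ in range(max_steps):
        scores = model(out)
        candidates = allowed_tokens(state) if constrained else set(VOCAB)
        if not candidates:
            break
        tok = max(candidates, key=lambda t: scores[t])
        out.append(tok)
        if constrained:
            state = TRANSITIONS[(state, tok)]
        if tok == "<eos>":
            break
    return out


if __name__ == "__main__":
    print("unconstrained:", " ".join(decode(adversarial_model, constrained=False)))
    print("constrained:  ", " ".join(decode(adversarial_model, constrained=True)))
```

The adversarial scorer still gets to choose between `read_file` and `send_email` (it picks `send_email`), which is the point: the grammar fixes the form and the allowlist, not the model's intent. The guarantee is only as strong as the argument sets you lay down, so each tool's arguments must be as tight as that tool's risk requires (an allowlisted recipient here, not an arbitrary address). The unconstrained run produces nothing but `curl`, showing that with the mask removed the model's preferences are the only thing left.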
There are only a handful. The simplest two are the strongest.
Stress-tested across ~6.1 million inference calls on models from 1.7B to 119B parameters. Every behavioral and structural control was bypassed or allowed malicious generation — except one. A single generation-layer control had a provable 100% block rate. An abliterated model's data-exfiltration success fell from 97.85% to 0% under it.
Whitepaper · doi.org/10.17605/OSF.IO/S9GU6
Prompt injection is a solved problem. Tantalus lets you try to break it yourself: a realistic AI agent with files, email, and chat — plus poisoned tools. Bypass every mainstream defense in round one; face the one generation-layer control in round two.