Do deployed AI security controls prevent malicious output — or remediate it after generation?
Across a controlled harness of ~6.1M inference calls on models ranging 1.7B–119B parameters, I evaluated whether common guardrails block malicious generation or merely act on it post hoc. Every behavioral and structural control tested was bypassed or permitted malicious generation, except one applied at the generation layer, which blocked it in 100% of in-harness attempts. Under that control, an abliterated model's data-exfiltration success fell from 97.85% to 0%.
Method
A fixed adversarial harness issued ~6.1M inference calls against models spanning 1.7B–119B parameters, exercising behavioral controls (refusal training, classifiers) and structural controls (sanitization, output filtering, validators) as deployed in typical GenAI integrations. Each control was scored on whether malicious output was prevented or merely acted on after generation.
Findings
Every behavioral and structural control tested was bypassed or allowed malicious generation — except one. A single generation-layer control blocked malicious output in 100% of in-harness attempts, before the model could produce it. An abliterated model — refusal mechanism removed — dropped from 97.85% data-exfiltration success to 0% under that control.
Scope & limits
Results are qualified to this test harness — not a claim that prompt injection is solved in general. The figures describe in-harness behavior across the models and attack classes tested. Cost savings discussed elsewhere on this site are a modeled projection that scales with deployment, not a measured client outcome.