Cybershark | Research PREPRINT · OSF · DOI:10.17605/OSF.IO/S9GU6

Empirical study · GenAI security controls

Do deployed AI security controls prevent malicious output — or remediate it after generation?

Abstract

Across a controlled harness of ~6.1M inference calls on models ranging from 1.7B to 119B parameters, I evaluated whether common guardrails block malicious generation or merely act on it post hoc. Every behavioral and structural control tested was bypassed or permitted malicious generation, except one applied at the generation layer, which blocked it in 100% of in-harness attempts. Under that control, an abliterated model's data-exfiltration success fell from 97.85% to 0%.


[Figure: control survival across harness. Control classes shown: sanitization, classifiers, refusal training, output filter, generation-layer control (held).]

Fig. 1 — share of attempts that reached malicious output, by control class. Lower is better; the generation-layer control held across the harness.

Method

A fixed adversarial harness issued ~6.1M inference calls against models spanning 1.7B–119B parameters, exercising behavioral controls (refusal training, classifiers) and structural controls (sanitization, output filtering, validators) as deployed in typical GenAI integrations. Each control was scored on whether malicious output was prevented or merely acted on after generation.
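The prevented-versus-remediated distinction can be made concrete with a small scoring sketch. This is only an illustration under stated assumptions, not the harness used in the study: the `Attempt` fields, the `Outcome` labels, and the control names are all hypothetical, and the preprint does not describe its scoring code here.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PREVENTED = "prevented"    # malicious output was never generated
    REMEDIATED = "remediated"  # generated, then caught after the fact
    REACHED = "reached"        # generated and passed through the control


@dataclass(frozen=True)
class Attempt:
    # All fields are hypothetical labels chosen for this sketch.
    control: str     # e.g. "output filter"
    generated: bool  # did the model produce malicious output?
    delivered: bool  # did that output get past the control?


def classify(attempt: Attempt) -> Outcome:
    """Map one attempt to an outcome; generation is checked first."""
    if not attempt.generated:
        return Outcome.PREVENTED
    return Outcome.REACHED if attempt.delivered else Outcome.REMEDIATED


def score(attempts):
    """Return, per control, the share of attempts in each outcome."""
    counts = {}
    for a in attempts:
        counts.setdefault(a.control, Counter())[classify(a)] += 1
    return {
        control: {o: c[o] / sum(c.values()) for o in Outcome}
        for control, c in counts.items()
    }
```

Under this framing, a control that only ever lands in the "remediated" bucket still allowed the malicious text to be generated, which is the distinction the scoring is meant to surface.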

Findings

Every behavioral and structural control tested was bypassed or allowed malicious generation — except one. A single generation-layer control blocked malicious output in 100% of in-harness attempts, before the model could produce it. An abliterated model — refusal mechanism removed — dropped from 97.85% data-exfiltration success to 0% under that control.

Scope & limits

Results are limited to this test harness and are not a claim that prompt injection is solved in general. The figures describe in-harness behavior across the models and attack classes tested. Cost savings discussed elsewhere on this site are a modeled projection that scales with deployment, not a measured client outcome.

How to cite

Ovando, V. (2026). Do deployed AI security controls prevent malicious output, or remediate it after generation? OSF Preprint. https://doi.org/10.17605/OSF.IO/S9GU6

Want this bar around your own integration layer?

Bring me your riskiest GenAI integration point. 25 minutes, no pitch — we pressure-test it live and you leave with a threat-model sketch.
