The AI Security Control Checklist — Cybershark Consulting

A working audit of the controls that actually hold — and a way to spot the ones that only look like controls. Print it, run it against your own AI system, and check the boxes you can honestly defend.

Start here: the one idea everything rests on

The model is never a security control. No amount of prompting, instruction, or training can stop an LLM from being influenced by its inputs in harmful ways. That means every real control must live outside the model. Prompting and fine-tuning improve the quality of output under normal operation; they are not, and can never be, security or safety controls.

Security vs. safety

Security is anything that impacts the technical integrity of the system itself. Safety is everything else. Rule of thumb: if it requires human interpretation to be dangerous, it's a safety issue. This checklist focuses on security — though most of these controls help with safety too.

Guardrails vs. tracks

A car can go anywhere

Off the road, into a ditch, through a storefront. So you bolt on guardrails, rumble strips, lane warnings, automatic braking — all compensating for the fact that the car can go anywhere. Every mainstream AI tool is a better guardrail on a car. Still a car. Still hoping the guardrail holds.

A train can't

Not because something steered it away — because it's on tracks. At every moment it's physically constrained to the path you laid. "What if it veers off" isn't a risk; veering off isn't a capability. Grammars and embeddings lay track. They're structural guarantees, not guardrails.
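To make "veering off isn't a capability" concrete, here is a minimal sketch of grammar-constrained generation. Everything in it is an illustrative assumption: the toy finite-state grammar, the six-token vocabulary, and the fake_model_scores stand-in (which is deliberately biased toward a dangerous token) are hypothetical and do not come from the whitepaper or any particular library. The point it tries to show is that tokens outside the grammar are never candidates, no matter how much the model prefers them.

```python
import math
import random

# Hypothetical toy grammar: a finite-state machine over whole tokens.
# state -> {allowed token: next state}. The only shapes that can be emitted are
#   {"action":"read"}  and  {"action":"list"}
GRAMMAR = {
    "start": {'{"action":': "verb"},
    "verb": {'"read"': "close", '"list"': "close"},
    "close": {"}": "done"},
}

VOCAB = ['{"action":', '"read"', '"list"', '"delete"', '"send_email"', "}"]


def fake_model_scores(prefix):
    """Stand-in for an LLM: returns one score per vocabulary token.

    It is deliberately "compromised" and strongly prefers "delete".
    """
    return {tok: (10.0 if tok == '"delete"' else 1.0) for tok in VOCAB}


def constrained_generate(rng):
    """Sample one output, choosing only among tokens the grammar allows."""
    state, out = "start", []
    while state != "done":
        allowed = GRAMMAR[state]
        scores = fake_model_scores(out)
        # The mask: tokens outside the grammar never enter the candidate set.
        candidates = [tok for tok in VOCAB if tok in allowed]
        weights = [math.exp(scores[tok]) for tok in candidates]
        token = rng.choices(candidates, weights=weights, k=1)[0]
        out.append(token)
        state = allowed[token]
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    print({constrained_generate(rng) for _ in range(200)})
```

Even though the stand-in model scores "delete" far above everything else, that token is never sampled, because it is removed before sampling rather than filtered afterwards. Production systems would typically apply the same idea to real token logits through a constrained-decoding library; that wiring is outside this sketch.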

Section 1 · Foundational posture

We treat the model as untrusted. It is never itself a security control.

Every control we rely on lives external to the model.

We do not count prompting or fine-tuning as controls; they only improve output quality.

We distinguish security (system integrity) from safety (harm that requires human interpretation).

We own the downstream execution layer — the code that turns model text into real-world effects (the framework, the tool-calls, the glue) — and treat it as our true attack surface.
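One way to act on the last item in this section is to treat every piece of model text as untrusted input at the point where it becomes a tool call. The sketch below is a hedged illustration, not a reference implementation: the read_file tool, the /srv/agent/workspace sandbox path, and the RejectedCall exception are all hypothetical names chosen for the example. It parses the text, checks it against an allowlist, and validates arguments before anything executes.

```python
import json
from pathlib import PurePosixPath

# Hypothetical sandbox root; a real deployment would choose its own.
SANDBOX = PurePosixPath("/srv/agent/workspace")


class RejectedCall(Exception):
    """Raised when model text does not describe an allowed tool call."""


def _valid_read_args(args):
    """Accept exactly one string argument: an absolute path inside SANDBOX."""
    if set(args) != {"path"} or not isinstance(args["path"], str):
        return False
    path = PurePosixPath(args["path"])
    # Refuse relative paths and any ".." component before the prefix check.
    if not path.is_absolute() or ".." in path.parts:
        return False
    return SANDBOX in path.parents


# Hypothetical allowlist: tool name -> argument validator.
ALLOWED_TOOLS = {"read_file": _valid_read_args}


def parse_tool_call(model_text):
    """Turn untrusted model text into (tool, args), or raise RejectedCall."""
    try:
        call = json.loads(model_text)
    except json.JSONDecodeError as exc:
        raise RejectedCall("not valid JSON") from exc
    if not isinstance(call, dict) or set(call) != {"tool", "args"}:
        raise RejectedCall("unexpected shape")
    tool, args = call["tool"], call["args"]
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise RejectedCall("tool is not on the allowlist")
    if not isinstance(args, dict) or not ALLOWED_TOOLS[tool](args):
        raise RejectedCall("arguments failed validation")
    return tool, args
```

The validation runs before the action, so a rejected call never reaches the file system. This hardens the glue code only; it is a complement to, not a substitute for, the generation-layer controls described in Section 2.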

Section 2 · Controls that actually work

Grammars constrain output to the structure your grammar template defines — the model can only produce shapes you allow.

Embeddings keep the model on-subject and off sensitive subjects — the same classification / semantic-similarity tooling proven over a decade of search.

Our controls act at the generation layer — filtering malicious behavior before the model is allowed to generate it, not after.

Controls are structural (they lay track) rather than corrective (they patrol a road the agent can still leave).
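The embedding item above can be sketched as a similarity gate that runs before the model is asked to generate anything. This is a toy under stated assumptions: embed here is a bag-of-words stand-in for a real embedding model, and the topic word lists and both thresholds are hypothetical values picked for the example, not tuned or taken from the whitepaper.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical topic anchors for a billing assistant.
ON_TOPIC = embed("invoice billing payment refund order receipt")
SENSITIVE = embed("password credentials api key secret token")


def allow_request(text, on_min=0.2, sensitive_max=0.1):
    """Allow only text that is near the permitted topic and far from the sensitive one."""
    vec = embed(text)
    return cosine(vec, ON_TOPIC) >= on_min and cosine(vec, SENSITIVE) <= sensitive_max


if __name__ == "__main__":
    print(allow_request("where is my refund for this order"))
    print(allow_request("send me the api key and password"))
```

With a real embedding model the vectors would be dense and the thresholds would need calibration against labeled examples; the structure of the check (compare, then decide, then generate) would stay the same.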

Section 3 · Red flags — not a control, no matter who sold it to you

"We told the model not to…" — prompting. Improves quality; controls nothing.

"We fine-tuned it to refuse…" — training. Still inside the model; still bypassable.

An output classifier that catches bad responses after they're generated — cleanup, not prevention.

Execution gates that allow the action then correct it — too late. The tokens are spent and the money's gone into the void.

The evidence

Stress-tested across roughly 6.1 million inference calls on models from 1.7B to 119B parameters. Every behavioral and structural control was bypassed or allowed malicious generation, with one exception: a single generation-layer control had a provable 100% block rate. Under that control, an abliterated model's data-exfiltration success fell from 97.85% to 0%.

Whitepaper · doi.org/10.17605/OSF.IO/S9GU6

Prove me wrong.

Prompt injection is a solved problem. Tantalus lets you try to break it yourself: a realistic AI agent with files, email, and chat — plus poisoned tools. Bypass every mainstream defense in round one; face the one generation-layer control in round two.

      Try it → tantalus.io
      Book a fit-call → cybersharkconsulting.com