The guardrail stack that stops an AI agent shipping off-brand work

June 6, 2026

The day an agent went off-brand at scale

A prompt tweak we thought was harmless changed the tone of every social draft our content agent produced overnight. The next morning there were 40 drafts in the queue that read like a generic SaaS landing page. Hype words we never use. Em-dashes everywhere. The two-clause mirror sentences that make AI writing obvious from a mile away.

None of it shipped, because we caught it in review. But the lesson was clear. A single human review pass is not a guardrail. It is a bottleneck that fails the moment volume rises or attention slips. So we built AI agent guardrails as a stack, four layers deep, where no single failure can put off-brand work in front of a client.

Layer one: input context

The cheapest place to stop bad output is before the agent generates anything. Most off-brand work is not the model misbehaving. It is the model never being told what on-brand means.

So every content agent receives a context bundle on every invocation, not just at setup. The bundle carries the brand voice rules as hard constraints, the forbidden-word list, the formatting conventions, and a handful of approved reference artifacts. We serve this from a single endpoint so there is one source of truth, and we version it. When the voice rules change, every agent picks up the new version on its next run.

The key design choice is that the constraints are concrete rather than vague. We do not tell the agent to write in a confident tone. We give it an explicit list of forbidden words, an explicit ban on em-dashes, an explicit cap on sentence patterns. An agent can comply with a rule it can check. It cannot comply with a feeling.

Versioning the bundle is what saved us from the original incident repeating. When that bad prompt change went out, it edited the agent's instructions directly, bypassing the shared voice rules. Now the voice rules live outside any individual agent's prompt, served from one endpoint, and an agent cannot edit them by accident. A prompt change can still make an agent worse at following the rules, but it cannot change what the rules are. That separation between the agent's task and the brand's constraints is the whole point of layer one.
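As a rough illustration of that separation, the sketch below composes a prompt from two sources: a frozen, versioned bundle and the agent's own task text. Every name here (ContextBundle, fetch_bundle, build_prompt) and every value in the bundle is hypothetical, and fetch_bundle is a local stand-in for the call to the shared endpoint, not a description of our actual service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBundle:
    """Brand constraints shared by every agent. Frozen so agent code cannot edit it in place."""
    version: str
    forbidden_words: tuple
    formatting_rules: tuple
    reference_artifacts: tuple


def fetch_bundle() -> ContextBundle:
    # Stand-in for a request to the single voice-rules endpoint (hypothetical values).
    return ContextBundle(
        version="2026-06-01",
        forbidden_words=("revolutionary", "game-changing"),
        formatting_rules=("No em-dashes.", "No two-clause mirror sentences."),
        reference_artifacts=("Approved post A", "Approved post B"),
    )


def build_prompt(task_instructions: str) -> str:
    """Compose the prompt so the brand rules come from the bundle, never from the task text."""
    bundle = fetch_bundle()  # fetched on every invocation, not cached at setup
    lines = [
        f"Brand constraints (bundle version {bundle.version}):",
        "Never use these words: " + ", ".join(bundle.forbidden_words),
    ]
    lines += [f"- {rule}" for rule in bundle.formatting_rules]
    lines.append("Approved reference examples:")
    lines += [f"- {ref}" for ref in bundle.reference_artifacts]
    lines += ["", "Task:", task_instructions]
    return "\n".join(lines)
```

Editing the task text changes only what follows "Task:". The constraint block is rebuilt from the bundle on every call, so a prompt edit has no path to the rules themselves.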

Layer two: the voice gate

Input context reduces bad output. It does not eliminate it, because models drift and prompts get edited. So the second layer is a deterministic voice gate that runs on every generated artifact before it can move forward.

This layer is not an LLM. It is plain code, because the rules that matter most are mechanical and a regex never has an off day. The gate scans for forbidden words, for em-dashes, for the AI-formula sentence patterns, for the wrong domain, for length outside the target range. Any hit fails the artifact and bounces it back to the agent with the specific violations listed, so the next attempt is a targeted fix rather than a blind retry.

The gate that would have caught our 40-draft incident is about 30 lines of code. It runs in milliseconds and it has zero tolerance. We would rather an agent retry three times than ship one draft with a banned word in it.
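A gate of this kind can be sketched in a few lines. This is not our production gate: the word list, the allowed domain, the length range, and the regex that approximates a two-clause mirror sentence are all placeholder assumptions chosen for illustration.

```python
import re

# All values below are hypothetical placeholders.
FORBIDDEN_WORDS = ("revolutionary", "game-changing", "unlock")
ALLOWED_DOMAIN = "example.com"
MIN_CHARS, MAX_CHARS = 80, 280

# Rough approximation of the "it's not X, it's Y" mirror sentence.
MIRROR_PATTERN = re.compile(
    r"\b(?:it'?s|this is|that'?s) not (?:just |only )?[^.,;]+[,;] (?:it'?s|but)\b",
    re.IGNORECASE,
)
URL_HOST = re.compile(r"https?://([^/\s]+)")


def voice_gate(text: str) -> list[str]:
    """Return every mechanical violation found; an empty list means the draft passes."""
    violations = []
    for word in FORBIDDEN_WORDS:
        if re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE):
            violations.append(f"forbidden word: {word}")
    if "\u2014" in text:  # U+2014 is the em-dash
        violations.append("em-dash present")
    if MIRROR_PATTERN.search(text):
        violations.append("mirror sentence pattern")
    for host in URL_HOST.findall(text):
        host = host.lower().rstrip(".,;:)")
        if host != ALLOWED_DOMAIN and not host.endswith("." + ALLOWED_DOMAIN):
            violations.append(f"wrong domain: {host}")
    if not MIN_CHARS <= len(text) <= MAX_CHARS:
        violations.append(f"length {len(text)} outside {MIN_CHARS}-{MAX_CHARS}")
    return violations
```

The function collects every violation instead of stopping at the first one, which is what lets the retry be a targeted fix: the agent gets the full list back in one pass.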

Layer three: output validation

Mechanical rules catch mechanical violations. They do not catch a draft that is on-brand word by word and wrong as a whole. A factually incorrect claim. A broken link. A post about the wrong product. A reply that misreads the inbound entirely.

So the third layer validates the output against its own job definition. For a social draft, that means the link resolves, the article it references exists, the claim is supported by the source body, and the post matches the campaign it was assigned to. For a cold-reply classification, it means the tag is one of the allowed values and the confidence clears a threshold before it triggers a deal handoff.
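The classification check is the simplest of these to show. A minimal sketch, assuming a hypothetical tag set and a hypothetical threshold of 0.85, with only the positive tag able to trigger a handoff:

```python
# Hypothetical tag set and threshold, for illustration only.
ALLOWED_TAGS = {"positive", "negative", "neutral", "out_of_office"}
HANDOFF_THRESHOLD = 0.85


def route_classification(tag: str, confidence: float) -> str:
    """Decide what a cold-reply classification is allowed to trigger."""
    if tag not in ALLOWED_TAGS:
        return "reject"  # the output failed its own job definition
    if tag != "positive":
        return "record"  # stored, but nothing downstream fires
    if confidence >= HANDOFF_THRESHOLD:
        return "handoff"
    return "human_review"  # borderline positives go to a person
```

An unknown tag is rejected outright, whatever confidence the model reports, because a confident answer outside the allowed values is still a malformed answer.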

This layer can use a model, because the judgment is semantic. We run a second agent as the validator, with a narrow prompt: here is the artifact, here is the job it was supposed to do, list every way it fails. Using a separate agent matters. The agent that generated the work is the worst judge of whether it is correct.

We keep the validator's scope deliberately narrow. It does not improve the work or rewrite it. It only judges, and it returns a structured verdict: pass, or fail with a list of specific defects. A validator that tries to fix things blurs the line between generation and checking, and a checker you cannot trust to be impartial is not a check at all. When it fails an artifact, the defects route back to the generating agent as a targeted revision request, just as voice gate violations do, so the loop stays tight and the agent is told exactly what to fix.
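One way to hold a validator to that contract is to parse its reply strictly. The sketch below assumes the validator is asked to answer in JSON with a "verdict" field and a "defects" list. That reply format, the Verdict type, and the choice to treat any malformed reply as a failure are assumptions for illustration, not a description of a particular model API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Verdict:
    passed: bool
    defects: list = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse the validator's reply. Anything malformed counts as a failure, never a pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Verdict(False, ["validator returned non-JSON output"])
    if not isinstance(data, dict):
        return Verdict(False, ["validator verdict was malformed"])
    status = data.get("verdict")
    defects = data.get("defects", [])
    if status == "pass" and defects == []:
        return Verdict(True)
    if status == "fail" and isinstance(defects, list) and defects:
        return Verdict(False, [str(d) for d in defects])
    return Verdict(False, ["validator verdict was malformed"])


def revision_request(verdict: Verdict) -> str:
    """Turn a failed verdict into the targeted fix list sent back to the generating agent."""
    return "Fix only the following defects:\n" + "\n".join(f"- {d}" for d in verdict.defects)
```

A reply that says "pass" while also listing defects is treated as malformed. The validator gets no room to hedge, and the parser has no field that could carry a rewritten draft.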

There is a cost tradeoff here, which we account for honestly. Running a second model on every artifact roughly doubles the token spend on the generation step. We accept it for client-facing work where a wrong artifact is expensive, and we skip it for low-stakes internal artifacts where the first two layers are enough. The decision of where to run layer three is itself a unit-economics question, and we make it per artifact type rather than blanket-applying it.

Layer four: human-in-the-loop for irreversible actions

The first three layers let work flow without a human. The fourth layer deliberately stops it. The rule is simple and it is the one we never bend. Any action that cannot be undone, or that encodes taste we have not delegated, requires a human to approve before it executes.

Publishing a post is irreversible, so final publish is gated on operator validation. Sending an invoice over a dollar threshold is gated on sign-off. Converting a lead into a client, deleting a record, dispatching money. All of these stop at a human queue, which is exactly the needs-input bucket from our daily standup.

Everything reversible flows freely through the first three layers. Drafting, classifying, enriching, scheduling into a holding state. The human is reserved for the small set of actions where a mistake is expensive and permanent. That is the whole philosophy of the stack. Automate aggressively where errors are cheap and recoverable, gate hard where they are not.

Drawing the reversible line precisely is the work. Scheduling a post into a holding state is reversible, so it flows. The final publish is not, so it stops. Drafting an invoice is reversible. Sending it is not. Classifying a reply as positive is reversible. The deal handoff it can trigger touches the pipeline, so we gate that on a confidence threshold and route the borderline cases to a human. The skill is in finding the last reversible step in each chain and putting the gate immediately after it, so the maximum amount of work happens autonomously and the human only ever sees the final, irreversible decision.

We also keep the human gate cheap to operate, because a gate that is painful to clear becomes a backlog, and a backlog tempts people to remove the gate. The pending decisions land in a single queue, each with the context to decide in seconds and a one-click action to approve or reject. That queue is the same needs-input bucket the operator already reviews in the daily standup, so the human gate adds no new surface to monitor. It folds into a routine that already exists.

Why four layers instead of one

A single check, however good, is a single point of failure. The four layers exist because they fail in different ways and catch each other's misses. Input context fails when a prompt edit makes an agent worse at following the rules. The voice gate catches the result, but only when it shows up as a mechanical violation. Output validation catches the semantic errors the gate cannot see. And the human gate catches the rare case where everything automated passed and the work is still wrong in a way only judgment detects.
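Put together, the layers form a short loop. The sketch below is an assumption about how they could be wired, not our orchestration code: generate, voice_gate, and validate are injected callables, the retry limit of three is illustrative, and the function names and status strings are hypothetical.

```python
from typing import Callable


def run_pipeline(
    generate: Callable[[list[str]], str],
    voice_gate: Callable[[str], list[str]],
    validate: Callable[[str], list[str]],
    max_attempts: int = 3,
) -> dict:
    """Generate, gate, validate, and retry with feedback. Success still ends at a human queue."""
    feedback: list[str] = []
    for attempt in range(1, max_attempts + 1):
        draft = generate(feedback)      # layer one context is assumed to live inside generate
        feedback = voice_gate(draft)    # layer two: deterministic checks
        if feedback:
            continue
        feedback = validate(draft)      # layer three: semantic checks
        if feedback:
            continue
        # Layer four: passing every automated check only earns a place in the approval queue.
        return {"status": "needs_approval", "draft": draft, "attempts": attempt}
    return {"status": "escalated", "defects": feedback, "attempts": max_attempts}
```

The cheap deterministic check runs before the model-based one, so a draft with a banned word never costs a validator call. A draft that exhausts its retries is escalated with its last defect list and is never passed through.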

Since we shipped the full stack, no off-brand draft has reached a client, across thousands of generated artifacts. The 40-draft incident that started this would now be caught at layer two, in milliseconds, before a human ever opened the queue.

If you are relying on a person to catch your agents' mistakes, you have one layer and it is the one most likely to fail under load. Put the mechanical rules in code, validate output against its own job, and reserve the human for the actions you genuinely cannot take back. Then the agents can move fast, and the brand stays intact while they do it.
