The one-prompt drift classifier we run on every AI draft | AI Automation & Growth Insights

June 6, 2026

Every AI writing system drifts. Not in one big break, but slowly. A model that nailed the brief in week one is, by week six, producing copy that is technically fine and quietly off. Off-topic by a degree. Off-voice by a hair. Hallucinating a statistic that sounds right. The drift is small enough that no single draft looks wrong and large enough that, in aggregate, the program rots.

We caught this the expensive way once, and now we run an AI content quality gate that flags drift before anything publishes. The whole gate is one short classifier prompt. Here is exactly what it does and why a single prompt is enough.

What drift actually looks like

Drift is not a single failure mode, so we had to name the ones we care about before we could detect them. After auditing a few hundred bad drafts we found they clustered into four kinds.

The first is topic drift: the draft wanders off the assigned subject, usually by generalizing until it could be about anything. The second is voice drift: it abandons the brand's stance and register, often sliding into the generic upbeat tone every model defaults to. The third is factual drift: it asserts a number, date, or claim that is not supported by the source material it was given. The fourth is structural drift: it falls into a formula, the listicle skeleton or the mirrored-clause habit, regardless of what the brief asked for.

We do not try to fix these inside the classifier. The classifier only detects and labels. Fixing is a separate step, because mixing detection and repair into one prompt makes both worse.

One prompt, four scores

The gate is a single call to a small, fast model. We feed it three things: the original brief, the source material the writer was given, and the draft. The system prompt asks for one job. Score this draft on four axes, each from 0 to 10, and return strict JSON with a one-line reason per axis.

The four axes map exactly to the four drift types. Topic adherence. Voice match. Factual support. Structural freshness. We deliberately ask for numeric scores rather than a yes-or-no verdict, because a threshold we control is more useful than a binary the model picks. We set the bar at 7 on every axis. Any axis under 7 fails the draft.

The reason field is the part that earns its place. A bare score tells us a draft is bad. The reason tells the next step where and why, so the rewrite is surgical. "Factual support: 4. The 32 percent figure in paragraph three does not appear in the source." That one line turns a vague rejection into a targeted fix.

We keep the prompt short on purpose. A long rubric with twenty sub-criteria actually scores worse, because the model spreads its attention thin and the scores get mushy. Four sharp axes outperform twenty fuzzy ones every time we have tested it.

Why a classifier and not the writer

The obvious question is why we do not just ask the writing model to check its own work. We tried. It does not work, for a reason that is structural rather than fixable.

A model grading its own output is biased toward approving it. It already committed to the choices in the draft, so it rationalizes them. The grades come back uniformly high and useless. Separating the roles fixes this. The classifier never saw the draft get written, has no stake in it, and is prompted only to find fault. The same model that wrote a generous self-review will, in the critic seat with a fresh context, catch its own drift.

We also run the classifier as a different, smaller model than the writer where we can. It is cheaper, it is fast, and the independence is a feature. The critic having a different blind spot than the author is exactly what we want.

Tuning the threshold

The threshold is the whole game, and we did not get it right on the first try. Set it too high and the gate rejects good drafts, the loop burns calls, and the writers fight it. Set it too low and drift slips through and the gate is theater.

We tuned it with a small labeled set. We took 100 drafts, had two people independently mark each as ship or do-not-ship, kept the ones they agreed on, and ran the classifier against that ground truth. Then we swept the threshold and looked at where the classifier's verdicts matched the humans best. Seven out of ten on every axis was the point where the gate caught nearly all the genuinely bad drafts while rejecting very few good ones. We re-run that check every few months, because the writing models change under us and a threshold tuned for one model version can drift on the next.

We also weight the axes differently in one respect. Factual support is a hard gate. A draft can be a little flat on structural freshness and still ship with a light human polish, but an unsupported statistic is a non-negotiable fail, because a wrong number in a client's content is the kind of error that costs trust and is expensive to walk back. So factual support under threshold blocks publish outright, while the other three can be overridden by a human reviewer who sees a reason to.

Where it sits in the pipeline

The gate runs after the editorial voice checks and before the human queue. The order matters. The mechanical checks, forbidden words and banned patterns, run first because they are deterministic and free. The classifier runs second because it costs a call and a fraction of a cent, and there is no reason to spend that on a draft a regex already rejected.

A draft that clears both reaches a person. A draft that fails the classifier goes back to the writer with the failing axes and their reasons as the next prompt, and the writer revises only what was flagged. We cap the loop at three rounds. Past three, it escalates to a human, because a draft that keeps drifting after three targeted rewrites usually has a flawed brief underneath it, and a person needs to look at that brief.

A real run

We added this gate to a content program that was producing about 50 pieces a week across a handful of brands. Before the gate, our human reviewers were the drift detector, and they were missing things, because reading 50 drafts a week for subtle off-voice slippage is a task humans are bad at. Our spot audits found drift in roughly one in five published pieces, mostly the factual and voice kinds that are hardest to catch by eye.

After the gate went live, the classifier caught the large majority of those before publish. The number that surprised us was factual drift specifically. The classifier flagged unsupported statistics at a rate our human reviewers had been missing almost entirely, because checking every number against the source is exactly the kind of patient cross-referencing a person skips when tired and a model does not. Published drift dropped to a level our audits now struggle to find at all.

The cost of all this is about one cheap model call per draft, a couple of cents a week at our volume against the cost of one hallucinated statistic reaching a client's audience. That trade is not close.

What the classifier does not do

It is worth being honest about the limits, because a gate you trust too much is its own failure mode. The classifier checks whether a draft holds to its brief, its voice, its sources, and a fresh structure. It does not judge whether the idea is any good. A draft can be perfectly on-brief, factually clean, and still argue something obvious or boring, and the classifier will wave it through with high marks. That judgment stays human, which is exactly why the gate exists: to clear the mechanical drift off the reviewer's plate so the scarce human attention goes to the question of whether the piece is worth publishing at all.

The classifier also cannot catch a claim that is unsupported by the wider world but is supported by a source we fed it. If the source material itself is wrong, the draft will pass factual support while still being false. We handle that one level up, by vetting the source material that goes into the brief, because no downstream check can rescue a pipeline that starts from bad inputs.

Those limits noted, the gate does the job we built it for. It turns slow, invisible drift into a number we can see and act on before publish.

The lesson we keep relearning is that AI content systems do not fail loudly. They drift. A lightweight classifier, one prompt with four axes and an honest threshold, sitting between generation and publish, is the cheapest insurance we run. We put one in front of every content pipeline we operate at arthea.ai, and we would not ship AI copy at volume without it.

Go back to Blog

Last-click attribution tells you content does not work, right when it is starting to. Here is the lagged, assisted content attribution model we run instead.

Running SMS and email orchestration as separate programs trains people to ignore both. Here is the channel-priority, frequency-cap, and cross-suppression logic we run.

Our median lead response went from 6 hours to 90 seconds. Here is the n8n automated lead routing workflow: capture, enrich, score, assign, alert, with real timings.

Architecture Notes

Occasional insights on infrastructure, conversion systems, retention architecture, and AI deployment, shared when they’re worth reading.

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.