Per-token costs are trivial. Per-outcome costs are everything.

May 7, 2026

Per-token costs are trivial. Per-outcome costs are everything. The vendor pitch deck quotes the first number because it is three orders of magnitude cheaper than a freelance copywriter. The CFO who funds the program quotes the second one, and almost nobody on the team can compute it correctly because the costs that matter live outside the model bill. This piece is for operators who have to defend an AI-content budget to a board that already heard the per-token math and was unimpressed.

Most teams hit the same wall by month three. The model spend is a rounding error. The operator-review minutes, the iteration loops, the off-brand drafts that ship anyway, and the slow trust erosion across the audience are not. None of those line items appear in the vendor invoice. All of them appear in the P&L if anyone bothers to add them up.

Why per-token cost is the wrong unit

A short social post drafted by a current-generation model costs a fraction of a cent in tokens. A freelance copywriter doing the same post costs hours of working time. The gap is roughly three orders of magnitude, which is why every AI-content vendor cites it on slide three of the deck. The number is real. The number is also useless for deciding whether to fund the program.

The number that matters is what it costs to ship a post that performs at parity with a human-written one. That bill includes operator review minutes, the iterations required to fix register, the drafts that get thrown out, and the off-brand outputs that ship anyway and quietly erode trust over a quarter. None of those line items show up in the per-token math, which is precisely why the per-token math wins the pitch and loses the renewal.

The hidden cost stack

Every AI-content workflow carries the same hidden stack. Brief construction time. Operator review per draft. Iteration cycles when the first draft misses. Voice-drift tax when an off-brand draft ships and audiences quietly reprice trust. Engineering time on the contract layers that prevent the previous two from happening. Add those five together and the cost per shipped, performant artefact is rarely cheaper than a competent freelancer; it is just predictable, and predictable is a very different thing from cheap.

The same shape outside content

The same shape shows up outside content. A Klaviyo welcome flow billed by sender CPM is cheap. A welcome flow that actually converts at parity with a hand-built one costs the same architectural work the hand-built one would, plus the deliverability hygiene the cheap version skipped. A Webflow page generated from a template is free in tooling. A Webflow page that scores 100 on Lighthouse and earns its place in the funnel costs the architectural work of a real CRO build, regardless of whether a model wrote the first draft. The unit-cost trick is the same in every category: bill the cheap part, hide the structural part, hope nobody adds them up.

Four levers that change the per-outcome math

Get these four right and a competent model is enough. Get them wrong and the most expensive vendor on the market produces middling output. Per-token cost is the wrong unit. Per-outcome cost is the unit that decides whether the program survives the next budget review.

Brief schema. The model needs the same input a human writer would need: pillar, voice, hook, key points, CTA, source. Most teams give models a thinner brief than they would give a junior copywriter and then blame the model when the copy comes back shallow. The result is an input problem wearing a model-problem costume. Generic inputs produce generic outputs across every category, regardless of which model runs the draft. The brief schema is a one-page document. It is the cheapest line item in the whole stack and the one with the largest swing on per-outcome cost.
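A minimal sketch of that one-page brief as a data structure, assuming the six fields named above. The class name, the helper, and the two-key-point threshold are illustrative assumptions, not a published schema.

```python
from dataclasses import dataclass, fields


@dataclass
class Brief:
    """One-page brief; field names follow the list in the text above."""
    pillar: str
    voice: str
    hook: str
    key_points: list[str]
    cta: str
    source: str


def missing_fields(brief: Brief, min_key_points: int = 2) -> list[str]:
    """Return the names of fields that are empty or too thin to draft from."""
    problems = []
    for f in fields(brief):
        value = getattr(brief, f.name)
        if isinstance(value, str) and not value.strip():
            problems.append(f.name)
    # min_key_points is an illustrative threshold, not a recommendation.
    if len(brief.key_points) < min_key_points:
        problems.append("key_points")
    return problems


thin = Brief(pillar="retention", voice="", hook="", key_points=[],
             cta="Book a call", source="")
print(missing_fields(thin))  # ['voice', 'hook', 'source', 'key_points']
```

Rejecting a thin brief before it reaches the model is cheaper than reviewing the shallow draft it would produce.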

Voice contract. A written specification of what your brand sounds like, enforced as a system prompt going in and a scanner running against every output. Drafts that fail the scanner get rewritten before any human sees them. This layer is what prevents the slow brand drift that sinks most AI-content programs by month three. The retention equivalent is a deliverability contract: segment-age caps, frequency caps, suppression rules, all enforced on every send before it leaves the queue. The CRO equivalent is an architecture contract: budget for layout shift, third-party tag policy, render-path discipline. Same shape, different surface.
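The scanner half of that contract can be sketched in a few lines. The three rule types below (banned phrases, a sentence-length cap, an exclamation policy) are assumptions chosen for illustration; a real contract would carry whatever rules the brand has actually written down.

```python
import re
from dataclasses import dataclass, field


@dataclass
class VoiceContract:
    """Hypothetical contract: the rule types are assumptions for illustration."""
    banned_phrases: list[str] = field(default_factory=list)
    max_sentence_words: int = 30
    allow_exclamations: bool = False


def scan(draft: str, contract: VoiceContract) -> list[str]:
    """Return a list of violations; an empty list means the draft passes."""
    violations = []
    lowered = draft.lower()
    for phrase in contract.banned_phrases:
        if phrase.lower() in lowered:
            violations.append(f"banned phrase: {phrase}")
    if not contract.allow_exclamations and "!" in draft:
        violations.append("exclamation mark")
    # Naive sentence split on terminal punctuation; good enough for a sketch.
    for sentence in re.split(r"[.!?]+\s*", draft):
        if len(sentence.split()) > contract.max_sentence_words:
            violations.append(f"sentence over {contract.max_sentence_words} words")
    return violations


contract = VoiceContract(banned_phrases=["game-changing"])
print(scan("This is a game-changing launch!", contract))
```

A draft with a non-empty violation list goes back for a rewrite before it reaches an operator, which is where the review minutes get saved.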

Observability. Logs structured around outcomes. When a draft is bad, you need to know in seconds whether the brief was thin, whether the voice gate caught it, or whether the model regressed. Generic LLM logging tells you which API call returned what; it does not tell you which brand decision broke. The same shape carries over to retention (revenue per inbox-placed send) and CRO (conversion per device-class per window). If your dashboard cannot answer the question "which lever moved this number" inside thirty seconds, the dashboard is decoration.
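One way to structure a log around outcomes rather than API calls is sketched below. Every field name is hypothetical, and the model check is only a proxy: it flags that the model version changed from a baseline, not that it demonstrably regressed.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DraftOutcome:
    """One log record per draft; all field names are hypothetical."""
    draft_id: str
    brief_missing_fields: list[str]  # output of a brief check
    voice_violations: list[str]      # output of a voice scanner
    model_version: str
    baseline_model_version: str
    shipped: bool


def attribute_lever(o: DraftOutcome) -> str:
    """Name the first lever that plausibly explains a bad draft."""
    if o.brief_missing_fields:
        return "brief"
    if o.voice_violations:
        return "voice_contract"
    if o.model_version != o.baseline_model_version:
        return "model_change"
    return "unattributed"


def to_log_line(o: DraftOutcome) -> str:
    """Serialise the record plus its attributed lever as one JSON line."""
    return json.dumps({**asdict(o), "lever": attribute_lever(o)}, sort_keys=True)


record = DraftOutcome("d-001", ["hook"], [], "m-2", "m-2", shipped=False)
print(to_log_line(record))
```

Filtering these lines by the `lever` key is what lets a dashboard answer "which lever moved this number" without anyone reading raw API logs.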

Iteration discipline. Programs that succeed iterate the brief. Programs that fail iterate the model: switching providers, escalating to a larger one, commissioning custom fine-tunes, and never returning to the brief that produced the bad draft in the first place. Provider-shopping is the single most expensive failure mode in this category because every switch resets the contract layers and burns another quarter of operator time on integration work that does not move the per-outcome number.

A worked illustration of per-outcome cost

Take a brand publishing five posts a week. Per-token cost on a competent model is a few cents per post, call it a couple of dollars a month at the API line. That is the number on the vendor deck. Now layer in the real bill, framed as illustrative math rather than a case study.

Operator review at fifteen minutes per draft, twenty drafts a month, at a senior content rate. Iteration cycles where roughly one in three drafts needs a second pass, doubling the review time on those. Brief construction at a couple of hours a month for the program lead. Engineering time on the voice scanner and brief schema, amortised across the year. Drift tax: the percentage of drafts that ship slightly off-brand and pull the trust line down, recovered through a quarterly rebrief. Add those up and the per-outcome cost per shipped post is on the order of a competent freelancer, not three orders of magnitude lower.

The point of the illustration is not the specific dollars. The point is that the per-token line is not even in the top five cost drivers, and any business case built on it is built on the wrong number. The teams that survive the second budget review are the ones whose case sits on per-outcome math from day one.
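The same illustration as a short script. The review minutes, draft count, one-in-three second-pass share, and two brief hours come from the paragraph above; every hourly rate, the engineering amortisation, and the rebrief hours are placeholder assumptions to be replaced with your own figures.

```python
# From the illustration above.
DRAFTS_PER_MONTH = 20
REVIEW_MIN_PER_DRAFT = 15
SECOND_PASS_SHARE = 1 / 3        # one in three drafts needs a second pass
BRIEF_HOURS_PER_MONTH = 2.0
MODEL_BILL_PER_MONTH = 2.0       # "a couple of dollars a month"

# Placeholder assumptions, not figures from the article.
SENIOR_RATE_PER_HOUR = 80.0
LEAD_RATE_PER_HOUR = 100.0
ENGINEERING_PER_MONTH = 250.0    # scanner + schema, amortised over a year
REBRIEF_HOURS_PER_QUARTER = 6.0  # drift recovery


def monthly_cost_stack() -> dict[str, float]:
    """Cost per line item for one month of output."""
    first_pass_hours = DRAFTS_PER_MONTH * REVIEW_MIN_PER_DRAFT / 60
    return {
        "model_bill": MODEL_BILL_PER_MONTH,
        "operator_review": first_pass_hours * SENIOR_RATE_PER_HOUR,
        "iteration": first_pass_hours * SECOND_PASS_SHARE * SENIOR_RATE_PER_HOUR,
        "brief_construction": BRIEF_HOURS_PER_MONTH * LEAD_RATE_PER_HOUR,
        "engineering": ENGINEERING_PER_MONTH,
        "drift_rebrief": REBRIEF_HOURS_PER_QUARTER / 3 * LEAD_RATE_PER_HOUR,
    }


stack = monthly_cost_stack()
total = sum(stack.values())
# Simplification: assumes every draft ships and meets the bar.
per_post = total / DRAFTS_PER_MONTH
ranked = sorted(stack.items(), key=lambda kv: kv[1], reverse=True)
for name, cost in ranked:
    print(f"{name:<20}{cost:>10.2f}{cost / total:>8.1%}")
print(f"per shipped post: {per_post:.2f}")
```

Under these placeholders the model bill ranks last of the six lines. Change the rates freely; the ordering is hard to reverse without assuming away the review step.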

A second illustration to make the shape concrete on the retention side. A welcome flow runs across a list and the per-send cost is fractions of a cent. The flow either places in the inbox at parity with a hand-architected one, or it does not. If it does not, the cost stack absorbs the deliverability remediation work, the suppression-list cleanup, the segment-age remapping, and the slow revenue underperformance until the issue is caught. None of that appears on the platform invoice. All of it shows up in the share-of-revenue number when the quarterly review compares the program to the published 25 to 40 percent band on /email-and-sms. Per-send cost is the cheap line. Per-revenue-point cost is the line that decides whether the program is fundable.

Runbook: how to compute your real per-outcome cost

A six-step process you can run in a single afternoon, with a competent finance partner in the room.

1. Pick one shipped artefact class and one month of output. Posts, emails, landing-page variants, ad copy. One class. One month. Resist the urge to model the whole program before you have modelled one slice.
2. Count actual shipped, performant artefacts. Not drafts produced. Not drafts queued. Artefacts that shipped and met the bar. This is the denominator. If you cannot define "met the bar," the program does not have a quality contract and that is finding number one.
3. Sum the model bill for that month. Easy. This is the line everyone has.
4. Sum the operator-review minutes. Pull from calendar, from review tool history, from a two-week sample if you have to. Multiply by loaded cost.
5. Sum the iteration tax. Drafts that needed a second pass, drafts that got thrown out entirely, drafts that shipped off-brand and triggered downstream rework. The honest version of this number is uncomfortable. That is the point.
6. Divide total cost by performant artefacts shipped. That is your per-outcome cost. Compare it to a freelancer rate or an in-house writer rate, both fully loaded. Now you know whether the program is winning, breaking even, or quietly losing.

Run the same six steps a quarter later. The number should drop. If it does not, the contract layers are not getting the investment they need, and the next quarter will be worse.
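The six steps and the quarter-over-quarter check reduce to two small functions. This is a sketch that follows the runbook as written; the class and field names are hypothetical, and the sample figures in the usage lines are invented purely to exercise the code.

```python
from dataclasses import dataclass


@dataclass
class MonthSample:
    """One artefact class, one month (step 1). Field names are hypothetical."""
    performant_shipped: int      # step 2: the denominator
    model_bill: float            # step 3
    review_minutes: float        # step 4
    loaded_rate_per_hour: float  # step 4
    iteration_cost: float        # step 5: second passes, discards, rework


def per_outcome_cost(s: MonthSample) -> float:
    """Step 6: total cost divided by performant artefacts shipped."""
    if s.performant_shipped <= 0:
        raise ValueError("no performant artefacts shipped; define the quality bar first")
    review_cost = s.review_minutes / 60 * s.loaded_rate_per_hour
    return (s.model_bill + review_cost + s.iteration_cost) / s.performant_shipped


def trend(previous: MonthSample, current: MonthSample) -> float:
    """Relative change in per-outcome cost; negative means it dropped."""
    before = per_outcome_cost(previous)
    return (per_outcome_cost(current) - before) / before


# Invented sample values, for illustration only.
q1 = MonthSample(16, 2.0, 400, 90.0, 300.0)
q2 = MonthSample(20, 2.0, 300, 90.0, 150.0)
print(f"{per_outcome_cost(q1):.2f} -> {per_outcome_cost(q2):.2f} ({trend(q1, q2):+.0%})")
```

If brief construction and amortised contract-layer engineering are tracked separately, add them to the numerator; the division in step 6 stays the same.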

A note on what to share with the board. The headline number is the per-outcome cost trend, quarter over quarter, on the same artefact class. Underneath it sit the four lever investments (brief schema maturity, voice contract coverage, observability depth, iteration discipline) each scored qualitatively. The line nobody on the board needs is the per-token spend, because it is a rounding error in the actual cost stack and putting it on the slide invites the wrong conversation. Operators who put per-outcome at the top and per-token in an appendix get the renewal. Operators who do the opposite get a quarter to defend the program and another quarter to wind it down.

When per-token cost is actually the right unit

There is a narrow case where per-token economics dominate and the contract layers do not pay back. High-volume, low-stakes utility text where nobody downstream cares about voice. Internal summarisation pipelines. Bulk classification. Translation drafts that a human will fully rewrite anyway. In those cases, the model bill is most of the cost, the operator-review layer is thin or absent, and the per-token number is the honest number.

The mistake is generalising from those cases to brand-facing content. Brand-facing content is the opposite shape: low volume, high stakes, voice-load-bearing, and audiences that reprice trust on every drift signal. The contract layers are not optional there. They are the entire program. Treating brand-facing content like internal summarisation is the most common way operators end up paying full price for a platform that delivers freelancer-grade output.

What success looks like at the program level

On retention work shipped through /email-and-sms, brands on the 90-day Retention Architecture typically see 25 to 40 percent of total revenue driven by Klaviyo within three months, with the first measurable uplift inside ten to fourteen days. The aggregate published revenue across the program is +5M EUR. Per-week or per-client deltas stay on the brand side of the wall.

On CRO engagements through /websites-cro, the published outcome is 100 on Lighthouse and a 20 to 40 percent conversion uplift band on the Architecture Build. The marginal cost of the next test inside that engagement is the brief, not the model, not the engineering, not the design. That is what good per-outcome economics look like in practice: the contract layers are paid for once, and every subsequent artefact rides on top of them.

On AI Lab work, the program-level aggregates we publish are 20 percent retention lift, 60 percent operator time saved, and a doubling of YoY revenue across operations we have rebuilt. The pattern is consistent: outcomes at the program level, no fabricated weekly numbers, no vanity per-token bragging.

The qualitative bar across all three is the same. Outputs read on-brand without operator rewrite. Drift is caught before audiences see it. The team spends their hours on briefs and decisions, not on cleaning up drafts. When that shape holds, per-outcome cost trends down quarter over quarter even as artefact volume scales up. That is the curve that makes the program defensible.

A second-order signal worth tracking: the conversation in the operations review. A program winning on per-outcome cost has reviews that focus on which brief field needs sharpening, which voice contract clause caught the most drift this month, and which observability dashboard most often answers the team question. A program losing on per-outcome cost has reviews that focus on which model to switch to, which vendor pitch came in this week, and whether to commission a custom fine-tune. The vocabulary shift is the leading indicator. By the time the per-outcome number itself moves, the conversation shift has already been visible for a quarter.

Where the math flips

The ratios above only land at scale. A content drafter for a brand publishing once a week, a retention agent for a small list, a CRO test stack for a page receiving a trickle of traffic: none of these exercise the contract layer often enough to amortise it. Below the threshold, the structural work does not pay back, and the honest answer is to either run the program manually for another quarter or postpone it until volume justifies the build.

This is why our engagement threshold on /websites-cro starts at 30 to 50 thousand EUR per month in brand revenue, and we say so on the page. Below that line the per-outcome math does not work and we do not pretend it does.
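The amortisation argument can be made concrete as a break-even volume. The sketch below assumes the contract layers are a fixed monthly cost and that the program saves a constant amount per artefact against doing the work manually. Both are simplifications, and the function and parameter names are hypothetical.

```python
import math


def break_even_volume(fixed_monthly: float, manual_per_artefact: float,
                      program_variable_per_artefact: float) -> float:
    """Artefacts per month at which the contract layers pay for themselves.

    Returns math.inf when the program saves nothing per artefact.
    """
    saving = manual_per_artefact - program_variable_per_artefact
    if saving <= 0:
        return math.inf
    return fixed_monthly / saving


# Invented inputs: 900 fixed per month, 120 manual vs 30 variable per artefact.
print(break_even_volume(900.0, 120.0, 30.0))  # 10.0
```

A brand shipping fewer artefacts per month than this figure is below the threshold described above, whatever the per-token price.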

FAQ

What is per-outcome cost in AI content? Per-outcome cost is the total cost to ship one performant, on-brand artefact: model bill plus brief construction plus operator review plus iteration tax plus the contract-layer engineering amortised across the program. It is the only AI-content cost number that survives a CFO review.

Why do per-token comparisons mislead operators? Per-token cost is three orders of magnitude lower than freelance writing, which makes the pitch easy and the program brittle. The model bill is rarely in the top five cost drivers of a brand-facing content program, so optimising it does not move the number that matters.

How often should we recompute per-outcome cost? Once a quarter on a one-month sample, using the six-step runbook above. The number should trend down as the contract layers compound. If it stays flat for two quarters, the contract layers are underfunded and the program is one staffing change away from regression.

Can a smaller, cheaper model close the per-outcome gap? Almost never on its own. Model size is rarely the binding constraint once you have a competent baseline. Brief schema, voice contract, observability, and iteration discipline are. A smaller model with strong contracts beats a larger model with weak ones on per-outcome cost almost every time.

Is the per-token number ever the right unit? Yes, in narrow cases: high-volume internal summarisation, bulk classification, translation drafts that a human will fully rewrite anyway. In those, the operator-review layer is thin and the model bill dominates. Brand-facing content is the opposite shape and should never be costed that way.

- https://www.arthea.ai/article/prompt-engineering-misallocation
- https://www.arthea.ai/email-and-sms
- https://www.arthea.ai/ai-lab

If you want a 30-minute architecture review on per-outcome cost for your content program, the calendar is here: arthea.ai/book.
