Idempotency keys: why every automation needs them

June 6, 2026

The duplicate that cost a client three refunds

A payment webhook fired, the network blinked, the sender retried, and a client's customer got charged twice. The automation did exactly what it was told. It received a valid payment event and created a charge. It just received the same event twice, and nothing in the flow knew that. We issued three refunds that week before we traced it, and then we did what we should have done at the start. We added idempotency keys.

Idempotency is a simple promise: running the same operation twice produces the same result as running it once. The second call is a no-op. An idempotency key is the identifier that lets the receiver recognize that second call for what it is. Once you build for that, retries stop being scary, webhooks can fire as many times as they like, and a replayed dead-letter item cannot double anything. Idempotency is the foundation that makes retry and replay safe, which is why we treat it as a requirement and not a nicety.

Why retries make duplicates inevitable

Nearly every system that delivers messages reliably over a network delivers them at least once, not exactly once. Stripe retries webhooks. n8n retries failed nodes. Queues redeliver when a consumer dies mid-process. Browsers re-fire form submissions when someone double-taps. None of these are bugs. At-least-once delivery is the honest default, because the alternative, guaranteeing exactly-once across an unreliable network, is effectively impossible. The sender cannot know whether the receiver got the message or whether the acknowledgment got lost on the way back, so it retries to be safe.

That means the receiver, your automation, is the only place duplicate protection can live. You cannot ask the network to stop sending twice. You have to make the second arrival harmless.

Designing the key

The key has to be stable for the same logical operation and different for genuinely different ones. Getting this wrong in either direction breaks things. Too unique and every retry looks new, so you get duplicates anyway. Not unique enough and two real operations collide, so you silently drop a legitimate one.

We derive the key from the meaning of the event, not from when it arrived. For a payment, the key is the payment provider's event id, which is identical across every retry of that event. For a form submission, we hash the form id, the submitter email, and a one-minute time bucket, so a frantic double-tap collapses but a genuine resubmission an hour later goes through. For a generated email, the key is the recipient plus the campaign id plus the send date. When the upstream system gives us a real unique id, we use it directly. When it does not, we build a deterministic hash from the fields that define sameness. We never use a random UUID generated at receipt time, because a random value is different on every retry and defeats the entire purpose.
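
A minimal sketch of those three derivations, in Python. The function names, the `payment:` / `form:` / `email:` prefixes, the SHA-256 choice, and the email normalization are all assumptions made for illustration, not details from our production flows.

```python
import hashlib
from datetime import datetime, timezone


def _digest(*parts: str) -> str:
    """Hash the fields that define sameness into one stable hex string."""
    # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def payment_key(provider_event_id: str) -> str:
    # The provider's event id is identical across retries, so use it directly.
    return f"payment:{provider_event_id}"


def form_key(form_id: str, email: str, submitted_at: datetime) -> str:
    # One-minute bucket: whole minutes since the Unix epoch.
    bucket = int(submitted_at.timestamp()) // 60
    # Assumption: emails are compared case-insensitively and without padding.
    return "form:" + _digest(form_id, email.strip().lower(), str(bucket))


def email_key(recipient: str, campaign_id: str, send_date: str) -> str:
    return "email:" + _digest(recipient.strip().lower(), campaign_id, send_date)


tap_1 = datetime(2026, 6, 6, 12, 0, 5, tzinfo=timezone.utc)
tap_2 = datetime(2026, 6, 6, 12, 0, 40, tzinfo=timezone.utc)  # same minute
later = datetime(2026, 6, 6, 13, 0, 5, tzinfo=timezone.utc)   # an hour on

same = form_key("contact", "Ana@example.com", tap_1) == form_key("contact", "ana@example.com", tap_2)
different = form_key("contact", "ana@example.com", tap_1) != form_key("contact", "ana@example.com", later)
print(same, different)
```

One limit of a fixed bucket is worth knowing: two taps that straddle a minute boundary land in different buckets and will not collapse.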

Where the key lives and how the check works

The key needs a durable store that every retry can read. We use a Postgres table, processed_keys, with the key as a unique primary key and a created_at timestamp. The flow does one thing before any side effect: it tries to insert the key. If the insert succeeds, this is the first time we have seen this operation, and we proceed. If the insert fails on the unique constraint, we have processed this already, and we stop and return the stored result.

The order matters enormously. Checking for the key and recording it cannot be two separate steps with a gap between them, or two concurrent retries can both pass the check before either writes. We rely on the database's unique constraint as the atomic gate. The insert is the lock. For operations that must return the original result on a replay, not just skip, we store the result alongside the key so the duplicate call gets the same answer the first call produced. That is what makes a replayed dead-letter item truly safe: it returns the original outcome instead of redoing the work.
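
The insert-first gate with a stored result can be sketched as follows. This uses Python's built-in sqlite3 as a stand-in for Postgres so the example runs anywhere. The `result` column, the `run_once` helper, and the JSON encoding are assumptions for illustration. The sketch does not handle a crash between the insert and the side effect, which a production flow would need to decide on.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_store() -> sqlite3.Connection:
    # Autocommit mode: each statement commits on its own, so the key is
    # durably claimed before the side effect starts.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        """
        CREATE TABLE processed_keys (
            key        TEXT PRIMARY KEY,  -- the unique constraint is the gate
            result     TEXT,              -- hypothetical column for replays
            created_at TEXT NOT NULL
        )
        """
    )
    return conn


def run_once(conn: sqlite3.Connection, key: str, side_effect):
    """Run side_effect at most once per key; duplicates get the stored result."""
    now = datetime.now(timezone.utc).isoformat()
    try:
        conn.execute(
            "INSERT INTO processed_keys (key, created_at) VALUES (?, ?)",
            (key, now),
        )
    except sqlite3.IntegrityError:
        # Key already claimed: skip the work and return what the first call stored.
        row = conn.execute(
            "SELECT result FROM processed_keys WHERE key = ?", (key,)
        ).fetchone()
        # The result is None if the first call has not finished storing it yet.
        return json.loads(row[0]) if row and row[0] is not None else None
    result = side_effect()
    conn.execute(
        "UPDATE processed_keys SET result = ? WHERE key = ?",
        (json.dumps(result), key),
    )
    return result


conn = open_store()
charges = []


def charge():
    charges.append("ch_1")  # stands in for the real payment call
    return {"charge_id": "ch_1"}


first = run_once(conn, "payment:evt_123", charge)
second = run_once(conn, "payment:evt_123", charge)  # the retry
print(first == second, len(charges))
```

The retry gets the same answer as the first call, and the charge function runs only once.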

We also expire keys on a schedule. A processed_keys table that grows forever becomes a problem of its own. We keep payment and order keys for 90 days and ephemeral keys, like form-submission buckets, for 7 days, then prune. The retention has to outlast the longest realistic retry window of every upstream sender, which for most webhook providers is around three days. For webhook-driven keys we set the floor at three times that, about nine days, to be safe, then size the rest by how long we might plausibly need to debug or replay a given operation. We might need to investigate a payment weeks later, so its key lives for a quarter. Form-submission keys guard against browser double-taps rather than provider retries, and a double-tap guard on a contact form is useless after a minute, so its row can vanish in a week.

One detail that bites teams: the prune job itself must be idempotent and safe to run concurrently. We have seen a cleanup cron overlap with itself on a slow night and try to delete rows another instance was mid-reading. We gate the prune behind a single advisory lock so only one instance runs it at a time, and it deletes in small batches rather than one giant statement that locks the table. The same discipline that protects the writes protects the housekeeping.

The concurrency trap that catches most teams

The failure that survives a naive idempotency check is two retries arriving at the same instant. Picture the check written as two steps: first query whether the key exists, then, if it does not, do the work and record the key. Two concurrent retries both run the query, both see no key, both proceed, and you have your duplicate back. The window is milliseconds wide, but at any real volume those milliseconds get hit.

This is exactly why we insert first and let the database reject the second writer. The unique constraint is enforced atomically inside the database engine, so even if two requests insert at literally the same moment, one wins and the other gets a constraint violation. There is no window. We catch that violation, recognize it as a duplicate, and return the stored result instead of erroring. The lesson we paid for: never implement idempotency as check-then-act in application code. Let the database be the gate, because it is the only component that can make the decision atomically under concurrency.
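
To make the race concrete, here is a small simulation. It does not use real threads. It forces the unlucky interleaving by running both checks before either write, then compares that with insert-first. The function names are made up for the example, and sqlite3 again stands in for Postgres. The real guarantee under true parallelism comes from the database's constraint, which this sketch only imitates in sequence.

```python
import sqlite3


def new_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE processed_keys (key TEXT PRIMARY KEY)")
    return conn


def check_then_act(conn: sqlite3.Connection, key: str, n_requests: int = 2) -> int:
    """Naive pattern with the racy ordering forced: all checks, then all acts."""
    # Step 1: every request runs its SELECT before anyone has written the key.
    saw_nothing = [
        conn.execute(
            "SELECT 1 FROM processed_keys WHERE key = ?", (key,)
        ).fetchone() is None
        for _ in range(n_requests)
    ]
    # Step 2: every request that saw no key goes ahead with the work.
    side_effects = 0
    for clear in saw_nothing:
        if clear:
            side_effects += 1  # stands in for "create the charge"
            conn.execute(
                "INSERT OR IGNORE INTO processed_keys (key) VALUES (?)", (key,)
            )
    return side_effects


def insert_first(conn: sqlite3.Connection, key: str, n_requests: int = 2) -> int:
    """Insert-first pattern: only the request whose INSERT succeeds does the work."""
    side_effects = 0
    for _ in range(n_requests):
        try:
            conn.execute("INSERT INTO processed_keys (key) VALUES (?)", (key,))
        except sqlite3.IntegrityError:
            continue  # duplicate: the constraint rejected this writer
        side_effects += 1
    return side_effects


naive = check_then_act(new_store(), "payment:evt_123")
gated = insert_first(new_store(), "payment:evt_123")
print(naive, gated)
```

The naive version performs the side effect twice, and the gated version performs it once.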

Idempotency on the sending side too

Most of this is about the receiver, but there is a sending-side discipline that pairs with it. Many external APIs support idempotency keys: Stripe and several other payment processors accept an Idempotency-Key header. When our automations call one of them, we generate a deterministic key for the call and pass it along. That way, if our own retry logic fires the same outbound request twice, the receiving service dedupes it for us. We send the header on every write call even to services we are not sure honor it, because it costs nothing and it makes us safe the day they add support. The system as a whole is only as safe as its weakest write, so we make every write idempotent in both directions.
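
One way to build a deterministic outbound key, sketched in Python. The helper names and the choice to hash the operation name plus a canonical JSON payload are assumptions. The payload must carry whatever identifies the logical operation (here a hypothetical `order_id`), or two genuinely separate calls with identical bodies would collide.

```python
import hashlib
import json


def outbound_idempotency_key(operation: str, payload: dict) -> str:
    # Canonical JSON: sorted keys and fixed separators, so equal payloads
    # always serialize, and therefore hash, the same way.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{operation}\n{canonical}".encode("utf-8")).hexdigest()


def build_headers(operation: str, payload: dict) -> dict:
    """Headers for a write call; the same logical call always gets the same key."""
    return {
        "Content-Type": "application/json",
        "Idempotency-Key": outbound_idempotency_key(operation, payload),
    }


attempt_1 = build_headers("create_charge", {"order_id": "ord_42", "amount": 1999})
attempt_2 = build_headers("create_charge", {"amount": 1999, "order_id": "ord_42"})  # our retry
other = build_headers("create_charge", {"order_id": "ord_43", "amount": 1999})

print(attempt_1["Idempotency-Key"] == attempt_2["Idempotency-Key"])
print(attempt_1["Idempotency-Key"] != other["Idempotency-Key"])
```

The retry carries the same key even though the dictionary was built in a different field order, while a different order gets a different key.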

The rule we apply everywhere

Any operation with a side effect that a customer would notice if it happened twice gets an idempotency key. Charges, emails, SMS, CRM record creation, inventory decrements, anything that spends money or creates state. Read-only steps do not need one, because reading twice is already harmless.

The payment double-charge never happened again after we shipped this. More than that, it changed how aggressively we could retry everywhere else. Once every write is idempotent, you can crank retry attempts up, replay an entire dead-letter batch without auditing each item, and let webhooks fire as often as the sender wants, because the worst case is wasted work rather than duplicated harm. That is the quiet payoff. Idempotency does not just prevent one bug. It makes the entire system safe to be aggressive with. If you want the processed_keys schema and the insert-first pattern we use across our flows, we are at arthea.ai.
