The on-call rotation we built for our AI agents

June 6, 2026

The failure mode nobody monitors

When we started running autonomous agents in production, we instrumented them the way we instrument any service. Error rates, exception traces, HTTP status codes. For the first few weeks it looked clean. Almost no errors. Then a client asked why their weekly report had not gone out in nine days.

The agent had not crashed. It had not thrown a single exception. It had quietly stopped producing output. An upstream n8n workflow returned an empty array instead of an error, the agent treated the empty array as a valid result, and the whole chain went dark without a single alarm.

That is the core lesson behind AI agent reliability. Traditional monitoring assumes failure looks like an error. Agents fail by going silent. They return polite, plausible nothing. So we stopped paging on errors alone and built a rotation that pages on silence.

Why agents fail quietly

It helps to understand why this happens, because the cause shapes the fix. A normal service has a narrow contract. Given a request, return a result or raise an error. An agent's contract is wider and softer. It takes ambiguous input, calls a chain of tools, and produces output that is judged on quality rather than on a status code. Every link in that chain can succeed mechanically while the chain as a whole produces nothing of value.

The model itself almost never errors. Ask it to summarize an empty document and it will cheerfully summarize the absence of content. Ask it to classify a reply that never arrived and it will pick the most likely label from thin air. The tools around it behave the same way. An API that ran out of quota returns a 200 with an empty body more often than it returns a clean 429. A workflow with a misconfigured filter returns zero rows, which is a perfectly valid response that happens to be wrong. None of this trips an exception handler. All of it produces output that looks like success and contains nothing.

That is why error-rate dashboards lie about agent health. They measure the one failure mode agents rarely hit and ignore the one they hit constantly.

Three signals per agent

Every one of our 83 agents now reports three signals into a `cron_runs` table that the on-call dashboard reads every minute.

The first signal is last good output. Not last run. Last run that produced a result we accept as real. An agent can execute on schedule, log success, and still emit garbage. So we record a timestamp only when the output passes its own validation gate. If the cold-reply-classifier has not tagged a single inbound reply in 6 hours during a workday, that is anomalous for its volume, and we want to know before the client does.

The second signal is output rate against baseline. We keep a rolling 14-day median of how much each agent produces per active hour. The sourcing agent normally adds 40 to 60 leads per run. When it drops to 3, the run did not fail. Apollo changed a response shape, or an ICP filter got too tight. A rate that falls below 30 percent of baseline pages the on-call engineer.

The third signal is dispatch-to-result gap. We stamp the moment Atlas dispatches a job to n8n and the moment the result lands back. Healthy agents close that loop in seconds to a couple of minutes. When the gap crosses the agent's own stale window, default 5 minutes, the run is marked stale and the next tick skips it rather than piling a second invocation on top. A widening gap across several agents at once usually means n8n itself is degraded, not any single agent.

How the page actually fires

The rotation is a cron that runs every minute and joins the three signals against each agent's expected cadence. We do not page on a single bad data point, because agents are noisy and a one-off empty result is normal. We page when a signal stays bad across a window long enough that no healthy agent would.

Concretely: last good output older than 2x the agent's scheduled interval, or output rate under 30 percent of the 14-day median for two consecutive active hours, or a dispatch-to-result gap past the stale window on more than three agents simultaneously. Any of those crosses the threshold and the reliability agent gets paged first. If it cannot self-resolve in 2 minutes, the page escalates to the human on call.

Severity maps to blast radius. One quiet agent is a low page handled inside the next standup. The dispatch-gap signal lighting up across the fleet is a SEV-1 that pages a human immediately, because that pattern means the bus connecting Atlas to n8n is down and every downstream client deliverable is at risk.

A runbook per failure class

A page is only useful if the person who gets it knows what to do at 2am. So every alert links to a runbook keyed to the failure class, not the agent. We found that agents fail in a small number of repeatable ways, and the fix rarely depends on which agent surfaced it.

Silent-output failures point to the upstream-empty-result runbook: check whether the data source returned an empty set, replay the last dispatch with logging on, and confirm the validation gate is rejecting correctly rather than the source being dry. Rate-collapse failures point to the baseline-drift runbook: diff the current input filters against last week, check provider quota, look for a schema change in the API response. Dispatch-gap failures point to the n8n-degraded runbook from our incident playbook: probe the n8n health endpoint, check Redis and the queue depth, and if the bus is down, pause non-critical crons so the backlog does not stampede on recovery.

Each runbook ends with the same two lines. What to verify before you close the page, and what to capture for the post-mortem if it was a SEV-1.

Keying runbooks to failure class instead of agent is the detail that keeps the rotation maintainable at 83 agents. If we wrote one runbook per agent we would have 83 documents drifting out of date. Instead we have a handful, one per way an agent can fail, and a new agent inherits all of them for free the moment it starts reporting the three signals. The agent that surfaced the alert is just context in the page. The fix lives in the class.

Tuning the thresholds without alert fatigue

The hardest part of building this was not the signals. It was the thresholds. Page too eagerly and the on-call engineer learns to ignore the pager, which is worse than no pager at all. Page too late and you are back to hearing about outages from clients.

We tune three knobs per agent. The staleness multiple, the rate floor, and the consecutive-window count. A high-cadence agent like the dispatcher, which runs every 5 minutes, gets a tight staleness multiple because two missed runs is already 10 minutes of dark. A low-cadence agent like the weekly report cron gets a generous one, because a single skipped run is not yet a problem. The rate floor sits at 30 percent of baseline for most agents, but we raised it for revenue-critical agents and lowered it for naturally bursty ones whose output is lumpy by design.

The consecutive-window count is what kills false positives. A single bad data point never pages. The signal has to stay bad across two or more checks, which filters out the normal noise of an agent that happened to have a quiet hour. We landed on these numbers by replaying two weeks of historical data against candidate thresholds and counting how many real incidents each would have caught versus how many false pages it would have fired. We picked the settings that caught every real incident with the fewest false alarms, and we revisit them whenever an agent's baseline shifts.

What this caught

In the first month after we shipped the silence-aware rotation, it caught four incidents that the old error-based monitoring would have missed entirely. An expired API token that returned 200 with an empty body. A drip dispatcher that stopped seeding new sequence state after a launch gate quietly rejected a list. An enrichment agent that kept running but produced empty icebreakers after a prompt change. And the original culprit, the empty-array report chain, which we now catch in under an hour instead of nine days.

None of those threw an error. All of them would have reached a client. The shift that mattered was simple to state and harder to build. Stop asking whether the agent ran. Start asking whether the agent did anything worth running for, and page when the answer is no.

If you are running agents in production and your dashboard is green while your output is quietly dropping, you are monitoring the wrong thing. Watch last good output, output rate, and the dispatch-to-result gap, and page on silence. That is the difference between finding out from your alerting and finding out from your client.