
The problem ICE solves
Every CRO program collects more ideas than it can ship. A heatmap throws off ten, a session replay throws off five, the founder has three, and support has a list of things customers complain about. Without a scoring method, the backlog gets worked by whoever shouts loudest or whatever is easiest to build, and the highest-value tests sit untouched for months.
ICE is the lightweight method we use for CRO experiment prioritization. It scores each idea on Impact, Confidence, and Ease, then ranks the queue by the combined score. It is not perfect math. It is a forcing function that makes the team argue about the right things before anyone writes a line of code.
Scoring impact honestly
Impact is the expected lift if the test wins, scored 1 to 10. The discipline here is anchoring it to revenue, not to vanity.
We tie impact to two things: how close to the purchase the change sits and how much traffic touches it. A test on the product page of a top-five seller scores higher than a test on a footer link, because the product page sits closer to the money and gets more traffic. We force the question: if this wins by the lift we expect, how many dollars does that represent per month? An idea that affects 2% of sessions cannot score an 8 on impact no matter how clever it is.
The common failure is everyone scoring their own pet idea a 9. We counter it by requiring a one-line revenue estimate next to every impact score. The number does not need to be precise. It needs to exist, because the act of writing it down deflates most of the inflated scores on its own.
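The one-line revenue estimate can be as simple as the arithmetic below. This is a minimal sketch, not our exact worksheet: the function name, its parameters, and every number in the usage lines are illustrative assumptions.

```python
def monthly_revenue_estimate(sessions_per_month, share_of_sessions_touched,
                             baseline_conversion_rate, average_order_value,
                             expected_relative_lift):
    """Rough extra revenue per month if the test wins by the expected lift."""
    # Only sessions that actually see the change can contribute to the lift.
    touched_sessions = sessions_per_month * share_of_sessions_touched
    baseline_revenue = (touched_sessions * baseline_conversion_rate
                        * average_order_value)
    return baseline_revenue * expected_relative_lift


# Hypothetical store: 200,000 sessions a month, 3% conversion, $80 orders.
footer_link = monthly_revenue_estimate(200_000, 0.02, 0.03, 80.0, 0.10)
product_page = monthly_revenue_estimate(200_000, 0.40, 0.03, 80.0, 0.10)
print(f"footer link:  ${footer_link:,.0f} per month")
print(f"product page: ${product_page:,.0f} per month")
```

With the same expected lift, the idea that touches 2% of sessions is worth a twentieth of the one that touches 40%, which is the point of writing the number down.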
Scoring confidence honestly
Confidence is how sure we are the test will win, scored 1 to 10. This is the score people abuse most, so we tie it to evidence.
We ask what evidence supports this idea, and the score follows the strength of that evidence. A change backed by quantitative data, like a 60% mobile checkout drop-off we can see in analytics, plus session replays showing the cause, scores high. A change backed by a single person's opinion or a best practice from a blog scores a 3 or 4. A change we have already won in a similar test on another account scores higher still, because prior wins are real evidence.
We write the evidence next to the score, the same way we write the revenue estimate next to impact. No evidence, no high confidence. This kills the most expensive mistake in CRO, which is running underpowered tests on weak hunches and burning weeks of traffic to learn nothing.
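One way to enforce "no evidence, no high confidence" is to cap the score by the strongest evidence written next to it. The sketch below is hypothetical: the evidence labels and most of the caps are assumptions chosen for illustration, and only the ceiling of 4 for opinion or blog advice comes from the ranges described above.

```python
# Assumed caps per evidence type. Only the ceiling of 4 for opinion and
# blog advice reflects the text; the other values are illustrative.
EVIDENCE_CAPS = {
    "opinion": 4,
    "blog_best_practice": 4,
    "quantitative_data": 7,
    "quantitative_plus_replays": 8,
    "prior_win_similar_test": 9,
}

NO_EVIDENCE_CAP = 2  # assumption: an idea with nothing written down stays low


def capped_confidence(proposed_score, evidence):
    """Limit a proposed confidence score by the strongest evidence listed."""
    if not evidence:
        return min(proposed_score, NO_EVIDENCE_CAP)
    cap = max(EVIDENCE_CAPS[kind] for kind in evidence)
    return min(proposed_score, cap)


print(capped_confidence(9, ["opinion"]))                    # pet idea, deflated
print(capped_confidence(8, ["quantitative_plus_replays"]))  # evidence holds
```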
Scoring ease honestly
Ease is how cheap the test is to build and run, scored 1 to 10, where 10 is trivial. This is the most concrete of the three, so it is the least abused, but there are two traps.
First, ease has to include the full cost: design, build, QA, and the traffic time to reach significance. A test that takes a day to build but six weeks to reach significance on low traffic is not easy. Second, ease is not importance. A high-ease, low-impact test should not jump the queue just because it is quick. The combined score handles this, as long as we score each axis on its own.
We also fold reversibility into ease. A test we can ship behind a feature flag and pull in five minutes if it tanks is easier than one that requires a code deploy to undo, because the cost of a wrong call is lower. This matters for the higher-impact tests, where the downside of a loser that runs too long is real lost revenue, and a fast kill switch is worth a point of ease on its own.
Working the queue
The score is the start, not the answer. Here is how we actually run it.
We compute the ICE score as the average of the three, then sort the backlog descending. The top of the list is the candidate set for the next sprint. We do not blindly take the top item, because a single number cannot capture traffic constraints. If two high-scoring tests both need the same checkout traffic to reach significance, we cannot run both at once, so we sequence them.
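As a minimal sketch of the scoring and sorting step, assuming a simple in-memory backlog (the `Idea` class is a hypothetical name, and the two entries reuse the scores from the example later in this piece):

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int
    confidence: int
    ease: int

    @property
    def ice(self) -> float:
        # ICE as the plain average of the three axes.
        return (self.impact + self.confidence + self.ease) / 3


backlog = [
    Idea("Homepage hero redesign", impact=5, confidence=4, ease=3),
    Idea("Show shipping cost earlier in mobile checkout",
         impact=8, confidence=8, ease=7),
]

# Highest score first; the top of this list is the candidate set.
ranked = sorted(backlog, key=lambda idea: idea.ice, reverse=True)
for idea in ranked:
    print(f"{idea.ice:.1f}  {idea.name}")
```

The sort only produces candidates. Sequencing around shared traffic still happens by hand, as described above.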
We cap concurrent tests so they do not contaminate each other. On most accounts that means one major test per high-traffic template at a time. Running two checkout tests simultaneously means neither result is clean.
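The cap can be sketched as a single pass over the ranked list. The function name, the dictionary keys, and the sample backlog below are assumptions for illustration, with the default of one test per template taken from the rule above.

```python
def pick_sprint_candidates(ranked_ideas, max_per_template=1):
    """Take ideas in rank order, at most `max_per_template` per template."""
    running = {}                # template -> number of tests already selected
    selected, deferred = [], []
    for idea in ranked_ideas:
        template = idea["template"]
        if running.get(template, 0) < max_per_template:
            running[template] = running.get(template, 0) + 1
            selected.append(idea["name"])
        else:
            # Same template is already occupied, so this one waits its turn.
            deferred.append(idea["name"])
    return selected, deferred


# Hypothetical backlog, already sorted by ICE score, highest first.
ranked_ideas = [
    {"name": "Shipping cost shown earlier", "template": "checkout"},
    {"name": "Guest checkout default", "template": "checkout"},
    {"name": "Size guide above the fold", "template": "product"},
]
selected, deferred = pick_sprint_candidates(ranked_ideas)
print("run now:", selected)
print("sequence later:", deferred)
```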
We re-score the backlog every sprint, because confidence changes as we learn. A losing test often raises confidence on a related idea, or kills three ideas that shared the same assumption. The backlog is a living document, and a six-month-old ICE score is stale.
We also keep a hard rule on test power. Before a test goes live, we estimate the sample size needed to detect the expected lift at 95% confidence, and if the traffic cannot get there in a reasonable window, the test goes back to the backlog regardless of its ICE score. A test you cannot read is a test you should not run.
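A rough version of the power check, using the standard normal-approximation sample size formula for comparing two conversion rates. This is a sketch under stated assumptions: the 80% power default is our assumption here (the rule above only fixes the 95% confidence level), the function names are hypothetical, and the traffic figures in the usage lines are invented.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift,
                            alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 95% confidence by default
    z_beta = NormalDist().inv_cdf(power)            # assumed 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def days_to_read(baseline_rate, relative_lift, daily_visitors_per_variant):
    """Days of traffic needed before the test can be read."""
    needed = sample_size_per_variant(baseline_rate, relative_lift)
    return ceil(needed / daily_visitors_per_variant)


# Hypothetical: 3% baseline conversion, 10% expected relative lift,
# 2,000 visitors a day in each variant.
print(sample_size_per_variant(0.03, 0.10), "visitors per variant")
print(days_to_read(0.03, 0.10, 2_000), "days")
```

If the day count lands outside whatever window the team considers reasonable, the test goes back to the backlog, whatever its ICE score.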
There is a second discipline that pairs with the power check: we write the hypothesis and the success metric before the test ships, in one sentence each. The hypothesis names what we are changing and why we think it will move the number. The success metric names the single metric we will judge it on, decided in advance. This stops the most common way CRO programs fool themselves, which is shipping a test, watching it lose on the primary metric, then going hunting through secondary metrics for any segment where it happened to win and declaring victory. The metric is locked before the data comes in, so a loss is a loss and we learn from it.
We log every result, win or lose, in the same place as the backlog. A losing test is not a failure; it is information that re-scores the queue. Over a few quarters the log becomes the most valuable asset in the program, because it tells us which kinds of changes tend to win on this specific brand, which raises confidence scores on related ideas with real evidence instead of hunches.
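One simple way to turn the log into evidence is a win rate per kind of change. This is a minimal sketch that assumes each log entry records a change type and whether the test won. The field names and the sample entries are hypothetical.

```python
from collections import defaultdict


def win_rate_by_change_type(log):
    """Return {change_type: share of logged tests of that type that won}."""
    counts = defaultdict(lambda: [0, 0])   # change_type -> [wins, runs]
    for entry in log:
        counts[entry["change_type"]][1] += 1
        if entry["won"]:
            counts[entry["change_type"]][0] += 1
    return {kind: wins / runs for kind, (wins, runs) in counts.items()}


# Hypothetical log entries, wins and losses side by side.
log = [
    {"change_type": "checkout_friction", "won": True},
    {"change_type": "checkout_friction", "won": True},
    {"change_type": "checkout_friction", "won": False},
    {"change_type": "hero_redesign", "won": False},
]
for kind, rate in win_rate_by_change_type(log).items():
    print(f"{kind}: {rate:.0%} of tests won")
```

A handful of entries per type is thin evidence, so these rates are best used to nudge a confidence score, not to set it.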
A concrete example
A fashion ecommerce client gave us a backlog of 34 ideas and a strong opinion about which to run first: a homepage hero redesign the founder loved. We scored the full backlog.
The hero redesign came out at impact 5, confidence 4, ease 3, for an ICE of 4.0. Confidence was low because the only evidence was preference, and ease was low because it was a full design and build. The top of our ranked list was something nobody had championed: a fix for a mobile checkout that showed shipping cost only on the final step, where analytics showed a 58% drop-off. It scored impact 8, confidence 8 backed by the analytics and three session replays, and ease 7 because it was a small template change, for an ICE of 7.7.
We ran the checkout test first. It lifted mobile conversion 14% and reached significance in 11 days. The hero redesign, when we eventually ran it, was flat. ICE did not predict the outcome perfectly, but it put the test that mattered at the front of the queue instead of the test that was loudest.
How we think about it
ICE is not a model that tells you the truth. It is a shared language that makes a team score the same ideas the same way and argue about evidence instead of opinions. The value is not the number. It is the conversation the number forces, and the queue discipline that comes after. We pair it with a hard power check so we never ship a test we cannot read, and we re-score every sprint because the backlog learns as we do.
The scoring rubric and the queue rules are written up at https://www.arthea.ai/article/cro-experiment-backlog-ice.








