The Phantom Movement Problem
Before you can trust a metric to flag real regressions, you have to know how much it moves when nothing has changed. Most teams never measure that floor, and they pay for it every week.
Most performance teams have at least one: the metric everyone on the team knows "just does that sometimes." It jitters a few percent day-over-day for reasons nobody can quite pin down. Someone made a note about it eighteen months ago, someone else promised to investigate, and it still jitters. When someone new to the team points at a spike, the veterans exchange a look and say "yeah, that one just moves around, don't worry about it."
The common story about noisy metrics is that they hide real regressions. That happens. The bigger cost, the one that shows up first and compounds fastest, is the engineering time a team burns investigating phantom movements. Someone sees a 12% drop, pulls five engineers into a war room, and three days later the team concludes it was probably just the metric doing the thing it does. That investigation produced nothing because there was nothing to find. It happens again next week, and the week after. Multiply across a year and the noise has eaten an entire headcount's worth of work, none of it shipping a fix.
None of that investigation budget should have been spent. A metric enters trusted use only after it has passed an A/A test and has a documented noise floor. The team running the metric should be able to point at a single number and say "movement below this is noise, movement above this is real." Most teams do not run that test on anything, which is why most teams have a dashboard everyone knows just does that sometimes.
A rational engineering trade-off
At a large consumer platform, we needed a crash metric. Measuring client-side crashes on the web is structurally much harder than on the server. A server process that dies gets caught by another process and written to a crash log. A browser tab that crashes disappears from the network, and there is no reliable way for it to phone home and say so. We needed a number we could trust to move when the product got less stable, and the direct approach was not available to us.
We made a rational engineering trade-off. We built a heuristic that inferred crashes from signals like abnormal session termination and missing heartbeat events. It was the best answer available within the constraints we had, and it solved a problem the platform's native crash reporting could not solve. It was also a trust failure, and the failure had nothing to do with whether the heuristic was well-designed and everything to do with how much it moved for reasons that were not crashes.
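The details of the heuristic matter less than the fact that it was an inference, but for concreteness, here is a minimal sketch of the kind of rule we are describing, with hypothetical field names and thresholds rather than our production values:

```python
from dataclasses import dataclass

@dataclass
class SessionEnd:
    # Hypothetical fields, illustrative of the signals rather than the production schema.
    last_heartbeat_age_s: float   # seconds since the last heartbeat event
    clean_unload: bool            # did the client send a pagehide/unload beacon?
    visible_at_last_beat: bool    # was the tab foregrounded when it went quiet?

def looks_like_crash(s: SessionEnd, heartbeat_timeout_s: float = 60.0) -> bool:
    """Infer a crash from indirect signals.

    A session that stopped heartbeating while visible and never sent a clean
    unload is probably a crash, but a dropped network, a closed laptop lid, or
    a logging outage produces exactly the same signature. That ambiguity is
    the phantom-movement risk, not a bug in the rule.
    """
    return (
        s.last_heartbeat_age_s > heartbeat_timeout_s
        and not s.clean_unload
        and s.visible_at_last_beat
    )
```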
The heuristic caught real crashes. It also caught a parade of things that had nothing to do with crashes: logging pipeline hiccups, network flakes, users closing laptops on airplanes, a retry loop that landed events in the wrong bucket for a week. Every launch review turned into a two-hour debate about whether the number that had just moved was signal or noise, and nobody in the room could produce a principled answer. We missed real regressions in the noise. What hurt more was the thousands of engineer-hours we spent investigating movements that turned out to be nothing.
Nobody knew the no-change floor
The diagnosis is always the same. The metric was shipped into trusted use without anyone having measured how much it moved when nothing had changed. Nobody had a number for "minimum movement that counts as real." Every new data point was a debate because every new data point could have been a real regression or could have been the metric doing what it always did, and there was no principled way to tell them apart. When a metric's jitter floor is unknown, every movement is a coin flip, and humans are very bad at resisting the pull to investigate a coin flip that might be important.
The previous post in this series talked about metric design, and one of the rules was about surviving your weirdest users: percentiles, clipping, session-windowing. That handles noise across the population. Phantom movement is noise across time. A metric can survive every pathological user in your population and still wobble eight percent day-over-day because the pipeline is non-deterministic or the aggregation has a bucket boundary in the wrong place. Both problems have to be solved, and solving one does not solve the other.
The validation test is an A/A test
Before a metric enters trusted use, run it through your A/B platform as a pure control-versus-control experiment. Same code on both arms, same config, random bucketing. Let it run until it has reached your platform's standard stopping criteria, the same sample size you would require for a real experiment at your standard significance level. If the metric comes back statistically significant, that is a false positive by construction. Nothing changed, so any stat sig result is the metric reporting a difference that does not exist.
Evaluate each metric on its own repeated A/A runs, not by scanning a multi-metric battery for a single hit. At a 5% threshold, one metric in twenty will land stat sig by chance on any given run, and that chance hit is not phantom movement, just the base rate of the test. Run the test enough times per metric and you get an empirical distribution of how much it moves in no-change conditions, and that distribution is its noise floor. Anything that moves within it is indistinguishable from nothing, and you should not spend a single engineer-hour investigating it.
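The arithmetic behind the floor is simple enough to sketch offline, even though the real gate runs through your A/B platform. A minimal version, assuming you have per-unit values of the metric from a no-change period; the 95th-percentile cut, the helper name, and the simulated crash indicator are illustrative choices, not a standard:

```python
import numpy as np

def aa_noise_floor(unit_values: np.ndarray, n_runs: int = 1000,
                   quantile: float = 0.95, seed: int = 0) -> float:
    """Empirical noise floor from repeated A/A splits.

    Each run randomly buckets units into two identically treated arms,
    computes the metric per arm, and records the relative difference.
    Nothing changed, so the spread of those differences is pure metric noise.
    """
    rng = np.random.default_rng(seed)
    deltas = []
    for _ in range(n_runs):
        arm = rng.integers(0, 2, size=len(unit_values)).astype(bool)
        a, b = unit_values[arm].mean(), unit_values[~arm].mean()
        deltas.append(abs(a - b) / b)
    # Movement below this quantile of no-change deltas is indistinguishable
    # from nothing; movement above it is worth an engineer's time.
    return float(np.quantile(deltas, quantile))

# Illustrative input: a per-session crash indicator from a known-quiet week.
sessions = np.random.default_rng(1).binomial(1, 0.004, size=200_000).astype(float)
print(f"noise floor ~ {aa_noise_floor(sessions):.1%} relative movement")
```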
Historical jitter analysis is a reasonable corroborator. Pull the metric during a known-quiet period, compute day-over-day variance, and you get a rough estimate of normal wobble. The problem is that historical analysis has a chicken-and-egg trap built into it. Teams usually go looking at historical variance because they already saw the metric do something weird, which means they had already lost trust in the metric before the analysis started. An A/A test runs before the metric is trusted at all and establishes the floor proactively. Historical analysis is defense in depth, not the primary gate.
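If you do run the corroborating check, it is a few lines. A sketch, assuming a date-indexed series of daily metric values from a quiet window; the series name and the quantile are hypothetical choices:

```python
import pandas as pd

def day_over_day_jitter(daily: pd.Series, quantile: float = 0.95) -> float:
    """Rough estimate of normal wobble from a known-quiet period.

    `daily` holds one value of the metric per day, indexed by date, over a
    window with no launches, no holidays, no incidents. The spread of
    day-over-day relative changes corroborates the A/A-derived floor; it does
    not replace it.
    """
    dod = daily.pct_change().abs().dropna()
    return float(dod.quantile(quantile))

# Hypothetical usage, assuming `crash_rate_by_day` is a date-indexed Series:
# quiet = crash_rate_by_day.loc["2024-02-01":"2024-02-28"]
# print(f"historical wobble ~ {day_over_day_jitter(quiet):.1%} day-over-day")
```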
Where the wobble comes from
The wobble tends to come from three places: pipeline non-determinism, aggregation artifacts, and second-order measurement traps.

Pipeline non-determinism is the hardest of the three to diagnose, because the metric is actually doing what it was built to do. Sampling, retry logic, out-of-order event arrival, race conditions in the aggregation job. The metric's reading depends on details of how the pipeline happened to process events on a given day, and two runs over the same traffic produce different numbers. The difference between those runs is a noise floor you can never fall below. Any real regression smaller than that floor is invisible, and any apparent movement smaller than that floor is not a regression at all.
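The effect is easy to reproduce offline. A toy sketch of a deliberately non-deterministic pipeline, with made-up sampling and late-arrival rates; two passes over identical traffic come back with different numbers:

```python
import numpy as np

def run_pipeline(events: np.ndarray, sample_rate: float = 0.10,
                 late_drop_rate: float = 0.01) -> float:
    """One pass of a deliberately non-deterministic toy pipeline.

    Sampling and late-arrival drops depend on the run, not on the data, so two
    passes over identical traffic disagree. That disagreement is a floor: no
    real regression smaller than it can ever be seen, and no movement smaller
    than it is evidence of anything.
    """
    rng = np.random.default_rng()                          # unseeded on purpose
    kept = events[rng.random(len(events)) < sample_rate]   # per-run sampling decision
    kept = kept[rng.random(len(kept)) >= late_drop_rate]   # events that arrived in time
    return float(kept.mean())

traffic = np.random.default_rng(7).exponential(scale=120.0, size=1_000_000)  # same events both runs
print(run_pipeline(traffic), run_pipeline(traffic))  # two different readings, nothing changed
```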
Aggregation artifacts are more subtle than pipeline problems and easier to overlook. Sparse-bin noise, bucket boundary effects, histogram quantization. The shape of the aggregation itself creates wobble independent of the underlying data. A P99 computed over ten thousand events per day is noisy in ways a P99 computed over ten million events per day is not, and the smaller aggregation will move day-over-day for structural statistical reasons even if user behavior is identical. A team that does not understand the math will spend weeks chasing the movement.
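The structural part is easy to demonstrate. A quick simulation, assuming the same latency distribution on every simulated day so that the only thing separating the two calls is volume; the lognormal parameters are arbitrary:

```python
import numpy as np

def p99_wobble(daily_events: int, n_days: int = 30, seed: int = 0) -> float:
    """Day-over-day relative spread of a P99 when behavior is identical.

    Every simulated day draws from the same latency distribution; only the
    number of events per day changes between the two calls below.
    """
    rng = np.random.default_rng(seed)
    p99s = np.array([
        np.percentile(rng.lognormal(mean=5.0, sigma=0.8, size=daily_events), 99)
        for _ in range(n_days)
    ])
    return float(np.abs(np.diff(p99s) / p99s[:-1]).std())

print(f"P99 wobble at 10k events/day: {p99_wobble(10_000):.2%}")
print(f"P99 wobble at 10M events/day: {p99_wobble(10_000_000):.2%}")  # takes a few seconds
```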
Second-order measurement traps are the most insidious of the three. When a metric is inferring something from a proxy rather than measuring it directly, anything that moves the proxy without moving the underlying reality creates phantom movement. A logging infrastructure change, a new retry loop in the client, a scraper that started executing JavaScript, a change in how the network stack reports timeouts. None of these are changes to user experience, and all of them move the proxy and make the metric lie. Second-order metrics should be treated as guilty until proven innocent by an A/A test, and kept under heavier scrutiny than direct measurements even after they pass.
Promote, or delete
A metric does not enter the list of trusted metrics until it has passed an A/A test and has a documented noise floor. The floor lives next to the runbook. Every future debate about whether a movement is real has an explicit threshold to compare against, and engineers can look at a chart and know within a second whether the movement on it is worth their time. A metric that fails its A/A test gets fixed or deleted before it poisons the team's investigation budget. A metric that is kept around despite failing is a liability, and the team will pay for it every week in chasing ghosts until someone has the stomach to pull it.
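What "the floor lives next to the runbook" can look like in practice is a small registry that gates investigation. A sketch, with hypothetical metric names, floors, and dates:

```python
# Sketch of a documented noise floor living next to the runbook.
# Metric names, floors, dates, and URLs here are hypothetical placeholders.
TRUSTED_METRICS = {
    "web_crash_rate": {
        "noise_floor_rel": 0.06,        # A/A-derived: 6% relative movement is noise
        "last_aa_test": "2024-05-12",
        "runbook": "https://wiki.example.com/runbooks/web_crash_rate",
    },
}

def worth_investigating(metric: str, observed_rel_change: float) -> bool:
    """Is this movement above the documented noise floor?

    A metric without a documented floor is not trusted, so an alert on it is a
    prompt to run the A/A test, not to open an investigation.
    """
    entry = TRUSTED_METRICS.get(metric)
    if entry is None:
        raise KeyError(f"{metric} has no documented noise floor; run its A/A test first")
    return abs(observed_rel_change) > entry["noise_floor_rel"]

print(worth_investigating("web_crash_rate", -0.12))  # True: a 12% drop clears a 6% floor
```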
Pick the metric on your team's dashboard that everyone knows just does that sometimes. How many hours did engineers spend arguing about it in the last quarter? For each of the metrics your team actually trusts, can you produce its noise floor as a single number? When was the last time one of them was run through an A/A test, and when was the last time one failed? If none of those questions have answers, the metric that fires during next week's incident is going to produce the same debate it produced last time, and it is going to take the same number of engineer-hours to resolve.