The Anatomy of a Trusted Metric
Once a metric loses trust, it does not recover. It still produces numbers, and those numbers still land on dashboards, but nobody acts on them. Engineers stop investigating when it spikes. Leaders stop referencing it in reviews. A new metric gets built on top of the old one, the old one becomes wallpaper, and the cycle starts over.
After a decade working on performance metrics at scale, I have landed on five rules that separate a metric that earns trust from one that just exists. A metric that breaks any of them will not hold up for long.
Rule 1: A trusted metric measures exactly one thing, directly
A trusted metric tells you what you want to know, and when it moves, it tells you why. There are two ways to break that rule: measure something adjacent and hope it correlates, or define the metric through so many layered rules that a movement tells you nothing about which rule tripped.
Proxies are the more seductive failure, because the reasoning always sounds fine in the moment. You want X, X is hard to measure, Y sits right next to X and usually tracks it, so you reach for Y. At a large consumer platform, bounce rate was the metric that fell into that trap. We had no clean browser signal for "user gave up and left," so we built one: count URL requests that never produced a corresponding server-side log for a successful page render. The reasoning was clean: if you asked for a page and we never logged serving it, you bounced. The trouble is that a lot of things produce that mismatch besides users giving up. Bots requesting URLs in weird ways produced it. Users who abandoned within a hundred milliseconds got counted identically to users who waited ten seconds for the page to paint and then gave up, two entirely different stories collapsed into one number. The metric spiked whenever a bot ring hit a new crawl pattern or a regional CDN had a latency blip, and every time it moved we burned an afternoon arguing about whether the product had regressed or the scrapers had gotten cleverer.
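A minimal reconstruction of that proxy, with hypothetical log shapes, makes the failure mode easy to see: the mismatch count has no way to tell a frustrated user from a bot or a request dropped at the edge.

```typescript
// Hypothetical reconstruction of the proxy: a request is a "bounce" if no
// server-side render log ever confirms the page was served.
interface RequestLog { requestId: string; url: string; userAgent: string }
interface RenderLog { requestId: string }

function proxyBounceRate(requests: RequestLog[], renders: RenderLog[]): number {
  const rendered = new Set(renders.map((r) => r.requestId));
  // Every unmatched request counts as a bounce -- whether it came from a user
  // who gave up, a bot crawling URLs in a weird way, or a CDN latency blip.
  const bounces = requests.filter((req) => !rendered.has(req.requestId)).length;
  return bounces / requests.length;
}
```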
Perceived performance fails the same way. You want to know how fast the page feels, so you reach for a browser timing event like First Contentful Paint (FCP) or window.onload. Those events measure when the browser hits a milestone in the load pipeline. They do not tell you whether the page feels fast. A page can post a great FCP and still take six seconds after the hero image paints to become scrollable, or show a button instantly that then needs a full second of JavaScript to wire up. You can optimize those timing events for a quarter, ship a measurable win on every dashboard, and no user will notice that the page got faster.
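For concreteness, here is roughly what reaching for that milestone looks like. The observer API is real; the point is that all it hands back is a timestamp for a rendering event, nothing about how the page felt.

```typescript
// Observe First Contentful Paint. The entry records when the browser first
// painted any content -- a milestone in the load pipeline, nothing more.
const fcpObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.name === 'first-contentful-paint') {
      console.log(`FCP: ${entry.startTime.toFixed(0)} ms`);
      // Nothing here says whether the page is scrollable, interactive,
      // or useful at this point -- only that pixels appeared.
    }
  }
});
fcpObserver.observe({ type: 'paint', buffered: true });
```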
Convoluted metrics fail a different way. The metric is technically direct, but defined with so much internal machinery that when it moves, you cannot reason about what actually changed. Cumulative Layout Shift (CLS) is a good example. To compute it, you group every unexpected layout shift across the page's lifetime into session windows up to five seconds long with no more than a one-second gap between shifts, score each shift by how much of the viewport moved and how far, sum the scores in each window, and report the largest window as the page's CLS. Not only is this hard to reason about, but when CLS regresses, the movement could have come from any shift in any session window at any point in the page's life, buried in any component anywhere in the UI. Interaction to Next Paint (INP) has the same problem in a different shape. It reports roughly the 98th-percentile interaction latency across an entire visit: in practice, one of the slowest interactions the user hit. A regression tells you that one of hundreds of possible interactions on the page got slower. It does not tell you which one, or when, or in what flow, which in a client-heavy SPA is most of what you need to know to fix it.
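To make the machinery concrete, here is a simplified sketch of the session-window computation, built on the layout-shift entries the browser exposes. It is close to the standard implementation but trimmed for illustration, and it shows how many moving parts sit between a shifting component and the final number.

```typescript
// Minimal CLS computation: group layout shifts into session windows
// (max 5 s long, max 1 s gap between shifts) and report the worst window.
interface LayoutShiftEntry extends PerformanceEntry {
  value: number;
  hadRecentInput: boolean;
}

let cls = 0;
let windowValue = 0;
let windowStart = 0;
let lastShift = 0;

new PerformanceObserver((list) => {
  for (const entry of list.getEntries() as LayoutShiftEntry[]) {
    if (entry.hadRecentInput) continue; // shifts caused by user input don't count
    const sameWindow =
      windowValue > 0 &&
      entry.startTime - lastShift < 1000 &&
      entry.startTime - windowStart < 5000;
    if (sameWindow) {
      windowValue += entry.value;
    } else {
      windowValue = entry.value;
      windowStart = entry.startTime;
    }
    lastShift = entry.startTime;
    cls = Math.max(cls, windowValue); // the page's CLS is the largest window
  }
}).observe({ type: 'layout-shift', buffered: true });
```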
Both failures land in the same place. The metric moves, nobody has an immediate read on what happened, and the investigation starts from nothing. Do that a few times and the metric has lost the room.
Rule 2: A trusted metric comes with a runbook
A metric regression without a runbook is just an alert with no next step. It fires, the team knows something is wrong, and every investigation starts from scratch. The runbook is what turns a spike into a diagnosis. It names the secondary signals to check, in what order, and what each one implies about where the problem actually lives.
Largest Contentful Paint (LCP) is a useful metric to think about here. When LCP regresses, there is a finite set of candidates: JavaScript execution time grew, the largest image got larger, font loading slowed down, the critical CSS changed, the server response time went up. Each one of those has its own signal you can pull up, and each one points to a different owner. A runbook for LCP is exactly that list, written down, with the dashboards to check for each candidate. When the metric fires at 2am, the on-call engineer does not have to remember which layer moved last.
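One way to keep that list honest is to write it down as data rather than prose, so it can ship alongside the alert. A sketch of what that might look like, with placeholder dashboards and owners standing in for whatever your org actually has:

```typescript
// A hypothetical runbook entry: for each candidate cause of an LCP
// regression, the signal to check and the team that owns it.
interface RunbookStep {
  cause: string;
  signal: string; // dashboard or query to pull up
  owner: string;  // who to loop in if this is the link that moved
}

const lcpRunbook: RunbookStep[] = [
  { cause: 'JS execution time grew',   signal: 'long-tasks dashboard',     owner: 'web-platform' },
  { cause: 'largest image got larger', signal: 'image weight by template', owner: 'content tooling' },
  { cause: 'font loading slowed down', signal: 'font transfer timing',     owner: 'design systems' },
  { cause: 'critical CSS changed',     signal: 'render-blocking CSS size', owner: 'frontend infra' },
  { cause: 'server response slowed',   signal: 'TTFB by region',           owner: 'serving team' },
];
```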
The runbook is also a forcing function on the metric designer. If you sit down to write the runbook and cannot name the secondary signals, the metric is too abstract to be actionable. It measures something real but nothing anyone can act on. That is the signal to either decompose the metric into something tighter or go build the secondary signals first, before you ship the top-line number.
Rule 3: A trusted metric survives your weirdest users
At scale, your real user population is weirder than you think. Some people open a tab and walk away for six hours. Some have a system clock that says 1997. Some are running on an old Android on a 3G connection in a rural market, and some are on a new MacBook on fiber. Some are not users at all. They are datacenter bots, synthetic load tests, your own QA team, scrapers polite enough to execute JavaScript. A metric that treats all of them equally will get moved around by whichever subpopulation shifts fastest, and it will almost never be the one you care about.
The first line of defense is percentiles instead of averages. The average is a lie in any distribution with a long tail, which describes essentially every real user metric. A P50, P75, and P99 tell you where the median user sits, where the first unhappy cohort starts, and where the pathological tail lives. They also let you tell the difference between a regression that hurts everyone a little and a regression that destroys one percent of sessions while leaving the rest alone. Those two failures may look identical on an average and nothing alike on percentiles, and they have very different fixes.
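A synthetic example makes the point: a regression that wrecks a small slice of sessions barely moves the mean, while the P99 calls it out immediately.

```typescript
// Percentile of a sorted sample (nearest-rank); enough for illustration.
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

// Synthetic load times: most sessions around 1 s, a 2% pathological tail at 30 s.
const samples = Array.from({ length: 1000 }, (_, i) =>
  i < 980 ? 1000 + (i % 200) : 30000
).sort((a, b) => a - b);

const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
console.log({
  mean,                              // shifts by well under a second
  p50: percentile(samples, 50),      // the median user is fine
  p75: percentile(samples, 75),
  p99: percentile(samples, 99),      // jumps to 30 s and names the victim cohort
});
```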
The second line is session-windowing and outlier clipping. Sum metrics like CLS have to cap at a reasonable window so a tab left open for a week does not accumulate a nonsense value. Duration metrics clip at a ceiling so the user who went to lunch with the tab open does not drag your P99 to infinity. Bot traffic gets filtered out before the metric is computed, not after. A metric that does not do these things will produce movements that look like engineering bugs but are actually population shifts, and the team will burn weeks investigating the wrong thing.
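A sketch of that hygiene, with hypothetical field names for how bot tagging and durations arrive upstream:

```typescript
interface Session {
  isBot: boolean;     // however bot detection tags it upstream
  durationMs: number; // raw measured duration, possibly absurd
}

const DURATION_CEILING_MS = 60_000; // beyond this, the user probably walked away

function cleanDurations(sessions: Session[]): number[] {
  return sessions
    .filter((s) => !s.isBot)                                 // drop bots before computing, not after
    .map((s) => Math.min(s.durationMs, DURATION_CEILING_MS)) // clip the lunch-break tab
    .sort((a, b) => a - b);                                  // ready for percentile computation
}
```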
Rule 4: A trusted metric is intuitive to reason about
A metric is only useful if the people who need to act on it can reason about it without being an expert. That has two halves. The engineer who owns the system should be able to predict which direction a given change will move the metric before they ship it. The PM or VP sitting in the launch review should be able to follow what the metric means in two minutes, with no prior context.
If the engineer cannot predict direction, the metric is a coin flip and every launch is a gamble. If leadership cannot follow the explanation, the metric never makes it into the rooms where budget decisions get made, which are the only rooms that actually matter for getting performance work funded. A concrete test: can you explain this metric to someone on the sales team and have them tell you back what it means in plain English? If the answer is no, the metric is going to live and die inside the engineering org and nothing outside will ever act on what it shows.
Intuitive does not mean simple. LCP is not a simple metric. Its definition involves element visibility, viewport geometry, and a cutoff rule that most engineers have to look up to implement correctly. But its mental model is simple: when did the biggest useful thing on the page show up? That is the part that has to be intuitive. The implementation can be as complicated as it needs to be, as long as the one-sentence explanation lands.
Rule 5: A trusted metric measures something worth acting on
This one is the quietest of the five and the one that kills the most metrics in practice. A metric that is direct, runbooked, robust to outliers, and easy to reason about can still be worthless if it does not connect to anything the user or business cares about. Pure technical metrics fail this test most often. Bundle size is a good example. Nobody outside engineering cares about bundle size. People care about page load time, and they care about it because it drives engagement, and they care about engagement because it drives revenue. If you cannot walk the chain from your metric all the way to something a non-engineer cares about, you do not have a business case, and your metric will lose every budget fight it enters.
The chain has to be load-bearing at every link. Bundle size affects JavaScript parse time affects time-to-interactive affects bounce rate affects sessions affects revenue. Every link in that chain should have been measured at least once and have a known elasticity. A metric that only connects partway down the chain is still actionable, as long as the rest of it is already mapped from prior work. A metric that cannot reach the first link at all is a technical curiosity. Keep it for debugging, but do not put it on the weekly review and expect anyone outside engineering to care.
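What "known elasticity at every link" buys you is the ability to walk the chain with arithmetic. The numbers below are placeholders standing in for measurements you would have made in prior experiments; the structure of the calculation is the point.

```typescript
// Hypothetical elasticities, each one measured in a prior experiment.
const parseMsPerKb = 0.8;           // bundle size -> JS parse time
const ttiMsPerParseMs = 1.0;        // parse time -> time-to-interactive
const bouncePtsPer100MsTti = 0.2;   // TTI -> bounce rate (percentage points)
const sessionsPerBouncePt = 50_000; // bounce rate -> sessions per week
const revenuePerSession = 0.12;     // sessions -> revenue (USD)

// Walk the chain for a proposed 150 KB bundle reduction.
const bundleDeltaKb = -150;
const ttiDeltaMs = bundleDeltaKb * parseMsPerKb * ttiMsPerParseMs;   // -120 ms
const bounceDeltaPts = (ttiDeltaMs / 100) * bouncePtsPer100MsTti;    // -0.24 pts
const revenueDelta = -bounceDeltaPts * sessionsPerBouncePt * revenuePerSession;
console.log({ ttiDeltaMs, bounceDeltaPts, revenueDelta });           // ~+$1,440 / week
```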
Five rules, one underlying principle
A trusted metric ends arguments. When the number moves, everyone agrees on what it means, agrees on whether the movement matters, and agrees on what to look at next. The conversation skips the data debate entirely and moves straight to the fix.
Pick one of the metrics your team looks at every week. Walk it through the five rules. Does it measure exactly one thing, directly? Does it have a runbook? Does it survive your weirdest users? Can a non-engineer follow it in two minutes? Does it connect to something the user or business cares about? If it fails even one of them, you already know what next week's debate is going to sound like.