The 8-Layer Performance Engineering Operating System

Performance engineering is a strategy problem. Most engineering organizations have decent telemetry. They have Datadog, they have Grafana, they have dashboards their VP shows on stage at all-hands. What they do not have is a strategy built around any of it.

The gap is everything above monitoring. Performance has no single owner. Leadership has no dollar figure for a 100ms latency regression. Metrics exist but do not drive decisions, and performance gets prioritized reactively after an incident rather than proactively as a discipline.

These companies have plenty of data, but the work of turning that data into a system has just never happened.

The framework

What follows is the framework I use when I work on performance at scale. I call it the 8-Layer Performance Engineering Operating System, and it is the lens I apply to the entire discipline. Eight layers, each addressing one category of what separates a team that actually protects performance from a team that is just collecting telemetry.

Layers 1 through 6 are sequential process layers. Each one depends on the layers below it working correctly. Layers 7 and 8 run persistently in parallel with everything else, more substrate than stage.

This post is the table of contents for the rest of this site. Every deep dive I publish here lives under one of the layers below, and every layer is its own landing pad. As you read through them, you should be able to identify which one is broken at your own company, and which one you need to read next.

Layer 1: Metric Design

The craft of designing metrics that are actionable, intuitive, business-linked, and statistically sound. This layer covers what to measure and how to aggregate it so a single outlier session does not move your top-line number. It covers the hypothesis you state before implementing the metric so you know what a real regression looks like, and the promotion process that takes a new metric from alpha to final through real production experience.
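
To make the aggregation point concrete, here is a minimal sketch, using invented session latencies, of how one pathological session drags a mean while percentiles hold still:

```python
import statistics

# 999 ordinary sessions around 200ms, plus one pathological 60-second session.
latencies_ms = [200.0] * 999 + [60_000.0]

def percentile(values, pct):
    """Nearest-rank percentile over a sorted copy of the values."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(pct / 100 * len(ordered)))]

print(f"mean: {statistics.mean(latencies_ms):.1f}ms")  # ~259.8ms, dragged up 30% by one session
print(f"p50:  {percentile(latencies_ms, 50):.1f}ms")   # 200.0ms, unmoved
print(f"p95:  {percentile(latencies_ms, 95):.1f}ms")   # 200.0ms, unmoved
```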

Skip this layer and you end up with 200 dashboards and zero trusted numbers. At a previous employer we had a crash metric that nobody quite trusted, and it moved on nearly every experiment. It was a heuristic, a second-order signal that tried to infer crashes rather than measure them, and it caught logging bugs and abandoned sessions right alongside real crashes. Every launch turned into a debate over whether a regression was real. Does your team agree right now on which five metrics reflect user reality?

Layer 2: Metric Validation

A designed metric is a hypothesis, and validation is the process of proving it against real production experience. The metric should spike when a known incident hits, drop when a known fix lands, and move the right way when you run an experiment that should move it. Validation by contradiction works the other direction: instead of waiting for natural incidents and fixes, you deliberately push a change you know should move the metric and confirm that it does. Once it survives all of that, the question shifts from whether the metric works to whether it is diagnostic. When it moves you have to know why, and the metric itself never tells you. The designer has to name the secondary signals to check and what each one means before the metric ever fires. That is what a runbook is for.
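
As a sketch of what the backtesting half can look like: the data shape and threshold here are hypothetical, and a real check would compute its baseline from incident-free periods rather than the whole series.

```python
import statistics

def missed_incidents(series, incident_windows, z_threshold=3.0):
    """Backtest a metric against known incidents. `series` is a list of
    (timestamp, value) points; `incident_windows` is a list of (start, end)
    pairs. Returns the windows where the metric never cleared the threshold."""
    values = [v for _, v in series]
    mean, sd = statistics.mean(values), statistics.stdev(values)
    missed = []
    for start, end in incident_windows:
        window = [v for t, v in series if start <= t <= end]
        if not any((v - mean) / sd >= z_threshold for v in window):
            missed.append((start, end))
    return missed  # anything here means the metric slept through a known incident
```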

Skip this layer and your first real incident becomes a crisis of confidence. The metric looks fine. Users are complaining. Nobody knows whether the metric is broken, the system is broken, or both. The last time a user-facing regression caught your team by surprise, did your top-line metric move when it should have? Trust is the only currency a metric has, and it gets earned here or nowhere.

Layer 3: Business Impact Quantification

What is performance actually worth to this business in dollars? Every company should know this number. Almost none do. The two methods that produce it are ablation studies, where you deliberately degrade performance for a controlled cohort and measure what happens to engagement and revenue, and historical analysis, where you mine past incidents and rollouts for natural experiments. The output is a single ratio, revenue lost per unit of latency, and an elasticity curve that shows you where additional latency stops being absorbed and starts bleeding users.
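
The arithmetic at the end is simple once the ablation data exists. A toy version, with every number invented for illustration:

```python
# Hypothetical ablation arms: deliberately added latency vs. revenue per user.
added_latency_ms = [0, 100, 200, 400]            # invented
revenue_per_user = [1.000, 0.992, 0.981, 0.948]  # invented, $1.00 baseline

# Revenue drop per 100ms between consecutive arms: 0.8%, 1.1%, 1.65%.
# The widening gaps are the elasticity curve: latency stops being absorbed.
arms = list(zip(added_latency_ms, revenue_per_user))
for (l0, r0), (l1, r1) in zip(arms, arms[1:]):
    print(f"{l0}->{l1}ms: {(r0 - r1) / (l1 - l0) * 100 * 100:.2f}% per 100ms")

# The single ratio, scaled to a hypothetical 50M monthly users:
loss_per_user = revenue_per_user[0] - revenue_per_user[1]  # $0.008 per 100ms
print(f"100ms regression ~= ${loss_per_user * 50_000_000:,.0f}/month")  # ~$400,000
```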

Skip this layer and you walk into the budget meeting asking for headcount with P99 latency charts instead of dollar amounts. Leadership nods politely and funds something else. If someone asked you today what a 100ms latency regression costs your business in revenue, could you produce a number, or would the room go quiet? Performance never gets the investment it needs because nobody has ever connected it to revenue in a way a non-engineer can act on.

Layer 4: Tooling Completeness

The gap between having telemetry and having observability. Event logging and basic dashboards are the collection layer. Observability is the work above them, and it separates a team that can answer questions from one that just collects data. The foundation is two pieces: perf CI checks that fail the build when a commit regresses bundle size or breaches a latency target, and production monitoring that pages the right humans the moment a core metric spikes outside its normal range.
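
The CI half of the foundation can start embarrassingly small. A sketch, with a hypothetical artifact path and budget:

```python
#!/usr/bin/env python3
"""Fail the build when the main bundle exceeds its checked-in budget.
The path and budget below are hypothetical placeholders."""
import os
import sys

BUDGET_BYTES = 450_000        # agreed budget for the main bundle
BUNDLE_PATH = "dist/main.js"  # hypothetical build artifact

size = os.path.getsize(BUNDLE_PATH)
if size > BUDGET_BYTES:
    print(f"FAIL: {BUNDLE_PATH} is {size:,} bytes, budget is {BUDGET_BYTES:,}")
    sys.exit(1)  # non-zero exit fails the CI job
print(f"OK: {BUNDLE_PATH} is {size:,} bytes (budget {BUDGET_BYTES:,})")
```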

Above the foundation, the investigation tools come in roughly the order a team builds them. A dashboard that slices any metric by every key dimension surfaces unexpected cohorts. A drill-down tool automates that slicing across every dimension at once and surfaces the single slice moving an aggregate. A canonical experiment report gives every launch the same exec-readable summary. Session-based analysis comes last, the deepest tool of the set, reconstructing a user's full path through the product when nothing else can explain what is happening.
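
To make the drill-down idea concrete, here is a minimal pandas sketch; the DataFrame, metric, and dimension names are invented, and a production tool would rank and weight far more carefully:

```python
import pandas as pd

def worst_slice(df, metric, dimensions):
    """Slice `metric` by each dimension and return the single (dimension, value)
    slice whose deviation from the global mean, weighted by traffic, is largest."""
    global_mean = df[metric].mean()
    candidates = []
    for dim in dimensions:
        grouped = df.groupby(dim)[metric].agg(["mean", "size"])
        impact = (grouped["mean"] - global_mean) * grouped["size"]
        candidates.append((dim, impact.idxmax(), impact.max()))
    return max(candidates, key=lambda c: c[2])

# Hypothetical usage over a sessions DataFrame:
# worst_slice(sessions, "latency_ms", ["country", "app_version", "device_class"])
```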

Skip this layer and your dashboards tell you something is wrong but not why. If your P99 spiked ten minutes ago, who on your team can write the SQL to find the exact cohort causing it? Diagnosis that lives in one engineer's SQL chops is neither repeatable nor reliable. The tooling layer exists to lift that knowledge out of one head and into a dashboard that gives every engineer the same answer.

Layer 5: A/B Experimentation

Without a statistically sound experimentation platform, you cannot safely ship anything non-trivial. The minimum viable stack is feature flag infrastructure that gates any code path and ramps exposure incrementally, with consistent bucketing into control and treatment arms. On top of that sit automatic stat sig calculation on every key metric, guardrail metrics that trip when something out of scope regresses, and a clear path from a statistically significant result to a 100% rollout.
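
The stat sig piece is less exotic than it sounds. A minimal two-proportion z-test, standard library only, with invented arm sizes and conversion counts:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test on conversion counts; returns (z, p)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical arms: control converts 10.0%, treatment 10.4%, 100k users each.
z, p = two_proportion_z_test(10_000, 100_000, 10_400, 100_000)
print(f"z={z:.2f}, p={p:.4f}")  # z~2.96, p~0.003: the lift clears stat sig
```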

Skip this layer and you ship blind. You cannot tell a real improvement from seasonal noise, you cannot catch a regression before it hits every user, and you cannot run the ablation study that Layer 3 depends on. When your team celebrates a launch win, how often can you prove it wasn't the usual weekly pattern? Every decision defaults to "we deployed it, and the graph went down, probably." You are guessing.

Layer 6: Governance and Institutional Embedding

Performance structurally baked into launch reviews, CI, code reviews, and design docs. Every trusted metric has a named owner, every launch review has an explicit performance sign-off, and every design doc has a performance considerations section. When a guardrail metric trips during an experiment, it is flagged automatically in the launch review rather than buried in a dashboard nobody reads. Every major production performance incident triggers a blameless postmortem with concrete action items, so the same failure does not happen twice.

Skip this layer and performance is someone's side job. Metrics exist but nobody looks at them when launches happen. Decisions default to whoever speaks loudest in the room that day. Who at your company is actually empowered to block a launch when a guardrail metric trips, and when was the last time they did it?

Layer 7: Data Literacy

The first of two persistent layers that run alongside everything else. Engineers do not need to be statisticians, but they need enough fluency to scrutinize data and ask good questions of it. They need to understand why percentiles beat averages and how to read a confidence interval. They need a grip on experiment mechanics like arm imbalance and statistical power. Finally, they need the discipline to slice metrics by every meaningful dimension so a globally neutral number does not hide a badly regressed cohort.
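
One concrete piece of that fluency is checking an experiment for arm imbalance, also called sample ratio mismatch, before trusting any of its metrics. A sketch, assuming an intended 50/50 split and invented counts:

```python
import math

def srm_p_value(n_control, n_treatment):
    """Two-sided test that an observed split is consistent with a fair 50/50
    bucketing. A tiny p-value means assignment is broken, and every metric
    downstream of it is suspect."""
    n = n_control + n_treatment
    z = (n_control - n / 2) / math.sqrt(n * 0.25)  # binomial sd with p=0.5
    return math.erfc(abs(z) / math.sqrt(2))

# A 50.3% / 49.7% split looks harmless on a million users, but:
print(f"p={srm_p_value(503_000, 497_000):.1e}")  # ~2e-9: not chance, investigate
```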

Skip this layer and everything else produces outputs the audience cannot interpret. Engineers declare victory on noise. They ignore real regressions because they do not understand what a wide confidence interval actually allows them to conclude. A global metric looks flat and the investigation stops. The framework generates the right data and the team reads it wrong.

Layer 8: Culture

The second persistent layer, and the substrate everything else runs on. Culture is engineers thinking about performance unprompted, and it rarely arrives through a top-down mandate. It arrives through a performance philosophy that articulates why this work matters and what good looks like. It arrives through champions and sponsors who carry the work to their subteams, public forums like tech talks and internal conferences, and senior leaders who say it out loud and often. Culture is what makes the other seven layers persist without one person constantly pushing them.

Skip this layer and every improvement requires top-down enforcement. Nobody opens a dashboard unless an executive asks. The work never compounds because the org never internalizes it. Culture is also tightly coupled to metric design: when metrics are untrustworthy, engineers form a quiet consensus to ignore them, and culture dies in the gap.

Which layer is broken at your company?

If you read through the list above and one of the failure modes sounded uncomfortably familiar, that is your starting point. The deep dives on this site are organized by layer, so as I publish more of them, you can walk directly to the one you need.

Start there and fix that layer. Then come back and look at what sits above it. Most orgs have more than one broken layer, but they almost always have a weakest one, and the rest of the framework is limited by it.