Originally posted as Fast TotW #98 on August 23, 2025
Updated 2025-10-03
Quicklink: abseil.io/fast/98
Effectively measuring an optimization project is an important part of its lifecycle. Overlooking a large positive (or negative) externality can lead us to make the wrong decisions about next steps and future projects. Nevertheless, the quest for accuracy needs to be balanced against the ROI of a better measurement. In this episode, we discuss strategies for deciding when to invest more time in measuring a project and when to move on.
In choosing a measurement strategy, we want to get the big picture right. We want to look for where a difference would change our decisions for a project: Do we roll back because something wasn’t really helpful and its complexity was unwarranted, or do we double down and keep working in the same space after a success?
Spending a small amount of effort to uncover a 2x difference in our measured results can have a tremendously positive ROI: The extra measurement time is cheaper than finding, developing, and landing another equally sized optimization from scratch. Conversely, spending twice the effort to refine the measurement of a much smaller optimization has a poor ROI, as the added precision is lost in the noise of larger effects.
It is easy to fixate on the numbers we have on the page while overlooking the numbers that are off the page. Not overlooking large positive (or negative) externalities that would change our conclusions matters more than eking out a seventh digit of precision.
Optimization outcomes are quantitative, which can make them an attractive nuisance: It is tempting to believe they are more precise and accurate than they really are. For many techniques, we can only measure 1 or 2 significant digits.
With a tool like Google-Wide Profiling, we can represent the fraction of Google fleet cycles spent in TCMalloc as a 17-digit, double-precision floating-point number. Writing out that many digits doesn’t mean the number has that much precision. Spending a few additional milliseconds in TCMalloc on a single server will change some of the digits. While the overall trend is stable, the total has a day-to-day standard deviation, and a second-order effect that is small relative to that variation can be lost entirely in the noise.
Aggregating more days gives us more samples, but if the true effect size is small, a longer longitudinal study is also exposed to confounding factors from unrelated changes in the environment. We may achieve a more precise measurement, yet those environmental changes erode our accuracy faster than the extra samples add precision.
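To make this concrete, here is a minimal sketch in Python, using made-up daily fractions rather than real GWP data, that contrasts the full double-precision value with the digits that day-to-day variation actually supports:

```python
# Hypothetical daily fractions of fleet cycles spent in TCMalloc (not real
# GWP data), reported as a full-precision double versus with only the digits
# that the day-to-day variation supports.
import statistics

daily_fraction = [0.0312, 0.0305, 0.0318, 0.0309, 0.0314, 0.0307, 0.0311]

mean = statistics.mean(daily_fraction)
stdev = statistics.stdev(daily_fraction)

print(f"full double precision: {mean:.17f}")  # 17 digits, most of them noise
print(f"day-to-day stdev:      {stdev:.4f}")  # ~0.0004, i.e. ~0.04pp
print(f"honest report:         {mean:.3f} +/- {stdev:.3f}")
```

Only the leading couple of digits survive the noise floor; the rest are an artifact of the representation, not of the measurement.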
We want to avoid confusing precision with accuracy, and vice versa. Just as a long-duration analysis can be stymied by confounding factors, a carefully controlled experiment can claim a high-precision result without being accurate.
Many load tests repeatedly replay traffic and compare performance between a modified and a baseline configuration. The repeated runs deliver large sample sizes and low-standard-deviation estimates, instilling a (potentially false) belief that the result is accurate. By construction, load tests are simplifications of production, so they may omit workload types, platforms, or environment features that would be seen in a widespread deployment.
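A small sketch with synthetic numbers shows the trap: hundreds of replayed runs shrink the standard error to almost nothing, but if the replayed traffic systematically omits part of the production workload, the resulting estimate is precise without being accurate.

```python
# Synthetic illustration: repeated load-test runs give a tight confidence
# interval, but the load test only replays part of the workload, so the
# precise estimate can still be centered on the wrong value.
import random
import statistics

random.seed(0)

# Hypothetical per-run speedups (%) from a load test that happens to favor
# the cache-friendly half of the workload.
load_test_runs = [random.gauss(4.0, 0.3) for _ in range(500)]

mean = statistics.mean(load_test_runs)
sem = statistics.stdev(load_test_runs) / len(load_test_runs) ** 0.5

print(f"load test: {mean:.2f}% +/- {1.96 * sem:.2f}% (95% CI)")
# If the true fleet-wide speedup, including the omitted workloads, is closer
# to 2%, the narrow interval above is precise but inaccurate.
```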
Even with tools like production A/B experiments, which can help give us accurate results by testing against the full variety of workloads, we still must account for nuances in the technique itself.
Spending time to account for these factors makes sense if we’re making real gains in accuracy: Large-scale data locality and contention effects can be challenging to measure by other means. For small changes without these externalities, we may expend a lot of measurement effort without obtaining a better answer.
Breaking work into small changes creates agility: manageable, focused chunks aid testing, debugging, and review. It can complicate the measurement story, though, since it adds more pieces that we need to track.
Rather than let the measurement tail wag the dog by forcing us to combine unrelated optimizations or forgo them entirely, we may opt to track the aggregate benefit of a series of optimizations made over a short period of time.
Even if we’re pursuing a general strategy of aggregating similar changes, we may want to separate out the effects of more tradeoff-heavy optimizations. A change that creates subtle edge cases or long-lived technical debt might be worth it if the optimization is sufficiently large. If we don’t end up realizing the benefits, we might prefer to roll it back and look for simpler approaches.
Aiming for greater accuracy can tempt us toward more complex methodologies. An approach that is easier to implement but captures most of the impact and regression signals from our project can free up time to work on the actual optimizations and look for externalities, and it can be just as accurate in the end.
Importantly, a simple measurement should not be confused with an incomplete one. The goal is to avoid needless precision, not to overlook significant impacts like transitive costs on other systems or shifts between different resource dimensions.
A more complex approach that captures small effects can also draw attention to its own details, causing us to overlook the larger effects it misses. Even when we make a good-faith attempt to quantify everything, it is easier to focus on the words on the page than on the words absent from it.
While we might capture additional phenomena relevant to performance, each additional factor we consider introduces new sources of error. Joining multiple data sources can be tempting, but if we are not familiar with their idiosyncrasies, we might end up with surprising (or wrong) results.
Simple is explainable. “This performance optimization only has an effect during Australian business hours” might be completely correct due to diurnal effects, but when our approach leaves so much implicit, it is harder to see the immediate connection between the result and a causal explanation.
When we consider additional factors for measuring a project, we should try to gauge the contribution of each relative to the significant figures provided by the others. If one signal contributes 1% +/- 0.1pp, it will overwhelm another term that contributes 0.01% +/- 0.001pp.
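As a back-of-the-envelope check (assuming the two signals are independent, so their uncertainties add in quadrature), combining the two terms above shows why refining the smaller one further is wasted effort:

```python
# Combine two independent signals; uncertainties add in quadrature.
a, sigma_a = 1.00, 0.10   # contributes 1% +/- 0.1pp
b, sigma_b = 0.01, 0.001  # contributes 0.01% +/- 0.001pp

total = a + b
sigma_total = (sigma_a**2 + sigma_b**2) ** 0.5

print(f"total: {total:.2f}% +/- {sigma_total:.2f}pp")  # 1.01% +/- 0.10pp
# Measuring the smaller term ten times more precisely would not change any
# digit that the 0.1pp uncertainty on the larger term allows us to report.
```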
In choosing a measurement strategy, we want to strike a balance between precision, accuracy, and effort. Accurate measurements can help us continue pursuing fruitful areas for adjacent optimizations, but we should be mindful of where increasing effort produces diminishing returns.