Originally posted as Fast TotW #98 on August 23, 2025
Updated 2025-10-03
Quicklink: abseil.io/fast/98
Effectively measuring an optimization project is an important part of its lifecycle. Overlooking a large positive (or negative) externality can lead us to make the wrong decisions about next steps and future projects. Nevertheless, the quest for accuracy needs to be balanced against the ROI of a better measurement. In this episode, we discuss strategies for deciding when to invest more time in measuring a project and when to move on.
In choosing a measurement strategy, we want to get the big picture right. We want to look for where a difference would change our decisions for a project: Do we roll back because something wasn’t really helpful and its complexity was unwarranted, or do we double down and keep working in the same space after a success?
Spending a small amount of effort to uncover a 2x difference in our measured results can have a tremendously positive ROI: The extra measurement time is cheaper than finding, developing, and landing another equally sized optimization from scratch. Conversely, spending twice the effort to refine the measurement of a much smaller optimization has a poor ROI, as the added precision is lost in the noise of larger effects.
It is easy to fixate on the numbers we have on the page while overlooking the numbers that are off the page. Not overlooking large positive (or negative) externalities that would change our conclusions matters more than eking out a seventh digit of precision.
Optimization outcomes are quantitative, which can make them an attractive nuisance: It is tempting to believe they are more precise and accurate than they really are. For many techniques, we can only measure 1 or 2 significant digits.
With a tool like Google-Wide Profiling, we can represent the fraction of Google fleet cycles spent in TCMalloc as a 17-digit, double-precision floating-point number. Writing out that many digits doesn’t mean the number has that much precision. Spending a few additional milliseconds in TCMalloc on a single server will change some of the digits. While the overall trend is stable, the total has a day-to-day standard deviation, and a second-order effect that is small relative to that variation can be lost entirely in the noise.
Aggregating more days gives us more samples, but if the true effect size is small, a longer longitudinal study is also exposed to confounding factors from unrelated changes in the environment. We may achieve a more precise measurement, yet those environmental changes erode our accuracy faster than the extra samples add precision.
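To make this concrete, here is a minimal sketch in Python, using made-up daily fractions rather than real GWP data, that contrasts the full double-precision value with the digits that day-to-day variation actually supports:

```python
# Hypothetical daily fractions of fleet cycles spent in TCMalloc (not real
# GWP data), reported as a full-precision double versus with only the digits
# that the day-to-day variation supports.
import statistics

daily_fraction = [0.0312, 0.0305, 0.0318, 0.0309, 0.0314, 0.0307, 0.0311]

mean = statistics.mean(daily_fraction)
stdev = statistics.stdev(daily_fraction)

print(f"full double precision: {mean:.17f}")  # 17 digits, most of them noise
print(f"day-to-day stdev:      {stdev:.4f}")  # ~0.0004, i.e. ~0.04pp
print(f"honest report:         {mean:.3f} +/- {stdev:.3f}")
```

Only the leading couple of digits survive the noise floor; the rest are an artifact of the representation, not of the measurement.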
We want to avoid confusing precision with accuracy, and vice versa. Just as a long-duration analysis can be stymied by confounding factors, a carefully controlled experiment can claim a high-precision result without being accurate.
Many load tests repeatedly replay traffic and compare performance between a modified and a baseline configuration. The repeated runs deliver large sample sizes and low-standard-deviation estimates, instilling a (potentially false) belief that the result is accurate. By construction, load tests are simplifications of production, so they may omit workload types, platforms, or environment features that would be seen in a widespread deployment.
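A small sketch with synthetic numbers shows the trap: hundreds of replayed runs shrink the standard error to almost nothing, but if the replayed traffic systematically omits part of the production workload, the resulting estimate is precise without being accurate.

```python
# Synthetic illustration: repeated load-test runs give a tight confidence
# interval, but the load test only replays part of the workload, so the
# precise estimate can still be centered on the wrong value.
import random
import statistics

random.seed(0)

# Hypothetical per-run speedups (%) from a load test that happens to favor
# the cache-friendly half of the workload.
load_test_runs = [random.gauss(4.0, 0.3) for _ in range(500)]

mean = statistics.mean(load_test_runs)
sem = statistics.stdev(load_test_runs) / len(load_test_runs) ** 0.5

print(f"load test: {mean:.2f}% +/- {1.96 * sem:.2f}% (95% CI)")
# If the true fleet-wide speedup, including the omitted workloads, is closer
# to 2%, the narrow interval above is precise but inaccurate.
```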
Even with tools like production A/B experiments, which can help give us accurate results by testing against the full variety of workloads, we still must account for nuances in the technique itself.
Spending time to account for these factors makes sense if we’re making real gains in accuracy: Large-scale data locality and contention effects can be challenging to measure by other means. For small changes without these externalities, we may expend a lot of measurement effort without obtaining a better answer.
Breaking work into small changes creates agility: manageable, focused chunks aid testing, debugging, and review. It can complicate the measurement story, though, since it adds more pieces that we need to track.
Rather than let the measurement tail wag the dog by forcing us to combine unrelated optimizations or forgo them entirely, we may opt to track the aggregate benefit of a series of optimizations made over a short period of time.
Even if we’re pursuing a general strategy of aggregating similar changes, we may want to separate out the effects of more tradeoff-heavy optimizations. A change that creates subtle edge cases or long-lived technical debt might be worth it if the optimization is sufficiently large. If we don’t end up realizing the benefits, we might prefer to roll it back and look for simpler approaches.
Aiming for greater accuracy can tempt us toward more complex methodologies. An approach that is easier to implement but captures most of the impact and regression signals from our project can free up time to work on the actual optimizations and look for externalities, and it can be just as accurate in the end.
Importantly, a simple measurement should not be confused with an incomplete one. The goal is to avoid needless precision, not to overlook significant impacts like transitive costs on other systems or shifts between different resource dimensions.
A more complex approach that captures small effects can also draw attention to its own details, causing us to overlook the larger effects it misses. Even when we make a good-faith attempt to quantify everything, it is easier to focus on the words on the page than on the words absent from it.
While we might capture additional phenomena relevant to performance, each additional factor we consider introduces new sources of error. Joining multiple data sources can be tempting, but if we are not familiar with their idiosyncrasies, we might end up with surprising (or wrong) results.
Simple is explainable. “This performance optimization only has an effect during Australian business hours” might be completely correct due to diurnal effects, but when our approach leaves so much implicit, it is harder to see the immediate connection between the result and a causal explanation.
When we consider additional factors for measuring a project, we should try to gauge the contribution of each relative to the significant figures provided by the others. If one signal contributes 1% +/- 0.1pp, it will overwhelm another term that contributes 0.01% +/- 0.001pp.
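As a back-of-the-envelope check (assuming the two signals are independent, so their uncertainties add in quadrature), combining the two terms above shows why refining the smaller one further is wasted effort:

```python
# Combine two independent signals; uncertainties add in quadrature.
a, sigma_a = 1.00, 0.10   # contributes 1% +/- 0.1pp
b, sigma_b = 0.01, 0.001  # contributes 0.01% +/- 0.001pp

total = a + b
sigma_total = (sigma_a**2 + sigma_b**2) ** 0.5

print(f"total: {total:.2f}% +/- {sigma_total:.2f}pp")  # 1.01% +/- 0.10pp
# Measuring the smaller term ten times more precisely would not change any
# digit that the 0.1pp uncertainty on the larger term allows us to report.
```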
In choosing a measurement strategy, we want to strike a balance between precision, accuracy, and effort. Accurate measurements can help us continue pursuing fruitful areas for adjacent optimizations, but we should be mindful of where increasing effort produces diminishing returns.