Originally posted as Fast TotW #88 on November 7, 2024
By Patrick Xia and Chris Kennelly
Updated 2024-11-18
Quicklink: abseil.io/fast/88
Measurement of performance optimizations often requires in-depth analysis due to the inherent stochasticity of workloads. We need to gather data to calculate the metrics we’ve selected. In this episode, we discuss the importance of defining your measurement methodology ahead of time rather than fishing for possible sources of significance after the data has been collected.
Choose and publish a methodology before looking at the data. Ideally, this process is completed before any changes land. It is otherwise easy to unwittingly pick the methodology that tells the best story (the “biggest number”) after the fact. Preregistration of the experiment methodology also helps avoid false positives in statistical analysis.
The Jelly Beans XKCD comic offers a humorous take on this situation, variously referred to as data dredging or p-hacking. Even with the comic in mind, data dredging remains a pernicious problem, and its less obvious forms can occur in many ways.
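As a hypothetical illustration (the scenario, distributions, and thresholds below are assumptions, not data from any real experiment), slicing a genuinely neutral change into many subgroups and testing each one at p < 0.05 almost guarantees that at least one slice looks “significant”:

```python
# Hypothetical simulation: a change with no real effect, sliced into 20
# subgroups (the "jelly bean colors"). Testing each at p < 0.05 finds at
# least one "significant" subgroup in roughly 1 - 0.95**20 ~ 64% of runs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
experiments, subgroups, samples = 1_000, 20, 100
dredged = 0
for _ in range(experiments):
    found = False
    for _ in range(subgroups):
        # Both arms draw from the same distribution: the null is true.
        control = rng.normal(loc=100.0, scale=10.0, size=samples)
        treatment = rng.normal(loc=100.0, scale=10.0, size=samples)
        if stats.ttest_ind(control, treatment).pvalue < 0.05:
            found = True
    dredged += found
print(f"runs with at least one 'significant' subgroup: {dredged / experiments:.0%}")
```

A preregistered methodology would instead name the primary metric and slices up front, before any results are in hand.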
Reuse measurement techniques when appropriate. While we shouldn’t shy away from improving our techniques over time, reusing an approach reduces the risk that we’ve overfit our analysis to the results.
If we are making a number of similar changes, we should generally strive for measurement consistency across them.
Publish the observation interval of the analysis. Optional stopping of experiments increases the risk of false positives. A sufficiently random experiment that is run for an arbitrary amount of time will eventually explore the entire outcome space. Stopping the experiment at the moment it happens to show the outcome we want biases us towards finding a signal where there is none.
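A small simulation makes the optional-stopping risk concrete. This is a sketch under assumed numbers (the batch sizes, distributions, and number of peeks are made up), not a description of any real experiment:

```python
# Hypothetical simulation: peek at a running A/B test after every batch and
# stop at the first p < 0.05. Even though the null is true, the false
# positive rate ends up far above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
experiments, batches, batch_size = 1_000, 50, 50
stopped_early = 0
for _ in range(experiments):
    control, treatment = np.empty(0), np.empty(0)
    for _ in range(batches):
        # Both arms have identical latency distributions: no real effect.
        control = np.append(control, rng.normal(100.0, 10.0, batch_size))
        treatment = np.append(treatment, rng.normal(100.0, 10.0, batch_size))
        if stats.ttest_ind(control, treatment).pvalue < 0.05:
            stopped_early += 1  # declared a "win" under the null
            break
print(f"false positive rate with optional stopping: {stopped_early / experiments:.0%}")
```

Publishing the observation interval in advance avoids this inflation.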
Precommit to a launch bar. For changes that add complexity, we want to be confident that the results merit the long-term cost. The error bars for a very weakly positive result might overlap with zero.
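For example, a precommitted launch bar might require that the confidence interval for the improvement exclude zero. The metric and numbers below are illustrative assumptions only:

```python
# Hypothetical check of a launch bar: require the 95% confidence interval
# for the improvement (baseline minus candidate, lower is better) to
# exclude zero before accepting the added complexity.
import numpy as np

rng = np.random.default_rng(1)
baseline = rng.normal(100.0, 10.0, 200)   # e.g. CPU-seconds per query
candidate = rng.normal(99.7, 10.0, 200)   # a very weakly positive change

improvement = baseline.mean() - candidate.mean()
stderr = np.sqrt(baseline.var(ddof=1) / len(baseline) +
                 candidate.var(ddof=1) / len(candidate))
lo, hi = improvement - 1.96 * stderr, improvement + 1.96 * stderr
print(f"improvement: {improvement:+.2f} (95% CI [{lo:+.2f}, {hi:+.2f}])")
if lo <= 0:
    print("Error bars overlap zero: does not clear the launch bar.")
```

The exact bar (confidence level, minimum effect size) is a choice; the point is to fix it before looking at the data.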
If we’re removing complexity (deleting an unused calculation, cleaning up dead flags, etc.), we would likely be inclined to do so anyway, independent of the efficiency savings, which are merely a nice-to-have bonus.
Publishing the methodology also means that we have fewer chances to make post-collection changes to the analysis. Trivial-looking changes (“clearly errant outlier data points can be deleted”) can undermine the soundness of the experiment.
Estimation is an important part of our project planning process. We work on a particular optimization because we think it is important enough to be worth our time.
When we read out the result of our hard work, we need to put our estimate aside and not anchor on it. The estimate was intentionally rough. It only needed to be “good enough” to break ties in our decision making to prioritize project A over project B over project C.
We should avoid scrutinizing a result simply because it differs from our prior estimates and results.
While we want our estimates to be well calibrated, we can learn from estimation mistakes as well. A “surprise” in either direction might indicate an opportunity that we did not notice upfront.
Nevertheless, extraordinary claims in either direction require extraordinary evidence. Sometimes an optimization appears to have an impact many times larger than we expected, or even larger than the apparent opportunity. Sometimes these results are real, due to step functions in cost or non-linear behavior around lock contention. More often, they are mismeasurement, and unexpectedly fortuitous results deserve extra scrutiny and cross-validation.
After we have published a methodology, we should actually follow that methodology when reporting results. When going through the data analysis, the shape of the data may look completely different from what we expected. Even when this occurs, it is important to produce the results using the original methodology for posterity’s sake. Future adjustments may refine the readout of the project’s impact, but the number from the original methodology should still be reported.
It’s important to re-emphasize that we don’t anchor on the original estimate of impact. The shape of the data may look different because previous estimates were not accurate.
After analyzing the data, it may become clear that some part of the methodology was lacking; for example, a confounding variable that was not considered earlier may become apparent in the analyzed data. Adjustments at this stage are incredibly prone to bias and statistical error, so additional caution is mandatory. The null hypothesis of a neutral result is a perfectly acceptable outcome as well.
When considering a potential confounder, we want to have a scientifically sound mechanism to explain why we should include it. Suppose after we changed function A, we saw an unexpected improvement in function B. We could update our analysis to include the impact on function B, but we should be able to explain why we’re not including functions C or D as well.
The simplest criterion (all functions that use the particular data structure we modified) is sounder than cherry-picking an arbitrary set of functions (e.g. all `const` functions affecting `mutable` members). “We included B because it improved and excluded C and D because they regressed” would be suspicious. When adding more criteria after the fact, we should be able to define them without peeking at the results.
While convenient, longitudinal analysis can pick up truly unrelated confounders from changes in usage, such as mix-shifts in the workload over the measurement period.
Tossing data points after the fact because they “seem irrelevant” or are “due to mix-shift” should clear a high bar. It’s very easy to bias our analysis by disproportionately suppressing negative outliers, as the sketch below illustrates.
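This sketch uses synthetic data and an arbitrary 5% trim, assuming a lower-is-better metric like latency; it is an illustration, not a real measurement:

```python
# Hypothetical simulation: the candidate is genuinely neutral, but silently
# dropping its slowest 5% of samples ("clearly errant outliers") manufactures
# an apparent improvement.
import numpy as np

rng = np.random.default_rng(2)
baseline = rng.normal(100.0, 10.0, 500)
candidate = rng.normal(100.0, 10.0, 500)  # no real improvement

print(f"honest improvement:          {baseline.mean() - candidate.mean():+.2f}")

trimmed = np.sort(candidate)[: int(0.95 * len(candidate))]  # drop slowest 5%
print(f"after suppressing outliers:  {baseline.mean() - trimmed.mean():+.2f}")
```

If outliers must be handled at all, the rule should be decided before looking at the data and applied to both arms.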
A/B experimentation takes more effort, but it can control for many of these variables and mix-shifts by isolating a single independent variable. Ablating or rolling back a change can also provide a compelling case for causality: if the change did cause the unanticipated effects we saw, those effects should revert as well, with similar but opposite magnitude.
Unanticipated improvements can be a source of future opportunities by highlighting promising areas to look for more ideas, but this only works if the improvements are real. If we fool ourselves into believing there was an oversized effect, our follow-on projects may fail to pan out.
Committing to a measurement methodology ahead of time helps us preserve rigor when we might otherwise fish for possible sources of significance after the data has been collected.