Originally posted as Fast TotW #88 on November 7, 2024
By Patrick Xia and Chris Kennelly
Updated 2024-11-18
Quicklink: abseil.io/fast/88
Measurement of performance optimizations often requires in-depth analysis due to the inherent stochasticity of workloads. We need to gather data to calculate the metrics we’ve selected. In this episode, we discuss the importance of defining your measurement methodology ahead of time rather than fishing for possible sources of significance after the data has been collected.
Choose and publish a methodology before looking at the data. Ideally, this process is completed before any changes land. It is otherwise easy to unwittingly pick the methodology that tells the best story (the “biggest number”) after the fact. Preregistration of the experiment methodology also helps avoid false positives in statistical analysis.
The Jelly Beans XKCD comic offers a humorous take on this situation, variously referred to as data dredging or p-hacking. Even with the comic in mind, data dredging remains a pernicious problem, and its less obvious forms can occur in many ways.
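As a hypothetical illustration (the scenario, distributions, and thresholds below are assumptions, not data from any real experiment), slicing a genuinely neutral change into many subgroups and testing each one at p < 0.05 almost guarantees that at least one slice looks “significant”:

```python
# Hypothetical simulation: a change with no real effect, sliced into 20
# subgroups (the "jelly bean colors"). Testing each at p < 0.05 finds at
# least one "significant" subgroup in roughly 1 - 0.95**20 ~ 64% of runs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
experiments, subgroups, samples = 1_000, 20, 100
dredged = 0
for _ in range(experiments):
    found = False
    for _ in range(subgroups):
        # Both arms draw from the same distribution: the null is true.
        control = rng.normal(loc=100.0, scale=10.0, size=samples)
        treatment = rng.normal(loc=100.0, scale=10.0, size=samples)
        if stats.ttest_ind(control, treatment).pvalue < 0.05:
            found = True
    dredged += found
print(f"runs with at least one 'significant' subgroup: {dredged / experiments:.0%}")
```

A preregistered methodology would instead name the primary metric and slices up front, before any results are in hand.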
Reuse measurement techniques when appropriate. While we shouldn’t shy away from improving our techniques over time, reusing an approach reduces the risk that we’ve overfit our analysis to the results.
If we are making a number of similar changes, we should generally strive for measurement consistency across them.
Publish the observation interval of the analysis. Optional stopping of experiments increases the risk of false positives. A sufficiently random experiment that is run for an arbitrary amount of time will eventually explore the entire outcome space. Stopping the experiment at the moment it happens to show the outcome we want biases us towards finding a signal where there is none.
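A small simulation makes the optional-stopping risk concrete. This is a sketch under assumed numbers (the batch sizes, distributions, and number of peeks are made up), not a description of any real experiment:

```python
# Hypothetical simulation: peek at a running A/B test after every batch and
# stop at the first p < 0.05. Even though the null is true, the false
# positive rate ends up far above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
experiments, batches, batch_size = 1_000, 50, 50
stopped_early = 0
for _ in range(experiments):
    control, treatment = np.empty(0), np.empty(0)
    for _ in range(batches):
        # Both arms have identical latency distributions: no real effect.
        control = np.append(control, rng.normal(100.0, 10.0, batch_size))
        treatment = np.append(treatment, rng.normal(100.0, 10.0, batch_size))
        if stats.ttest_ind(control, treatment).pvalue < 0.05:
            stopped_early += 1  # declared a "win" under the null
            break
print(f"false positive rate with optional stopping: {stopped_early / experiments:.0%}")
```

Publishing the observation interval in advance avoids this inflation.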
Precommit to a launch bar. For changes that add complexity, we want to be confident that the results merit the long-term cost. The error bars for a very weakly positive result might overlap with zero.
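For example, a precommitted launch bar might require that the confidence interval for the improvement exclude zero. The metric and numbers below are illustrative assumptions only:

```python
# Hypothetical check of a launch bar: require the 95% confidence interval
# for the improvement (baseline minus candidate, lower is better) to
# exclude zero before accepting the added complexity.
import numpy as np

rng = np.random.default_rng(1)
baseline = rng.normal(100.0, 10.0, 200)   # e.g. CPU-seconds per query
candidate = rng.normal(99.7, 10.0, 200)   # a very weakly positive change

improvement = baseline.mean() - candidate.mean()
stderr = np.sqrt(baseline.var(ddof=1) / len(baseline) +
                 candidate.var(ddof=1) / len(candidate))
lo, hi = improvement - 1.96 * stderr, improvement + 1.96 * stderr
print(f"improvement: {improvement:+.2f} (95% CI [{lo:+.2f}, {hi:+.2f}])")
if lo <= 0:
    print("Error bars overlap zero: does not clear the launch bar.")
```

The exact bar (confidence level, minimum effect size) is a choice; the point is to fix it before looking at the data.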
If we’re removing complexity (deleting an unused calculation, cleaning up dead flags, etc.), we would likely be inclined to do so anyway, independent of the efficiency savings, which are merely a nice-to-have bonus.
Publishing the methodology also means that we have fewer chances to make post-collection changes to the analysis. Trivial-looking changes (“clearly errant outlier data points can be deleted”) can undermine the soundness of the experiment.
Estimation is an important part of our project planning process. We work on a particular optimization because we think it is important enough to be worth our time.
When we read out the result of our hard work, we need to put our estimate aside and not anchor on it. The estimate was intentionally rough. It only needed to be “good enough” to break ties in our decision making to prioritize project A over project B over project C.
We should avoid scrutinizing a result simply because it differs from our prior estimates and results.
While we want our estimates to be well calibrated, we can learn from estimation mistakes as well. A “surprise” in either direction might indicate an opportunity that we did not notice upfront.
Nevertheless, extraordinary claims in either direction require extraordinary evidence. Sometimes an optimization appears to have an impact many times larger than we expected, or even larger than the apparent opportunity. Sometimes these results are real, due to step functions in cost or non-linear behavior around lock contention. More often, they are mismeasurement, and unexpectedly fortuitous results deserve extra scrutiny and cross-validation.
After we have published a methodology, we should actually follow that methodology when reporting results. When going through the data analysis, the shape of the data may look completely different from what we expected. Even when this occurs, it is important to produce the results using the original methodology for posterity’s sake. Future adjustments may refine the readout of the project’s impact, but the number from the original methodology should still be reported.
It’s important to re-emphasize that we don’t anchor on the original estimate of impact. The shape of the data may look different because previous estimates were not accurate.
After analyzing the data, it may become clear that some part of the methodology was lacking; for example, a confounding variable that was not considered earlier may become apparent in the analyzed data. Adjustments at this stage are incredibly prone to bias and statistical error, so additional caution is mandatory. The null hypothesis of a neutral result is a perfectly acceptable outcome as well.
When considering a potential confounder, we want to have a scientifically sound mechanism to explain why we should include it. Suppose after we changed function A, we saw an unexpected improvement in function B. We could update our analysis to include the impact on function B, but we should be able to explain why we’re not including functions C or D as well.
The simplest criterion (all functions that use the particular data structure we modified) is sounder than cherry-picking an arbitrary set of functions (e.g. all `const` functions affecting `mutable` members). “We included B because it improved and excluded C and D because they regressed” would be suspicious. When adding more criteria after the fact, we should be able to define them without peeking at the results.
While convenient, longitudinal analysis can pick up truly unrelated confounders from changes in usage, such as mix-shifts in the workload over the measurement period.
Tossing data points after the fact because they “seem irrelevant” or are “due to mix-shift” should clear a high bar. It’s very easy to bias our analysis by disproportionately suppressing negative outliers, as the sketch below illustrates.
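This sketch uses synthetic data and an arbitrary 5% trim, assuming a lower-is-better metric like latency; it is an illustration, not a real measurement:

```python
# Hypothetical simulation: the candidate is genuinely neutral, but silently
# dropping its slowest 5% of samples ("clearly errant outliers") manufactures
# an apparent improvement.
import numpy as np

rng = np.random.default_rng(2)
baseline = rng.normal(100.0, 10.0, 500)
candidate = rng.normal(100.0, 10.0, 500)  # no real improvement

print(f"honest improvement:          {baseline.mean() - candidate.mean():+.2f}")

trimmed = np.sort(candidate)[: int(0.95 * len(candidate))]  # drop slowest 5%
print(f"after suppressing outliers:  {baseline.mean() - trimmed.mean():+.2f}")
```

If outliers must be handled at all, the rule should be decided before looking at the data and applied to both arms.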
A/B experimentation takes more effort, but it can control for many of these variables and mix-shifts by isolating a single independent variable. Ablating or rolling back a change can also provide a compelling case for causality: if the change did cause the unanticipated effects we saw, those effects should revert as well, with similar but opposite magnitude.
Unanticipated improvements can be a source of future opportunities by highlighting promising areas to look for more ideas, but this only works if the improvements are real. If we fool ourselves into believing there was an oversized effect, our follow-on projects may fail to pan out.
Committing to a measurement methodology ahead of time helps us preserve rigor when we might otherwise fish for possible sources of significance after the data has been collected.