Originally posted as Fast TotW #94 on June 26, 2025
Updated 2025-06-27
Quicklink: abseil.io/fast/94
Profiling tools are vital for narrowing the search space of all possible changes we could make and highlighting the best uses of our time. In this episode, we discuss strategies for identifying when we might want to seek out more data and when we are in the midst of diminishing returns.
The ultimate outcomes that we want from our tools are insights that let us improve performance.
We can almost always do more analysis – increase the time horizon, consider auxiliary metrics, try alternative bucketing strategies – to try to make a decision. This can initially be helpful for finding unforeseen situations, but we might mire ourselves in spurious findings or develop analysis paralysis.
At some point, redundant tools, extra precision, or further experiments add less value than their cost to obtain. We can still seek these things, but if they aren’t changing the decisions we make, we could do without them and achieve the same outcome.
It’s easy to fall into a trap of looking for one last thing, but this carries an opportunity cost. We will occasionally make mistakes in the form of unsuccessful projects. While some of these could be avoided in hindsight with extra analysis or a small “fail fast” prototype, these are tolerable when the stakes are low.
When considering an experiment to get additional information, ask yourself questions like “can the possible range of outcomes convince me to change my mind?” Otherwise, constructing additional experiments may merely serve to provide confirmation bias. The following are examples where the extra analysis seems appealing, but is ultimately unnecessary:
We might try out a new optimization technique with minimal rollout scope and complexity. If it works, we’ve gained confidence. If it fails, we should change our plans. If either of those possible outcomes would be insufficient to persuade us, we should instead develop a plan that will provide a definitive next step.
If we plan to iteratively move from microbenchmark to loadtest to production, we should stop or at least reconsider if the early benchmarks produce neutral results rather than moving to production. The fact that a benchmark can give us a result different from production might motivate us to move forward anyway, but tediously collecting the data from them is pointless if our plan is to unconditionally ignore them.
Estimates only need to be good enough to break ties. If our plan is to look at the top 10 largest users of a library for performance opportunities, improved accuracy might adjust the order somewhat, but not dramatically. Historical data might give us slightly different orders for rankings chosen using yesterday’s data versus last week’s, but the bulk of the distribution is often roughly consistent.
Consider a project where we’re optimistic it will work, but we have a long slog ahead of us to get it over the finish line. There are a couple of strategies we can use to quickly gain confidence in the optimization or abandon it if it fails to provide the expected benefits:
We might be interested in deploying a new data layout, either in-memory or on-disk. A simple prototype that simulates an approximation of that layout can tell us a lot about the probable performance characteristics. Even before we can handle every edge case, the “best case” scenario not panning out gives us a reason to stop and go no further.
For example, as we attempt to remove data indirections, we might microbenchmark a few candidate layouts for a data structure, comparing the status quo std::vector<T*> against std::deque<T>, absl::InlinedVector<T*, 4>, and std::vector<T>. Each of these solutions has its merits based on the constraints of T and the access pattern of the program. Having a sense of where the largest opportunity lies for the current access pattern can help us focus our attention and avoid sinking time into a migration before we have derisked anything.
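As a sketch of what such a comparison might look like, the snippet below uses Google Benchmark to contrast the pointer-indirected status quo against a contiguous layout. The element type T and the benchmark names here are hypothetical placeholders; the std::deque<T> and absl::InlinedVector<T*, 4> candidates would follow the same pattern.

```cpp
#include <memory>
#include <vector>

#include "benchmark/benchmark.h"

// Hypothetical element type; a real comparison would use the actual T.
struct T {
  int value = 0;
};

// Status quo: every element lives behind a pointer indirection.
static void BM_SumVectorOfPointers(benchmark::State& state) {
  std::vector<std::unique_ptr<T>> v;
  for (int i = 0; i < state.range(0); ++i) {
    v.push_back(std::make_unique<T>(T{i}));
  }
  for (auto _ : state) {
    int sum = 0;
    for (const auto& p : v) sum += p->value;
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK(BM_SumVectorOfPointers)->Range(8, 8 << 10);

// Candidate layout: values stored contiguously, no indirection.
static void BM_SumVectorOfValues(benchmark::State& state) {
  std::vector<T> v;
  for (int i = 0; i < state.range(0); ++i) v.push_back(T{i});
  for (auto _ : state) {
    int sum = 0;
    for (const auto& t : v) sum += t.value;
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK(BM_SumVectorOfValues)->Range(8, 8 << 10);

BENCHMARK_MAIN();
```

A result like this doesn’t replace production data, but it tells us early whether the access pattern rewards contiguity enough to justify committing to a migration.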
Getting things over the finish line may still be 80% of the work, but our initial work will have derisked the outcome. Investing time in polish for a project that ultimately fails carries a high opportunity cost.
The analysis we do is just to determine the course of action we should take. Ultimately, we care about the impact of the action, not how elegant our plans were. A promising change backed by good estimates and preliminary benchmarks still has to succeed when deployed to production to actually be a success. Good benchmarks alone are not the outcome we are after. Sometimes we’ll find that promising analysis or benchmarks do not turn into benefits in production, and the outcome is instead an opportunity for learning.
If we chose among several candidate solutions to a problem, we should confirm that the qualities that drove the choice held up. For example, we might have picked a strategy that was more complex but perceived to be more reliable than an alternative. If the project was otherwise a success but reliability suffered, we should reconsider whether the alternatives are worth revisiting.
Our tools sometimes have blind spots that we need to consider when using them. Simplifications to get a “good enough” result can help our priors, but we should be cautious about extrapolating too broadly from them.
More data points with the same caveats merely make us overconfident, not more accurate. When the stakes are higher, cross-validation against other information can help uncover gaps. More data points from distinct vantage points are more valuable than more data points from the same ones. We should prime ourselves to consider what new evidence would cause us to reconsider our plans.
Many profilers are cost-conscious to minimize their impact. To do this, they employ sampling strategies that omit some data points.
Our hashtable profiler makes its sampling decisions when tables are first mutated. Avoiding a sampling decision in the default constructor keeps things efficient, but means that empty tables are not represented in the statistics. Using other profilers, we can determine that many destroyed tables are in fact empty.
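A simplified sketch of how such a blind spot arises follows; the class and helper names are hypothetical, not the actual profiler’s API.

```cpp
#include <cstdint>

// Hypothetical sketch: the sampling decision is deferred until the first
// mutation so that default construction stays cheap.
class SampledTable {
 public:
  SampledTable() = default;  // No sampling decision here.

  void Insert(int64_t key) {
    if (!sampling_decided_) {
      sampling_decided_ = true;
      MaybeStartSampling();  // Weighted coin flip; registers with the profiler.
    }
    // ... actual insertion of `key` elided ...
  }

 private:
  void MaybeStartSampling() { /* elided */ }
  bool sampling_decided_ = false;
};

// A table that is constructed and destroyed without ever calling Insert()
// never reaches MaybeStartSampling(), so empty tables are invisible to the
// profiler's statistics.
```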
Historically, TCMalloc’s lifetime profiler had a similar caveat. To simplify the initial implementation, it only reported objects that had been both allocated and deallocated during a profiling session. It omitted preexisting objects (left-censoring) and objects that outlived the session (right-censoring). This profiler has since been improved to include these, but understanding a profiler’s limitations is crucial to avoid drawing the wrong conclusions from biased data.
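To make the censoring concrete, an initial implementation along those lines would effectively have filtered records as below. This is a hypothetical sketch, not TCMalloc’s actual data model; the record shape and timestamps are invented for illustration.

```cpp
#include <optional>
#include <vector>

// Hypothetical record shape; timestamps are seconds since some epoch.
struct AllocationRecord {
  double alloc_time;
  std::optional<double> free_time;  // Unset if the object is still live.
};

// Report only objects whose entire lifetime fell inside the profiling session
// [start, end]: objects allocated before `start` (left-censored) and objects
// still live at `end` (right-censored) are silently dropped.
std::vector<AllocationRecord> NaiveLifetimeReport(
    const std::vector<AllocationRecord>& all, double start, double end) {
  std::vector<AllocationRecord> report;
  for (const auto& r : all) {
    if (r.alloc_time >= start && r.free_time.has_value() &&
        *r.free_time <= end) {
      report.push_back(r);
    }
  }
  return report;
}
```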
Profilers tracking CPU cycles are often measuring how long it took for an instruction to retire. The profile hides the cost of instructions that come after a high-latency one. In other situations, diffusely attributed costs may obscure the actual critical path of the function, found only by careful analysis or tools like llvm-mca.
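For a simplified illustration, consider a pointer-chasing loop; this is a hypothetical sketch, not drawn from any particular profile.

```cpp
struct Node {
  Node* next;
  long payload;
};

// Walks a linked list, summing payloads. The serial chain of `n = n->next`
// loads (likely cache misses) is the real constraint, but a cycle profile
// that attributes time to retiring instructions may show almost all of the
// cost on the missing load while the instructions retiring right after it
// appear nearly free. Dependency-aware tools such as llvm-mca can make the
// chain itself visible.
long SumList(const Node* n) {
  long sum = 0;
  while (n != nullptr) {
    sum += n->payload;
    n = n->next;
  }
  return sum;
}
```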
These examples illustrate how not everything can be measured with a sampling-based profiler, but different approaches can often fill in the gaps.
Running a load test or even canarying a change in production for a single application can increase our confidence that something will work. Nevertheless, the limited scope doesn’t assure us that the effect will be the same on a wider population.
This pitfall cuts in both positive and negative directions. If we have an optimization for a library that an application doesn’t use, no amount of testing with that application is likely to produce a real effect. A negative result where the optimization had no opportunity to show a benefit should not deter us, yet we might abandon a good idea because of that spurious result. Conversely, an optimization might falter as it is brought to a broader scale, because our initial data points were biased by the streetlight effect.
We can avoid this by measuring the effect on the wider population or by choosing a broadly representative set of data points, instead of just one. At a minimum, we should be confident that the method we have chosen to evaluate the optimization has the potential to show a positive or negative impact.
As we work to gather data to guide our decisions, we should make sure we’re looking for evidence that could lead us to change our plans. It is easy to seek additional data points merely to increase confidence, but at that point we may simply be indulging confirmation bias.