Originally posted as Fast TotW #90 on February 6, 2025
Updated 2025-02-06
Quicklink: abseil.io/fast/90
Estimating savings can be useful for improving our decision making. In this episode, we discuss how to make and use estimates in the optimization lifecycle.
While looking for and developing optimizations, we frequently use performance estimates to decide how to approach problems and where to spend time.
The outcome of better estimates is better decisions, and that goal tells us how much precision we need. If we’re considering project A that will make things 5% faster and project B that will make things 0.1% faster, we don’t need to worry about additional significant figures for project A’s estimate. A more precise estimate for project A, say 5.134%, won’t change our prioritization, all else (effort, complexity, etc.) being equal.
Profiles are our jumping-off point for sizing how large an opportunity might be. When we’re deciding between competing choices, this information can give us a high-order bit to winnow the set for further investigation.
The best-case scenario is that we completely eliminate a bottleneck: if we drove something’s cost all the way to zero, would it be a worthwhile opportunity, or are there larger ones to pursue? This can be a handy heuristic as we fill in the blanks for where an optimization might not be applicable. We can assume the best case (the optimization applies everywhere, to the fullest degree) and see whether the result is still meaningful. If it’s not, a more accurate estimate won’t change the outcome.
The most common starting point is a CPU or heap profile. We can refine this information to identify a rough upper bound for the opportunity. To illustrate this, consider a few examples:
As of February 2025, the protobuf implementation checks `the_repeated_field.size() != 0` to decide whether a repeated field is present and in need of further recursion. While there’s nothing inherently wrong with this, a message with many absent fields might be able to check their presence more efficiently with “has bit”-like auxiliary data rather than touching many cachelines.
In the very best case, we can make two assumptions. First, the added cost of toggling a bit every time we create a mutable reference to the field is free. Second, every call to `size()` internal to the protocol buffer implementation can be elided because the fields are all absent. These are gross simplifying assumptions (and not true), but they provide an upper bound.
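To make the difference concrete, here is a minimal sketch under those assumptions (not actual protobuf-generated code; `MyMessage`, its fields, and `has_bits` are hypothetical stand-ins):

```c++
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a generated message: std::vector approximates
// RepeatedField storage, and has_bits approximates "has bit"-like auxiliary
// presence data.
struct MyMessage {
  std::vector<int> foo, bar, baz;
  uint32_t has_bits = 0;  // one bit set per non-empty repeated field
};

// size()-based check: deciding that every field is absent requires touching
// each field's storage, potentially many cachelines on a wide message.
bool AnyFieldPresentBySize(const MyMessage& m) {
  return m.foo.size() != 0 || m.bar.size() != 0 || m.baz.size() != 0;
}

// "Has bit"-style check: a single load of auxiliary data rules out all of the
// absent fields at once.
bool AnyFieldPresentByBits(const MyMessage& m) { return m.has_bits != 0; }
```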
The C++ protocol buffer parsing implementation uses a series of tail-called functions for parsing each type of field. Prior to the `preserve_none` calling convention added in LLVM and adopted by protobuf, these function bodies would preserve registers that did not need preserving, adding extra `push`/`pop` instructions and code-size pressure.
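As a sketch of what adopting the convention can look like at the source level (this assumes a recent Clang; the parser function name below is hypothetical):

```c++
// Use Clang's preserve_none calling convention when available; it relieves
// the callee of preserving most registers across the call.
#if defined(__clang__) && defined(__has_attribute)
#if __has_attribute(preserve_none)
#define PARSER_CC __attribute__((preserve_none))
#endif
#endif
#ifndef PARSER_CC
#define PARSER_CC
#endif

// Hypothetical tail-called field parser: with PARSER_CC the compiler can omit
// the push/pop sequences that would otherwise save and restore registers.
PARSER_CC const char* ParseVarintField(const char* ptr, void* msg,
                                       unsigned field_number);
```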
We could use profiles to identify parsing functions whose `push`/`pop` instructions set up frames, sum their total cost, and scale that by past experience with optimizations that eliminated stack frames in other libraries to produce an estimate.
Copies and destructors consume a large fraction of compute. While we cannot necessarily eliminate all of those copies and their corresponding destructors, some can be elided by using a const reference or a move. Tools allow us to join profiles against a specific C++ pattern to estimate the potential savings.
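As a minimal sketch of the pattern (the consuming functions are hypothetical):

```c++
#include <string>
#include <utility>
#include <vector>

// Hypothetical consumers of a container.
void TakesCopy(std::vector<std::string> v);
void TakesConstRef(const std::vector<std::string>& v);

void Example(std::vector<std::string> names) {
  TakesCopy(names);             // copy constructed, then destroyed: extra work
  TakesConstRef(names);         // borrowed: no copy, no extra destructor
  TakesCopy(std::move(names));  // contents moved rather than copied
}
```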
Other PMU events and server facts can give us more information about the size and scope of a problem.
Allocation size information from TCMalloc profiles can help tell us about the size of a particular container. If all of the memory requests coming from a `std::vector<MyType>` instance are for exactly `sizeof(MyType)` bytes, the vector’s size is likely 1.
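A small sketch of why that inference works (illustrative type and sizes):

```c++
#include <cstdio>
#include <vector>

struct MyType {
  char payload[48];  // sizeof(MyType) == 48 in this sketch
};

int main() {
  // A vector holding one element allocates capacity * sizeof(MyType) bytes.
  // If a heap profile only ever shows 48-byte requests from a call site like
  // this, the sampled vectors almost certainly hold a single element.
  std::vector<MyType> v;
  v.push_back(MyType{});
  std::printf("size=%zu capacity=%zu allocation=%zu bytes\n", v.size(),
              v.capacity(), v.capacity() * sizeof(MyType));
  return 0;
}
```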
While proving a strict bound may be difficult, distributions from the heap or container-specific profiles can give us a sense of what is most likely to be encountered and worth optimizing for.
For identifying high-value projects, we can winnow many candidates by making best-case assumptions: we completely eliminate the bottleneck, and so on. While this is a helpful simplifying assumption, when we need more precision we often have to estimate how much we can actually move the needle.
Analogous past projects can help form a baseline for the estimates. For example, if job reshaping improved performance for a variety of workloads by 15% or more, we might apply a similar rule of thumb to our own reshaping project.
Temporarily disabling an optimization can give us a sense of how sensitive a workload is to a particular parameter. Ablating an existing optimization can be easier to implement than a prototype.
Comparing against a simpler implementation that isn’t fit for purpose can tell us how close our current implementation is to the hardware’s speed of light. This can tell us about the cost of factors like data movement that we cannot make any faster without reworking data structures or the problem altogether.
`memcpy` or `memset` can be a ready stand-in for many data processing functions, since they simply move data without performing computation. If an existing implementation is already close to these substitutes, the headroom for optimization might be small.
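For instance, a rough sketch using Google Benchmark, where the “real” transformation is a hypothetical stand-in:

```c++
#include <cstddef>
#include <cstring>
#include <vector>

#include "benchmark/benchmark.h"

// Speed-of-light stand-in: pure data movement, no computation.
static void BM_Memcpy(benchmark::State& state) {
  std::vector<char> src(state.range(0)), dst(state.range(0));
  for (auto _ : state) {
    std::memcpy(dst.data(), src.data(), src.size());
    benchmark::DoNotOptimize(dst.data());
  }
}
BENCHMARK(BM_Memcpy)->Arg(1 << 20);

// Hypothetical data-processing function whose headroom we want to size.
static void BM_Transform(benchmark::State& state) {
  std::vector<char> src(state.range(0)), dst(state.range(0));
  for (auto _ : state) {
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] = src[i] ^ 0x5f;
    benchmark::DoNotOptimize(dst.data());
  }
}
BENCHMARK(BM_Transform)->Arg(1 << 20);
```

If `BM_Transform` is already within a few percent of `BM_Memcpy`, the remaining headroom is mostly unavoidable data movement.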
Latency rules of thumb or known hardware characteristics can give us another lower bound.
We may be able to deduce a fraction based on the known properties of a library. `std::vector` typically doubles when full, so `size()/capacity()` will be at least 50%. A simple mean of the two extremes might be a good enough approximation. When profiles are available for the container, we can be even more precise.
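For example, with doubling growth the ratio lies between 50% and 100%, so a simple mean suggests roughly 75% utilization: about a quarter of the memory behind such vectors may be unused capacity.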
Combining different pieces of information may also help us estimate headroom. While job reshaping has substantial tried-and-true savings results to draw from, plotting memory usage against load can let us deduce the fixed overheads of a workload. An actual job reshape might save even more by improving cache locality, but the estimate might be sufficient to guide prioritization.
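As a purely illustrative example: if a job uses 6 GiB of memory at 1,000 QPS and 10 GiB at 2,000 QPS, a linear fit suggests roughly 2 GiB of fixed overhead, and merging two such jobs into one would save about one copy of that fixed overhead.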
We may want to refactor the code to make the opportunity clearer in our profiles. A small inline function adds no cost at runtime, but the debug information added at build time allows us to disambiguate the costs in our profile more easily.
Consider a function that does some work and may take a lock to update shared state afterwards. If we want to optimize lock contention, knowing where we spend time under that lock can help us identify the functions we want to prioritize for optimization.
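A minimal sketch of that refactoring, using Abseil’s mutex (names are illustrative):

```c++
#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

class SharedCounter {
 public:
  void Add(int delta) {
    const int prepared = PrepareOutsideLock(delta);
    absl::MutexLock lock(&mu_);
    AddLocked(prepared);  // shows up as its own frame in CPU profiles
  }

 private:
  // Work that does not need the lock.
  static int PrepareOutsideLock(int delta) { return delta * 2; }

  // Small inline function: negligible runtime cost, but its debug info lets a
  // profile attribute time spent under the lock to this specific step.
  void AddLocked(int delta) ABSL_EXCLUSIVE_LOCKS_REQUIRED(mu_) {
    total_ += delta;
  }

  absl::Mutex mu_;
  int total_ ABSL_GUARDED_BY(mu_) = 0;
};
```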
A prototype that does not handle all edge cases can refine our estimates of what a complete implementation will provide.
Microbenchmarks can provide an estimate of how much faster we can make a library. While we can make them arbitrarily complex and realistic, we are likely better off moving to other methods or even to production. We may want to ask ourselves how a result will change our decision path.
Another scenario we might encounter is that we have an implementation, but it does not support all edge cases yet. These might lead to performance regressions, or could be hard blockers due to correctness. With an implementation that covers common use cases, we can run load tests to get a sense of where an optimization works and by how much.
A partial prototype can let us test our hypothesis. If we don’t see the expected performance delta despite shortcuts, our initial estimate may have been overly optimistic. If we do see a performance benefit (and it is correct), we may be able to land it and iterate from there.
Counterfactual deployments can give us information in a lower-risk setting. Consider a situation where we can use multiple compression algorithms. Making a wholesale change to which algorithm we use in a system could be risky: We don’t know whether the new ratio would be better or what its performance characteristics are like.
By choosing an alternative algorithm on a small sample of data and discarding the actual compressed bytes, we can collect data about the performance of the new algorithm. While this comes at a runtime cost (double compression for perhaps 1-in-1000 samples), that cost is small and only temporary.
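A sketch of that counterfactual sampling (the compression hooks and stats sink are hypothetical names):

```c++
#include <cstddef>
#include <cstdlib>
#include <string>

#include "absl/time/clock.h"
#include "absl/time/time.h"

// Hypothetical hooks for the current and candidate algorithms and a stats sink.
std::string CompressWithCurrent(const std::string& data);
std::string CompressWithCandidate(const std::string& data);
void RecordCandidateStats(std::size_t uncompressed_size,
                          std::size_t candidate_size, absl::Duration cost);

std::string Compress(const std::string& data) {
  std::string out = CompressWithCurrent(data);
  // Counterfactual: on roughly 1-in-1000 requests, also run the candidate
  // algorithm, record its ratio and cost, and discard the bytes it produced.
  if (std::rand() % 1000 == 0) {
    const absl::Time start = absl::Now();
    const std::string candidate = CompressWithCandidate(data);
    RecordCandidateStats(data.size(), candidate.size(), absl::Now() - start);
  }
  return out;
}
```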
We want to avoid analysis paralysis. Carefully and precisely estimating every possible idea before commencing work could keep us busy indefinitely.
Many of these techniques are geared towards optimistic assumptions: we assume we can address the entire market. While the “worst case” might simply be that we land no improvement, we may want to scope out where those scenarios can arise as well.
Using multiple strategies for producing an estimate can give us increased confidence in it. While we need to be cautious about anchoring bias and about a catastrophic assumption common to every approach, having two (or three) distinct approaches land in the same ballpark can tell us that the estimate might be a reasonable one.
It is valuable to understand where the numbers in prior estimates came from. For example, if someone estimated out of thin air that a bottleneck could be improved by “10%”, we shouldn’t anchor on that assumption indefinitely. The estimate could be entirely correct, or entirely wrong, but we should challenge its assumptions when comparing against our own analysis.
After a project concludes, it’s useful to understand why our estimates were high or low compared to how a project actually came in. There’s no right or wrong answer here, so we don’t want to penalize ourselves for being wrong. Our goal is to learn from our misses to improve our thought processes in the future.
Judicious use of estimates can guide optimization work, allowing us to identify the most impactful projects and design decisions.