Originally posted as Fast TotW #90 on February 6, 2025
Updated 2025-02-06
Quicklink: abseil.io/fast/90
Estimating savings can be useful for improving our decision making. In this episode, we discuss how to make and use estimates in the optimization lifecycle.
While looking for and developing optimizations, we frequently use performance estimates to decide how to approach problems and where to spend time.
The outcome of better estimates is better decisions, and that goal tells us how much precision we need. If we’re considering project A that will make things 5% faster and project B that will make things 0.1% faster, we don’t need to worry about additional significant figures for project A’s estimate. A more precise estimate for project A, say 5.134%, won’t change our prioritization, all else (effort, complexity, etc.) being equal.
Profiles are our jumping-off point for sizing how large an opportunity might be. When we’re deciding between competing choices, this information can give us a high-order bit to winnow the set for further investigation.
The best-case scenario is that we completely eliminate a bottleneck: if we drove something’s cost all the way to zero, would it be a worthwhile opportunity, or are there larger ones to pursue? This can be a handy heuristic as we fill in the blanks for where an optimization might not be applicable. We can assume the best case (the optimization applies everywhere, to the fullest degree) and see whether the result is still meaningful. If it’s not, a more accurate estimate won’t change the outcome.
The most common starting point is a CPU or heap profile. We can refine this information to identify a rough upper bound for the opportunity. To illustrate this, consider a few examples:
As of February 2025, the protobuf implementation checks `the_repeated_field.size() != 0` to decide whether a repeated field is present and in need of further recursion. While there’s nothing inherently wrong with this, a message with many absent fields might be able to check their presence more efficiently with “has bit”-like auxiliary data rather than touching many cachelines.
In the very best case, we can make two assumptions. First, the added cost of toggling a bit every time we create a mutable reference to the field is free. Second, every call to `size()` internal to the protocol buffer implementation can be elided because the fields are all absent. These are gross simplifying assumptions (and not true), but they provide an upper bound.
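To make the difference concrete, here is a minimal sketch under those assumptions (not actual protobuf-generated code; `MyMessage`, its fields, and `has_bits` are hypothetical stand-ins):

```c++
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a generated message: std::vector approximates
// RepeatedField storage, and has_bits approximates "has bit"-like auxiliary
// presence data.
struct MyMessage {
  std::vector<int> foo, bar, baz;
  uint32_t has_bits = 0;  // one bit set per non-empty repeated field
};

// size()-based check: deciding that every field is absent requires touching
// each field's storage, potentially many cachelines on a wide message.
bool AnyFieldPresentBySize(const MyMessage& m) {
  return m.foo.size() != 0 || m.bar.size() != 0 || m.baz.size() != 0;
}

// "Has bit"-style check: a single load of auxiliary data rules out all of the
// absent fields at once.
bool AnyFieldPresentByBits(const MyMessage& m) { return m.has_bits != 0; }
```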
The C++ protocol buffer parsing implementation uses a series of tail-called functions for parsing each type of field. Prior to the `preserve_none` calling convention added in LLVM and adopted by protobuf, these function bodies would preserve registers that did not need preserving, adding extra `push`/`pop` instructions and code-size pressure.
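As a sketch of what adopting the convention can look like at the source level (this assumes a recent Clang; the parser function name below is hypothetical):

```c++
// Use Clang's preserve_none calling convention when available; it relieves
// the callee of preserving most registers across the call.
#if defined(__clang__) && defined(__has_attribute)
#if __has_attribute(preserve_none)
#define PARSER_CC __attribute__((preserve_none))
#endif
#endif
#ifndef PARSER_CC
#define PARSER_CC
#endif

// Hypothetical tail-called field parser: with PARSER_CC the compiler can omit
// the push/pop sequences that would otherwise save and restore registers.
PARSER_CC const char* ParseVarintField(const char* ptr, void* msg,
                                       unsigned field_number);
```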
We could use profiles to identify parsing functions whose `push`/`pop` instructions set up frames, sum their total cost, and scale that by past experience with optimizations that eliminated stack frames in other libraries to produce an estimate.
Copies and destructors consume a large fraction of compute. While we cannot necessarily eliminate all of those copies and their corresponding destructors, some can be elided by using a const reference or a move. Tools allow us to join profiles against a specific C++ pattern to estimate the potential savings.
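As a minimal sketch of the pattern (the consuming functions are hypothetical):

```c++
#include <string>
#include <utility>
#include <vector>

// Hypothetical consumers of a container.
void TakesCopy(std::vector<std::string> v);
void TakesConstRef(const std::vector<std::string>& v);

void Example(std::vector<std::string> names) {
  TakesCopy(names);             // copy constructed, then destroyed: extra work
  TakesConstRef(names);         // borrowed: no copy, no extra destructor
  TakesCopy(std::move(names));  // contents moved rather than copied
}
```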
Other PMU events and server facts can give us more information about the size and scope of a problem.
Allocation size information from TCMalloc profiles can help tell us about the size of a particular container. If all of the memory requests coming from a `std::vector<MyType>` instance are for exactly `sizeof(MyType)` bytes, the vector’s size is likely 1.
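A small sketch of why that inference works (illustrative type and sizes):

```c++
#include <cstdio>
#include <vector>

struct MyType {
  char payload[48];  // sizeof(MyType) == 48 in this sketch
};

int main() {
  // A vector holding one element allocates capacity * sizeof(MyType) bytes.
  // If a heap profile only ever shows 48-byte requests from a call site like
  // this, the sampled vectors almost certainly hold a single element.
  std::vector<MyType> v;
  v.push_back(MyType{});
  std::printf("size=%zu capacity=%zu allocation=%zu bytes\n", v.size(),
              v.capacity(), v.capacity() * sizeof(MyType));
  return 0;
}
```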
While proving a strict bound may be difficult, distributions from the heap or container-specific profiles can give us a sense of what is most likely to be encountered and worth optimizing for.
For identifying high-value projects, we can winnow many candidates by making best-case assumptions: we completely eliminate the bottleneck, and so on. While this is a helpful simplifying assumption, when we need more precision we often have to estimate how much we can actually move the needle.
Analogous past projects can help form a baseline for the estimates. For example, if job reshaping improved performance for a variety of workloads by 15% or more, we might apply a similar rule of thumb to our own reshaping project.
Temporarily disabling an optimization can give us a sense of how sensitive a workload is to a particular parameter. Ablating an existing optimization can be easier to implement than a prototype.
Comparing against a simpler implementation that isn’t fit for purpose can tell us how close our current implementation is to the hardware’s speed of light. This can tell us about the cost of factors like data movement that we cannot make any faster without reworking data structures or the problem altogether.
`memcpy` or `memset` can be a ready stand-in for many data processing functions, since they simply move data without performing computation. If an existing implementation is already close to these substitutes, the headroom for optimization might be small.
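For instance, a rough sketch using Google Benchmark, where the “real” transformation is a hypothetical stand-in:

```c++
#include <cstddef>
#include <cstring>
#include <vector>

#include "benchmark/benchmark.h"

// Speed-of-light stand-in: pure data movement, no computation.
static void BM_Memcpy(benchmark::State& state) {
  std::vector<char> src(state.range(0)), dst(state.range(0));
  for (auto _ : state) {
    std::memcpy(dst.data(), src.data(), src.size());
    benchmark::DoNotOptimize(dst.data());
  }
}
BENCHMARK(BM_Memcpy)->Arg(1 << 20);

// Hypothetical data-processing function whose headroom we want to size.
static void BM_Transform(benchmark::State& state) {
  std::vector<char> src(state.range(0)), dst(state.range(0));
  for (auto _ : state) {
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] = src[i] ^ 0x5f;
    benchmark::DoNotOptimize(dst.data());
  }
}
BENCHMARK(BM_Transform)->Arg(1 << 20);
```

If `BM_Transform` is already within a few percent of `BM_Memcpy`, the remaining headroom is mostly unavoidable data movement.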
Latency rules of thumb or known hardware characteristics can give us another lower bound.
We may be able to deduce a fraction based on the known properties of a library. `std::vector` typically doubles when full, so `size()/capacity()` will be at least 50%. A simple mean of the two extremes might be a good enough approximation. When profiles are available for the container, we can be even more precise.
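For example, with doubling growth the ratio lies between 50% and 100%, so a simple mean suggests roughly 75% utilization: about a quarter of the memory behind such vectors may be unused capacity.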
Combining different pieces of information may also help us estimate headroom. While job reshaping has substantial tried-and-true savings results to draw from, plotting memory usage against load can let us deduce the fixed overheads of a workload. An actual job reshape might save even more by improving cache locality, but the estimate might be sufficient to guide prioritization.
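As a purely illustrative example: if a job uses 6 GiB of memory at 1,000 QPS and 10 GiB at 2,000 QPS, a linear fit suggests roughly 2 GiB of fixed overhead, and merging two such jobs into one would save about one copy of that fixed overhead.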
We may want to refactor the code to make the opportunity clearer in our profiles. A small inline function adds no cost at runtime, but the debug information added at build time allows us to disambiguate the costs in our profile more easily.
Consider a function that does some work and may take a lock to update shared state afterwards. If we want to optimize lock contention, knowing where we spend time under that lock can help us identify the functions we want to prioritize for optimization.
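A minimal sketch of that refactoring, using Abseil’s mutex (names are illustrative):

```c++
#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

class SharedCounter {
 public:
  void Add(int delta) {
    const int prepared = PrepareOutsideLock(delta);
    absl::MutexLock lock(&mu_);
    AddLocked(prepared);  // shows up as its own frame in CPU profiles
  }

 private:
  // Work that does not need the lock.
  static int PrepareOutsideLock(int delta) { return delta * 2; }

  // Small inline function: negligible runtime cost, but its debug info lets a
  // profile attribute time spent under the lock to this specific step.
  void AddLocked(int delta) ABSL_EXCLUSIVE_LOCKS_REQUIRED(mu_) {
    total_ += delta;
  }

  absl::Mutex mu_;
  int total_ ABSL_GUARDED_BY(mu_) = 0;
};
```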
A prototype that does not handle all edge cases can refine our estimates of what a complete implementation will provide.
Microbenchmarks can provide an estimate of how much faster we can make a library. While we can make them arbitrarily complex and realistic, we are likely better off moving to other methods or even to production. We may want to ask ourselves how a result will change our decision path.
Another scenario we might encounter is that we have an implementation, but it does not support all edge cases yet. These might lead to performance regressions, or could be hard blockers due to correctness. With an implementation that covers common use cases, we can run load tests to get a sense of where an optimization works and by how much.
A partial prototype can let us test our hypothesis. If we don’t see the expected performance delta despite shortcuts, our initial estimate may have been overly optimistic. If we do see a performance benefit (and it is correct), we may be able to land it and iterate from there.
Counterfactual deployments can give us information in a lower-risk setting. Consider a situation where we can use multiple compression algorithms. Making a wholesale change to which algorithm we use in a system could be risky: We don’t know whether the new ratio would be better or what its performance characteristics are like.
By choosing an alternative algorithm on a small sample of data and discarding the actual compressed bytes, we can collect data about the performance of the new algorithm. While this comes at a runtime cost (double compression for perhaps 1-in-1000 samples), that cost is small and only temporary.
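A sketch of that counterfactual sampling (the compression hooks and stats sink are hypothetical names):

```c++
#include <cstddef>
#include <cstdlib>
#include <string>

#include "absl/time/clock.h"
#include "absl/time/time.h"

// Hypothetical hooks for the current and candidate algorithms and a stats sink.
std::string CompressWithCurrent(const std::string& data);
std::string CompressWithCandidate(const std::string& data);
void RecordCandidateStats(std::size_t uncompressed_size,
                          std::size_t candidate_size, absl::Duration cost);

std::string Compress(const std::string& data) {
  std::string out = CompressWithCurrent(data);
  // Counterfactual: on roughly 1-in-1000 requests, also run the candidate
  // algorithm, record its ratio and cost, and discard the bytes it produced.
  if (std::rand() % 1000 == 0) {
    const absl::Time start = absl::Now();
    const std::string candidate = CompressWithCandidate(data);
    RecordCandidateStats(data.size(), candidate.size(), absl::Now() - start);
  }
  return out;
}
```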
We want to avoid analysis paralysis. Carefully and precisely estimating every possible idea before commencing work could keep us busy indefinitely.
Many of these techniques are geared towards optimistic assumptions: we assume we can address the entire market. While the “worst case” might simply be that we land no improvement, we may want to scope out where those scenarios can arise as well.
Using multiple strategies for producing an estimate can give us increased confidence in it. While we need to be cautious about anchoring bias and about a catastrophic assumption common to every approach, having two (or three) distinct approaches land in the same ballpark can tell us that the estimate might be a reasonable one.
It is valuable to understand where the numbers in prior estimates came from. For example, if someone estimated out of thin air that a bottleneck could be improved by “10%”, we shouldn’t anchor on that assumption indefinitely. The estimate could be entirely correct, or entirely wrong, but we should challenge its assumptions when comparing against our own analysis.
After a project concludes, it’s useful to understand why our estimates were high or low compared to how a project actually came in. There’s no right or wrong answer here, so we don’t want to penalize ourselves for being wrong. Our goal is to learn from our misses to improve our thought processes in the future.
Judicious use of estimates can guide optimization work, allowing us to identify the most impactful projects and design decisions.