Performance Tip of the Week #72: Optimizing optimization

Originally posted as Fast TotW #72 on August 7, 2023

By Chris Kennelly

Updated 2024-08-15

Quicklink: abseil.io/fast/72

Finding optimizations is increasingly crucial as Moore’s Law ends and we can no longer expect a free lunch from continued hardware innovation. In this episode, we discuss the process of finding and developing optimizations to work on.

Planning

“Plans are worthless, but planning is everything”

Just as efficiency projects are trying to maximize the productivity of our hardware, the projects themselves can be optimized by effective planning. Project selection makes a huge difference in outcomes and impact.

Realized impact of a project is the product of several factors. When we are first starting to think about an optimization, we estimate all of these factors. While it’s rewarding for our estimates to be correct, the primary goal is to have just enough information to make a better decision and set priorities–to preferentially work on optimization “A” over optimization “B” because “A” has a larger expected ROI. Oftentimes, we only need a single significant figure to do so: Spending more time making a better estimate does not make things more efficient by itself. When new information arrives, we can update our priors accordingly.

Once we have identified an area to work in, we can shift to thinking about ways of tackling problems in that area.

What is the largest potential win?

For example, a function that consumes a single CPU core out of an entire warehouse-scale computer offers limited headroom: Even if we drive its cost to zero, the impact is low. In contrast, the opportunity cost of not working on more important areas is high.

How much can we do about it?

It is not enough to observe “X is expensive”: We need to be able to change X too.

“Speed of light” analysis might provide a helpful upper bound here. If an operation needs to read data from memory, process it, and write it out, it’s hard for it to be substantially faster than memcpy over the same bytes. Performance does not need to ultimately converge with this rough analogue, but if we’re already close to the bound, it may be hard to meaningfully affect the performance of X.
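As a rough illustration of this kind of analysis, a microbenchmark can put the operation side by side with memcpy over the same number of bytes. Below is a minimal sketch using Google Benchmark, with a hypothetical Transform() standing in for the operation under study:

  #include <cstddef>
  #include <cstring>
  #include <vector>

  #include "benchmark/benchmark.h"

  // Hypothetical operation under study: reads n bytes, processes them, and
  // writes n bytes out. The body here is only a placeholder.
  void Transform(const char* src, char* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] ^ 0x5f;
  }

  // The "speed of light": simply moving the bytes.
  static void BM_Memcpy(benchmark::State& state) {
    const std::size_t n = state.range(0);
    std::vector<char> src(n, 'x'), dst(n);
    for (auto _ : state) {
      std::memcpy(dst.data(), src.data(), n);
      benchmark::DoNotOptimize(dst.data());
    }
    state.SetBytesProcessed(state.iterations() * n);
  }
  BENCHMARK(BM_Memcpy)->Range(1 << 10, 1 << 20);

  static void BM_Transform(benchmark::State& state) {
    const std::size_t n = state.range(0);
    std::vector<char> src(n, 'x'), dst(n);
    for (auto _ : state) {
      Transform(src.data(), dst.data(), n);
      benchmark::DoNotOptimize(dst.data());
    }
    state.SetBytesProcessed(state.iterations() * n);
  }
  BENCHMARK(BM_Transform)->Range(1 << 10, 1 << 20);

  BENCHMARK_MAIN();

If the operation’s reported bytes per second are already within a small factor of the memcpy baseline, there is little room left at this level.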

An operation may already be as simplified as it can be–for example, it is hard to speed up adding two integers. We may need to look further up the stack–add less often–to realize performance gains.

Can we measure what changed, and if so, how?

An unmeasured optimization is a tree that fell in the woods with no one around.

For example, some changes have highly non-local effects that make accurate measurement more challenging. TCMalloc issues a prefetch that shows up as extra cost in TCMalloc itself but improves topline application performance. In a distributed system, changing the requests sent by a client to a server can dramatically impact the server’s costs without necessarily affecting the metrics the client sees.
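To make the TCMalloc example concrete, the sketch below (not TCMalloc’s actual code) shows an allocator-style fast path that prefetches the next free object while returning the current one. In a CPU profile the prefetch is charged to the allocator, yet its benefit shows up later, in whichever caller receives the next allocation.

  struct FreeListNode {
    FreeListNode* next;
  };

  // Pops the head of a singly linked freelist; assumes the list is non-empty.
  void* PopWithPrefetch(FreeListNode** head) {
    FreeListNode* result = *head;
    *head = result->next;
    // This instruction makes the allocator look slower, but the warm cache
    // line pays off at the *next* allocation, in application code.
    __builtin_prefetch(*head);
    return result;
  }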

What is our likelihood of success?

Changing a minor implementation detail can be straightforward, or we can uncover Hyrum’s Law dependencies on existing quirks. A brand new API may be easy to implement, but challenging to grow adoption for. If others have looked at the same problem before, things might be different this time, but we may want to temper our optimism.

How long will it take?

This estimate should include preliminary analysis, implementation, code review, debugging, rollout, measurement, and a period of refinement after launch.

We want the resource savings to be several multiples of our time investment, both to ensure a positive return on investment and to leave a margin for errors in our estimates.
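Putting these factors together, a back-of-the-envelope comparison is usually enough to rank candidate projects. The sketch below uses purely hypothetical numbers and factor names; nothing here comes from a real project.

  #include <cstdio>

  // Expected return: the product of the factors discussed above, divided by
  // the estimated time investment. All of the numbers below are made up.
  struct Candidate {
    const char* name;
    double fleet_fraction;        // Share of fleet cost in the affected area.
    double achievable_reduction;  // How much of that cost we might remove.
    double success_probability;   // Likelihood the project lands.
    double engineer_quarters;     // Estimated time investment.
  };

  double ExpectedReturn(const Candidate& c) {
    return c.fleet_fraction * c.achievable_reduction * c.success_probability /
           c.engineer_quarters;
  }

  int main() {
    const Candidate a = {"A", 0.02, 0.30, 0.5, 2.0};
    const Candidate b = {"B", 0.001, 0.80, 0.9, 0.25};
    // A single significant figure is enough to set priorities.
    std::printf("%s: %.1g\n%s: %.1g\n", a.name, ExpectedReturn(a),
                b.name, ExpectedReturn(b));
  }

Refining these inputs further only pays off if the extra fidelity would change the ranking.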

Standing on the shoulders of giants

“Great artists steal”

Frequently, we can find opportunities by growing adoption or enhancing existing optimizations.

  • Historically, “hugepage text” was used to reduce iTLB misses for only a portion of servers in the fleet. For the remainder, realizing an improvement in application productivity was a single flag-flip away. A lot of factors that go into planning were already de-risked: We knew it would appreciably move the needle for a large portion of the fleet if enabled, leaving us to focus on how to get there.

  • AutoFDO uses production profiles to provide feedback to the compiler to optimize program performance. Once the initial infrastructure was built, we were able to get further savings with a multi-pronged approach: increasing adoption, improving profile quality, and building further optimizations like FS-AFDO, cmov-vs-branch, and function splitting on top of it.

Borrowing ideas from another area, even seemingly simple ones, can go a long way. “Don’t do extra work” sounds obvious, but bringing fresh eyes to a problem and cross-pollinating ideas can be very effective.

Once we have some ideas to try out, we want to shift towards evaluating them systematically.

Optimization lifecycle

“If it disagrees with experiment, it’s wrong. In that simple statement is the key to science.”

De-risking an idea

In our initial planning, we made a few assumptions to estimate the probable return on investment from a project. Our first priority is to de-risk those assumptions by prototyping our change and evaluating it.

  • Benchmark to improve estimates. Benchmarks let us iteratively refine our estimates, with more effort yielding higher fidelity.
  • Evaluate impact on secondary metrics. Even if we can’t see the end-to-end, fleetwide impact immediately, proxies–PMU events, profiles, or even individual applications–can tell us if we’re on the right track.
  • Cut every corner. Making simplifying assumptions or bypassing edge cases can get us data sooner. Even if we’re not yet in an optimal place, we can confirm that our hunch about an opportunity was valid. This exercise also helps distinguish the project dependencies that are truly critical from those that are merely nice to have.
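One way to cut corners while still getting representative data is to hide the simplified prototype behind an experiment flag, so a loadtest or a handful of canary jobs can exercise it before every edge case is handled. Below is a minimal sketch using Abseil flags; the flag name and the surrounding functions are hypothetical.

  #include <string>

  #include "absl/flags/flag.h"

  // Hypothetical experiment flag: off by default, flipped on for a loadtest
  // or a few canary jobs to gather proxy metrics early.
  ABSL_FLAG(bool, skip_redundant_revalidation, false,
            "Prototype: skip revalidating input that was already validated.");

  bool Validate(const std::string& payload) {
    return !payload.empty();  // Placeholder validation for the sketch.
  }

  std::string Canonicalize(const std::string& payload) {
    return payload;  // Placeholder transformation for the sketch.
  }

  bool HandleRequest(const std::string& payload) {
    if (!Validate(payload)) return false;
    const std::string canonical = Canonicalize(payload);
    // Hunch: this second validation is redundant for canonicalized input.
    // The prototype skips it so a loadtest can show what it actually costs.
    if (!absl::GetFlag(FLAGS_skip_redundant_revalidation) &&
        !Validate(canonical)) {
      return false;
    }
    // ... process `canonical` ...
    return true;
  }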

If we try out an idea and cannot move the metrics we expected to, our hypothesis might be wrong. We might realize that something couldn’t be optimized after all, or it wasn’t on the critical path as expected. These explanations can let us tweak our plans, scrub things altogether, or inform future areas of investigation.

Once we’ve demonstrated that we have a viable optimization in-hand, we can focus on the steps towards landing it.

Building the idea out

Test failures can point us to edge cases that constrain our implementation. This may require tweaking our implementation to retain existing behaviors, or cleaning up usage that is no longer permitted. The choice is largely a judgment call: A simpler implementation will be easier to optimize further in the long run, but a herculean cleanup may prevent us from landing anything at all. This may also be a good time to insert randomization and other defenses to make future optimizations in this space smoother.

With benchmarks, we can fine-tune our idea to try out different parameters and make sure that it improves performance in a variety of situations.

  • We need not find the precisely optimal parameters, since we’re liable to simply overfit to our model or loadtest. Optimality might be an illusion.
  • Making the initial change to a library is often the hardest part, since it is only when we move forward that we uncover the unknown unknowns. Optimizing every last cycle or byte of RAM doesn’t matter until we actually overcome the hurdles of turning the change on.

    For example, moving from GoodFastHash and std::hash to absl::Hash improved performance, but required taming Hyrum’s Law by introducing a randomized hash algorithm and identifying brittle tests. Once those issues were surmounted, we were able to iterate from there, replacing absl::Hash’s algorithm with a different, more efficient one.
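As an illustration of the Hyrum’s Law hazards involved, absl::Hash deliberately varies its output across program invocations, so tests that assert specific hash values or depend on hash-container iteration order break as soon as the algorithm changes. A small sketch of the brittle patterns and an order-insensitive alternative:

  #include <algorithm>
  #include <cstddef>
  #include <string>
  #include <vector>

  #include "absl/container/flat_hash_set.h"
  #include "absl/hash/hash.h"

  void IllustrateBrittleness() {
    // Brittle: hash values are not stable across processes or releases, so a
    // test asserting a specific value pins the algorithm forever.
    const std::size_t h = absl::Hash<std::string>{}(std::string("request-key"));
    // EXPECT_EQ(h, 0x1234abcdu);  // Don't do this.
    static_cast<void>(h);

    // Also brittle: relying on the iteration order of a hash container, which
    // changes whenever the hash function does.
    const absl::flat_hash_set<std::string> keys = {"a", "b", "c"};
    std::vector<std::string> observed(keys.begin(), keys.end());
    std::sort(observed.begin(), observed.end());  // Compare order-insensitively.
  }

Identifying and fixing these patterns up front is what made it possible to swap in a more efficient hashing algorithm later.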

Once we have something that works well and has sufficient polish, we can move forward with it.

Launching

Production is ultimately what matters and our goal is to bring the optimization there for evaluation. Launching and iterating on an optimization helps in several ways.

  • It delivers value sooner. The area under the curve of efficiency over time is larger with small, successive landings than if we tried to realize the “better” optimization all at once but took much longer to deliver it. We doubled the impact of TCMalloc’s “Temeraire” optimization in the following years, but launching early provided benefits years sooner than if we had waited.
  • Things may perform differently, for better or worse, than we expected. This can inform where to look next for further ideas, allowing us to stand on our own shoulders, so to speak.

    For an optimization enabled broadly, opt-outs can tell us which edge cases have proven to be truly important and need more attention, leading to further optimization insights. For example, a single, long-standing opt-out from Temeraire informed two distinct optimizations.

    Success in one area brings opportunities for cross-pollination. We can take the same solution, an algorithm or data structure, and apply the idea to a related but different area. Without the original landing, though, we might have never realized this.

  • Circumstances are continually changing. The assumptions that started a project years ago may be invalid by the time the project is ready. Intermediate milestones allow us to validate assumptions along the way.

  • We may have bypassed non-critical dependencies for the sake of expediency. When those dependencies are ready, we can revisit our choices to enhance our optimization. For example, a library might have required extensive refactoring to fully enable a particular optimization, so we opted to get most of the benefit upfront and realize the rest later when the refactoring was completed.

Landing

Once we’ve launched an optimization, we want to land it, and for that we must measure it. We are chiefly interested in our primary metric, for example, seeing queries-per-CPU go up. Our secondary, proxy metrics–CPU time in a particular function, PMU events like cache or TLB misses–help confirm and support the claim that the optimization had the intended effect. This distinguishes the outcome from mere coincidence–did we actually speed something up, or did another change somewhere else happen to make the application faster? While we relied on estimates to guide our choice of where to work, we don’t want those expectations to bias our measurements: The actual data might be better (or worse) than we anticipated, and reality is what matters.

Summary

Project selection and project execution can greatly impact optimization outcomes. By judiciously selecting the areas where we dig and adopting a strategy of launching and iterating, we can unlock significant savings.

