Performance Tip of the Week #79: Make at most one tradeoff at a time

Originally posted as Fast TotW #79 on January 19, 2024

By Chris Kennelly and Matt Kulukundis

Updated 2024-06-07

Quicklink: abseil.io/fast/79

Developing and enabling optimizations can often involve tradeoffs: using more RAM and less CPU, choosing which problems to solve right away and which to defer, and so on. In this episode, we discuss examples and strategies for breaking down projects into smaller steps to increase velocity and maximize area under the curve.

Step-by-step migrations: SwissTable

Hash tables have many different implicit and explicit properties that affect their contracts and behaviors. In designing SwissTables and planning for their associated migrations, we made careful choices to defer, avoid, or consciously embrace a great number of tradeoffs when modifying implementation contracts.

The SwissTable migration primarily focused on iteration order, deferring several other valuable changes. In the SwissMap implementation, we introduced code that deliberately randomized hashtable iteration order so that future improvements would not have to deal with brittle tests relying on iteration order again.
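
A minimal sketch of what that perturbation looks like from a caller’s point of view; the container is real, but the snippet is purely illustrative:

```cpp
#include <iostream>
#include <string>

#include "absl/container/flat_hash_map.h"

int main() {
  absl::flat_hash_map<std::string, int> counts = {{"a", 1}, {"b", 2}, {"c", 3}};
  // Iteration order is deliberately non-deterministic, so any test that
  // golden-files this output breaks quickly and visibly instead of silently
  // depending on an unspecified order.
  for (const auto& [word, count] : counts) {
    std::cout << word << "=" << count << "\n";
  }
}
```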

Because of the randomization we had placed into both SwissTables and absl::Hash, we were able to make multiple optimizations to the underlying structure of both of these, changing the behavior for small sizes, adjusting the way windowing works, and completely replacing the core hash function. All of these optimizations were made after launch, because every step in the process was a net positive.

TIP: When fixing Hyrum’s Law issues, look for ways to deliberately perturb the behavior so future changes are easier.

Pointer stability of entries is extremely visible in a Hyrum’s Law sense. During the SwissTable migration we deliberately deferred this choice by providing a pointer-stable variant, absl::node_hash_map, that we migrated users to directly. We published guidance encouraging users to migrate themselves from absl::node_hash_map to absl::flat_hash_map, but focused our own efforts on getting people to take the first step. Users found the second step, migrating to absl::flat_hash_map, significantly easier because of the randomization we already had in place for SwissTable.
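
The contract difference that made the two-step migration worthwhile can be sketched as follows (illustrative code only):

```cpp
#include <string>

#include "absl/container/flat_hash_map.h"
#include "absl/container/node_hash_map.h"

void PointerStabilityContrast() {
  absl::node_hash_map<int, std::string> node_map = {{1, "one"}};
  std::string* stable = &node_map[1];
  node_map.rehash(1024);
  // Nodes are allocated separately, so `stable` still points at the value.

  absl::flat_hash_map<int, std::string> flat_map = {{1, "one"}};
  std::string* unstable = &flat_map[1];
  flat_map.rehash(1024);
  // Values live inline in the table; rehashing can move them, so
  // `unstable` must be assumed to dangle.
  (void)stable;
  (void)unstable;
}
```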

TIP: Stable intermediate states can form good stopping points for a migration to allow progress to be made without jumping all the way to the final state. Even when you cannot move people to the final state immediately, be clear in your written guidance what the best case should be.

When we started the SwissTable migration, we knew that we would likely want to release it in Abseil and that we wanted to build a new hashing framework (now absl::Hash) for it. But those steps weren’t ready. Rather than delay the launch until we had all the parts in place, we began the SwissTable migration without a custom hashing framework, and later migrated the spelling from a different namespace to its final home in absl.

TIP: Shipping early allows you to capture wins earlier and get important feedback from users.

Once absl::Hash was ready, we made sure it had built-in randomization and switched SwissTable’s default hasher to it. Because it worked by default and made the well-lit path easier for customers, its adoption went incredibly smoothly.
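
For example, a type only needs an AbslHashValue overload to participate; nothing at the call site names a hasher, which is what made flipping the default so smooth. A minimal sketch:

```cpp
#include <string>
#include <utility>

#include "absl/container/flat_hash_map.h"

struct Point {
  int x = 0;
  int y = 0;

  friend bool operator==(const Point& a, const Point& b) {
    return a.x == b.x && a.y == b.y;
  }

  // Hooks Point into the absl::Hash framework.
  template <typename H>
  friend H AbslHashValue(H h, const Point& p) {
    return H::combine(std::move(h), p.x, p.y);
  }
};

// The default template arguments pick up absl::Hash automatically, so user
// code like this never had to change when the default hasher was switched.
absl::flat_hash_map<Point, std::string> labels;
```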

TIP: Prefer switching defaults to migrating code if you can.

When we introduced hashtable profiling for monitoring tables fleet-wide, some users were surprised that tables could be sampled (triggering additional system calls). If we had tried to ship sampled monitoring from the start, the migration would have had a whole new class of issues to debug. Keeping it separate also gave us a very clear opt-out for this specific feature without delaying the entire rollout. Additionally, the folks doing the migrations didn’t have to debug as many distinct types of failures, so each launch could be handled much faster!

TIP: Separate changes into distinct launches to isolate debugging to a distinct class of issues at a time.

Iterative improvement: Deploying TCMalloc’s CPU caches

When TCMalloc was first introduced, it used per-thread caches, hence its name, “Thread-Caching Malloc.” As thread counts continued to increase, per-thread caches suffered from two growing problems: the per-process cache budget was divided over more and more threads, making each cache smaller on average, and more idle threads (threads >> cores) meant more RAM was effectively inaccessible.

Initially, Andrew Hunter added support for per-CPU caches. Rather than cache memory for each thread, the implementation had a cache for each physical core on the machine. If a thread was descheduled, another thread would be able to reuse that same cache. Over time, organic adoption brought per-CPU cache usage to roughly half of the fleet’s CPU usage and memory allocations.
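
A heavily simplified sketch of the idea, not TCMalloc’s actual implementation (the real code uses restartable sequences so that the CPU lookup and the list operation cannot be torn apart by a migration):

```cpp
#include <sched.h>

#include <vector>

// Simplified per-CPU free list: threads running on the same CPU share one
// cache, so a descheduled thread no longer strands memory in a private
// per-thread cache.
class PerCpuCacheSketch {
 public:
  explicit PerCpuCacheSketch(int num_cpus) : caches_(num_cpus) {}

  void* Pop() {
    std::vector<void*>& cache = caches_[sched_getcpu()];
    if (cache.empty()) return nullptr;  // Fall back to the central free list.
    void* object = cache.back();
    cache.pop_back();
    return object;
  }

  void Push(void* object) { caches_[sched_getcpu()].push_back(object); }

 private:
  std::vector<std::vector<void*>> caches_;
};
```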

After extensive early adoption, TCMalloc’s default changed: unless otherwise requested, per-CPU caches were used. Due to the extra metadata required by per-CPU caches, this traded RAM for CPU. Rather than wait until this cost could be eliminated entirely, we made the tradeoff deliberately. Later optimizations did minimize the RAM overhead of per-CPU caches, but they would not materialize for several years, so this strategy allowed us to realize incremental benefits years earlier.

TIP: Identify ways to iteratively land improvements. This allows optimizations to be deployed when ready, without the R&D of implementing all anticipated optimizations upfront. This can help maximize the savings area-under-curve.

As years went by, though, core counts for the typical server increased dramatically. Since the per-CPU cache uses an array of caches indexed by physical CPU ID, more metadata had to be allocated even though the number of cores used by a typical application had not grown commensurately. Additionally, since a job configured to use 16 cores might move around across a socket with 128 cores, we could populate caches on each of those cores even though the application might not actively run on them. These observations motivated development of several optimizations. TCMalloc includes extensive telemetry that let us calculate the amount of memory being used for per-vCPU caches, which provided estimates of both the potential opportunity (to motivate the work) and the final impact (to recognize the benefit).
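
A back-of-the-envelope illustration of the scaling problem; the cache budget here is an assumed number for the example, not TCMalloc’s actual limit:

```cpp
#include <cstdio>

int main() {
  const int kCpusOnMachine = 128;       // CPUs the scheduler can place us on.
  const int kCpusUsedByJob = 16;        // CPUs the job actually uses.
  const double kPerCpuCacheMiB = 0.25;  // Assumed per-CPU cache budget.

  // Because the cache array is indexed by CPU ID, the worst case scales with
  // the machine (32 MiB here), not with the job (4 MiB here).
  std::printf("machine-sized footprint: %.0f MiB, job-sized need: %.0f MiB\n",
              kCpusOnMachine * kPerCpuCacheMiB,
              kCpusUsedByJob * kPerCpuCacheMiB);
  return 0;
}
```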

TIP: Tracking metrics that we intend to optimize later, even if not right away, can help identify when an idea is worth pursuing and prioritizing. By monitoring metadata memory usage and the number of active caches, we were able to identify when the problem had grown to be worth solving compared to other opportunities.

Decoupled rollouts: Limoncello

Experiments to switch off hardware prefetchers under high system memory bandwidth usage showed significant performance improvements for the fleet. Analysis of the data showed broad improvements for most workloads, but a handful saw regressions.

By exploring the data at a per-function granularity, we were able to recognize the functions that most significantly regressed in A/B testing. Many of these regressions were in core libraries with streaming access patterns. By adding software prefetches to them, we were able to recover their performance when HW prefetchers were disabled.

TIP: A regression in one dimension or slice of data can sometimes become the kernel of an idea for a new opportunity.

While these prefetches primarily recover performance when the hardware stream prefetchers are turned off, they also help when the hardware prefetchers are turned on. Hardware prefetchers require a warmup period to learn an access pattern and, unlike higher-level C++ code, have no knowledge of when to stop prefetching when streams are short. Software prefetches avoid the warmup period entirely and maintain high precision. While turning off the hardware prefetchers had illuminated the opportunity, that lack of a warmup period and that precision meant the software prefetches were deployable right away, regardless of whether the hardware prefetchers were on or off.
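
A sketch of what such a prefetch looks like in a streaming loop (illustrative; __builtin_prefetch is the GCC/Clang builtin, and the prefetch distance is an assumed value to tune per platform):

```cpp
#include <cstddef>
#include <cstdint>

// Streaming accumulation with explicit software prefetches. Unlike a
// hardware stream prefetcher, the code knows the bounds: it needs no warmup
// and stops prefetching exactly at the end of the range.
uint64_t SumWithPrefetch(const uint8_t* data, size_t n) {
  constexpr size_t kPrefetchDistance = 256;  // Assumed; tune per platform.
  uint64_t sum = 0;
  for (size_t i = 0; i < n; ++i) {
    if (i + kPrefetchDistance < n) {
      __builtin_prefetch(data + i + kPrefetchDistance, /*rw=*/0, /*locality=*/0);
    }
    sum += data[i];
  }
  return sum;
}
```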

Once several prefetches were in place, the project was reevaluated with several teams that had previously seen regressions. Thanks to the SW prefetches, these gaps had narrowed dramatically. Since these were located in a handful of carefully studied and frequently optimized core libraries, maintaining the prefetches is feasible, even over successive hardware generations.

TIP: Rolling out changes independently from one another avoids unneeded couplings and simplifies rollout strategies.

Closing words

Structuring how we identify and roll out optimizations can help us move faster. By keeping our tradeoffs as simple and straightforward as possible, or better yet, avoiding tradeoffs altogether, we can iteratively improve performance without hitting stumbling blocks. Exporting metrics along the way allows us to identify the biggest opportunities as workloads and hardware evolve.

