Performance Tip of the Week #39: Beware microbenchmarks bearing gifts

Originally posted as Fast TotW #39 on January 22, 2021

By Chris Kennelly and Alkis Evlogimenos

Updated 2023-10-10

Quicklink: abseil.io/fast/39

Benchmarks are only a tool for debugging efficiency: Production is ultimately what matters. Benchmarks analyze the performance of code under the specific circumstances created and maintained by the benchmark. They cannot perfectly predict the performance of code in the real world. In this episode, we discuss some of the pitfalls of microbenchmarks and mitigation strategies.

For example, we can use the following series of benchmarks for evaluating changes to search query performance:

  • Posting List iteration benchmarks
  • Single task loadtests
  • Cluster loadtests of many tasks
  • Production

Each benchmark in the series is more complicated and more accurate than the one before it, but takes more time to run. It is not uncommon for the first and last to disagree. It is less common for single task loadtests to disagree with cluster loadtests, but it happens. The reason to have a hierarchy of benchmarks is to optimize the iterative process of developing performance optimizations. If we can observe and iterate faster (a shorter “OODA loop”), we explore more ideas more quickly.

There is a parallel here to testing, which starts from narrow scope (unit tests), moves to increasing scope (integration and system tests), and ends at production itself. Like correctness testing, performance testing is complicated. Unfortunately, performance testing is less predictive than correctness testing, especially at the micro scale. Surrounding code or competing processes might contend for network resources, memory bandwidth, CPU instruction decode, the branch predictor, processor caches, flash I/O, or mutexes. If, for example, we have medium confidence that a passing unit test suggests correctness in production, then we should have relatively low confidence that benchmark results will materialize the same way in production.

Where benchmarks mislead

Benchmark results can mislead when extrapolated to full applications. The following are some examples of situations where results that hold in a benchmark do not hold at application scale.

memcmp bloat

Google’s software tends to be instruction footprint-heavy, challenging our ability to effectively cache it, which leads to frontend stalls on our processors. The functions provided by libc are no exception.

In “AsmDb: Front-End Stalls in WSC” (Figure 13 / Section 4.4), the glibc implementation for memcmp was ~6KB, leading to icache misses of its own and evictions of other functions. The implementation for glibc is in hand-written assembly, using a variety of techniques based on comparison size to obtain “good” performance. 99% of cycles in the function span 41 cache lines, or about 2.6KB of cache capacity.

Individually, these small micro-optimizations were entirely justifiable with microbenchmarks. We saw this in greater detail while developing a Google workload-tuned memcpy implementation: “Faster” implementations on microbenchmarks had larger code footprints and produced worse macrobenchmark results than seemingly slower implementations that had smaller code footprints.

This is undoubtedly a place where hardware acceleration can shine to deliver its best possible performance consistently, rather than having programmers optimize their implementations for each generation (only to make things worse in real workloads).

Arithmetic versus load

Consider a simple masking function, to get the lower bits of a value:

// Keep only the low `bits` bits of `value`: shift, subtract, then mask.
uint32_t getLowerBits(uint32_t value, int bits) {
  return value & ((uint64_t{1} << bits) - 1);
}

This requires several instructions and processor uops to shift, subtract, and finally mask. This might motivate loading the mask from a precomputed array instead.

// Precomputed mask table, replacing the shift-and-subtract with a load
// (full table elided).
constexpr uint32_t kMasks[] = {0x0, 0x1, 0x3, 0x7, 0xf, ...};

uint32_t getLowerBits(uint32_t value, int bits) {
  return value & kMasks[bits];
}

This might appear to be a profitable optimization in microbenchmarks: kMasks stays consistently cached in L1. Modeling cache behavior in microbenchmarks is challenging: microbenchmarks tend to have small working sets that remain cache resident. Real code, particularly Google C++, does not.
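
As an illustration of how such a microbenchmark can flatter the table version, here is a minimal sketch assuming the Google Benchmark framework; the benchmark names and loop structure are ours, not from the original analysis:

#include <array>
#include <cstdint>

#include "benchmark/benchmark.h"

// Assumes linking against Google Benchmark's provided main (benchmark_main).

void BM_MaskArithmetic(benchmark::State& state) {
  uint32_t v = 1;
  for (auto _ : state) {
    // Shift, subtract, mask: no memory access beyond the instruction stream.
    benchmark::DoNotOptimize(v & ((uint64_t{1} << (v % 33)) - 1));
    ++v;
  }
}
BENCHMARK(BM_MaskArithmetic);

void BM_MaskTable(benchmark::State& state) {
  // Build the mask table locally so the sketch is self-contained.
  std::array<uint32_t, 33> masks;
  for (int i = 0; i < 33; ++i) {
    masks[i] = static_cast<uint32_t>((uint64_t{1} << i) - 1);
  }
  uint32_t v = 1;
  for (auto _ : state) {
    // The table is touched on every iteration, so it stays in L1 and the
    // load looks nearly free; production code may not be so lucky.
    benchmark::DoNotOptimize(v & masks[v % 33]);
    ++v;
  }
}
BENCHMARK(BM_MaskTable);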

In production, the cacheline holding kMasks might be evicted, leading to much worse stalls (hundreds of cycles to access main memory). Additionally, on x86 processors since Haswell, this optimization can be past its prime: BMI2’s bzhi instruction is both faster than loading and masking and delivers more consistent performance.
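
On such processors the masking can compile down to essentially a single instruction. A minimal sketch, assuming a BMI2-capable x86 target built with a flag such as -mbmi2 and using the _bzhi_u32 intrinsic; getLowerBitsBzhi is our name for the variant:

#include <cstdint>
#include <immintrin.h>  // _bzhi_u32; requires BMI2 (Haswell or later)

uint32_t getLowerBitsBzhi(uint32_t value, int bits) {
  // Zeroes the bits of `value` at position `bits` and above with a single
  // bzhi instruction: no shift/subtract sequence and no table load.
  return _bzhi_u32(value, static_cast<unsigned int>(bits));
}

With BMI2 enabled, compilers may also recognize the original arithmetic form and emit bzhi for it directly, so inspecting the generated code is worthwhile before reaching for intrinsics.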

When developing benchmarks for SwissMap, individual operations were benchmarked in two ways: always triggering a cache hit and always triggering a cache miss. The latter can be achieved by having enough hashtables that their combined working set will not fit in cache, then picking a hashtable to look up at random on every iteration of the benchmark.

Having explicit benchmarks for the boundary conditions (always-cache-hit and always-cache-miss) gives more insight into how changes to the code affect its operations under different conditions, and helps develop intuition about which operations are important to optimize to reach our goal: improving production performance.
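
A minimal sketch of the always-cache-miss variant, assuming the Google Benchmark framework and absl::flat_hash_map; the table counts and sizes here are placeholders, not the values used for SwissMap:

#include <cstdint>
#include <random>
#include <vector>

#include "absl/container/flat_hash_map.h"
#include "benchmark/benchmark.h"

void BM_FindCacheMiss(benchmark::State& state) {
  // Build enough tables that their combined working set exceeds the cache.
  constexpr int kNumTables = 1024;        // placeholder size
  constexpr int kEntriesPerTable = 1024;  // placeholder size
  std::vector<absl::flat_hash_map<int64_t, int64_t>> tables(kNumTables);
  for (auto& table : tables) {
    for (int64_t i = 0; i < kEntriesPerTable; ++i) table[i] = i;
  }
  // Pre-generate the random table choices so the generator's cost stays
  // out of the timed loop.
  std::mt19937 rng(42);
  std::vector<int> choices(1 << 16);
  for (int& c : choices) c = rng() % kNumTables;
  size_t i = 0;
  for (auto _ : state) {
    // A different, randomly chosen table on each iteration makes a cache
    // hit on its buckets unlikely.
    auto& table = tables[choices[i % choices.size()]];
    benchmark::DoNotOptimize(table.find(static_cast<int64_t>(i % kEntriesPerTable)));
    ++i;
  }
}
BENCHMARK(BM_FindCacheMiss);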

TCMalloc’s suspicious prefetch

TCMalloc has a suspicious prefetch on its allocation path. By the time we have computed which object to return, prefetching that object would be too late: user code could begin using it within a few cycles. Instead, TCMalloc prefetches the next object of the same size class, the one that the next allocation of that size will return. The application will not use that object until it next allocates an object of the same size.
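
The pattern looks roughly like the following sketch; this is our illustration of the idea using the GCC/Clang __builtin_prefetch builtin, not TCMalloc's actual code:

// A singly linked freelist of same-sized objects.
struct FreeObject {
  FreeObject* next;
};

void* PopWithPrefetch(FreeObject*& freelist) {
  FreeObject* result = freelist;
  freelist = result->next;
  // Prefetching `result` here would be too late: the caller will touch it
  // within a few cycles.  Prefetch the object that the next allocation of
  // this size class will hand out instead.
  if (freelist != nullptr) __builtin_prefetch(freelist);
  return result;
}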

This prefetch appears to be extraordinarily costly: microbenchmarks measuring allocation performance show potential savings if it were removed, and Google-Wide Profiling shows 70%+ of cycles in new’s fastpath on the prefetch. Removing it would “reduce” the data center tax, but it would actually hurt application productivity-per-CPU. Time we spend in malloc is less important than application performance.

Trace-driven simulations with hardware-validated architectural simulators showed that the prefetched data was frequently used. Additionally, it is better to stall on a TLB miss at the prefetch site, which has no dependencies, than to stall at the point of use.

Pitfalls

There are a number of things that commonly go wrong when writing benchmarks. The following is a non-exhaustive list:

  • Data being resident. Production workloads have large data footprints; a benchmark with a small footprint may be instruction bound, whereas the true workload could be memory bound. There is a trade-off between adding instructions to save memory costs versus placing data in memory to save instructions.
  • Small instruction cache footprint. Google code typically has a large instruction footprint, whereas benchmarks are often cache resident. The memcmp and TCMalloc examples speak directly to this.
  • Sensitivity to function and branch alignment. Small changes to code layout and branch alignment can have large effects on microbenchmark performance. Changing one piece of code can also shift the alignment of neighboring code that is actually being benchmarked, leading to paradoxical speedups and slowdowns. For example, memcpy has seen 20% swings from this effect. For several years, snappy had “load bearing nops” to perturb the branch alignment of an inner loop into an optimal state. Stabilizer (by Berger, et al.) deliberately perturbs these parameters to improve the statistical quality of benchmarks.
  • Representative data. The data in the benchmark needs to be “similar” to the data in production: for example, imagine having short strings in the benchmark and long strings in the fleet. This also extends to the code paths exercised by the benchmark being similar to the code paths that the application exercises.
  • Benchmarking the right code. It’s very easy to introduce code into the benchmark that is not present in the real workload. For example, the cost of a random number generator used by the benchmark could exceed the cost of the work being benchmarked (see the sketch after this list).
  • Being aware of steady state vs dynamic behaviour. For more complex benchmarks it’s easy to produce something that converges to a steady state, for example one with a constant arrival rate and service time. Production workloads may demonstrate more variability.
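
As an example of the “benchmarking the right code” pitfall, the sketch below (assuming the Google Benchmark framework; the function names are ours) compares a loop that pays for the random number generator on every iteration with one that pre-generates its inputs:

#include <cstdint>
#include <random>
#include <vector>

#include "benchmark/benchmark.h"

// Pitfall: the generator dominates.  This mostly measures std::mt19937,
// not the mask operation we intended to study.
void BM_MaskWithRngInLoop(benchmark::State& state) {
  std::mt19937 rng(42);
  for (auto _ : state) {
    benchmark::DoNotOptimize(rng() & 0xffu);
  }
}
BENCHMARK(BM_MaskWithRngInLoop);

// Pre-generating inputs keeps the generator's cost out of the timed loop.
void BM_MaskPrecomputed(benchmark::State& state) {
  std::mt19937 rng(42);
  std::vector<uint32_t> values(1 << 16);
  for (uint32_t& v : values) v = rng();
  size_t i = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(values[i++ % values.size()] & 0xffu);
  }
}
BENCHMARK(BM_MaskPrecomputed);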

Leveraging benchmarks fully

Focus on writing microbenchmarks that test individual properties of the component under test. Using Search as an example, consider its compute kernel: intersecting posting lists (PLs). A posting list is a sorted sequence of hits, where a hit is an occurrence of a term in a document. When a query like “dog AND cat” is executed, the PLs for the terms “dog” and “cat” are intersected to find the documents that contain both terms.

For PLs, we benchmark the basic operations:

  • PL iterator creation/destruction
  • PL iterator iteration
  • PL iterator advance (jump forward)
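
To make these operations concrete, here is a deliberately simplified model of a posting list and the intersection kernel; the posting list is modeled as a sorted vector of document ids, whereas the real representation is compressed and far more elaborate:

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative-only model: a posting list is a sorted vector of document ids.
using PostingList = std::vector<uint32_t>;

// Intersect two posting lists ("dog AND cat").  `++it` is the iteration
// operation; std::lower_bound models the advance (jump forward) operation.
std::vector<uint32_t> Intersect(const PostingList& a, const PostingList& b) {
  std::vector<uint32_t> result;
  auto it_a = a.begin();
  auto it_b = b.begin();
  while (it_a != a.end() && it_b != b.end()) {
    if (*it_a == *it_b) {
      result.push_back(*it_a);
      ++it_a;
      ++it_b;
    } else if (*it_a < *it_b) {
      it_a = std::lower_bound(it_a, a.end(), *it_b);  // advance `a` to >= *it_b
    } else {
      it_b = std::lower_bound(it_b, b.end(), *it_a);  // advance `b` to >= *it_a
    }
  }
  return result;
}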

Observe production behaviour to determine the appropriate code and functionality to optimize:

  • Profile production jobs to identify which operation is critical to performance.

    In some cases, PL iterator advance is where the majority of CPU cycles are spent. In others, PL iterator creation is as critical as PL iterator advance.

  • After identification of the above, one can start looking for opportunities to:

    • improve PL iterator creation/destruction without pessimizing PL iterator advance
    • improve PL iterator advance without pessimizing PL iterator creation/destruction

    Note: PL iterator iteration is largely irrelevant, at this particular time and for this shape of the system.

After coming up with a strategy and a prototype that improves the two microbenchmarks in isolation, run a more accurate test (single task) to validate that this translates into meaningful improvements. The next steps depend on whether this more accurate benchmark also produces favorable results:

  • If not, it is very important to understand why and fix the microbenchmark: it is likely not measuring what is intended!
  • If yes, then proceed with higher accuracy tests (full cluster/prod) and gather artifacts measuring the impact of the change.

One might wonder why writing the PL iterator iteration benchmark is useful if iteration does not show up in production profiles. One useful piece of information we can extract from this benchmark is the relative performance of iteration versus advance. With that knowledge in hand, one can look for opportunities to use iteration more prominently in the intersection kernel code. For example, it may be more efficient to replace an advance operation with multiple iteration steps (see the sketch below). Pragmatically, even though iteration contributes few cycles, we don’t want its performance to regress to the point where it does take significant time.
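
One hypothetical shape of such a change, reusing the simplified PostingList model from the earlier sketch; the threshold is an assumed tuning constant, not a value from Search:

// If the target is nearby, step linearly (iteration); otherwise fall back
// to a binary-search jump (advance).
PostingList::const_iterator AdvanceTo(const PostingList& pl,
                                      PostingList::const_iterator it,
                                      uint32_t target) {
  constexpr int kLinearSteps = 16;  // assumed tuning constant
  for (int i = 0; i < kLinearSteps; ++i) {
    if (it == pl.end() || *it >= target) return it;
    ++it;
  }
  return std::lower_bound(it, pl.end(), target);
}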

Summary

Benchmarks are a useful way to prototype the impact of code changes on production systems. However, writing benchmarks that accurately reflect production performance is hard. There are multiple pitfalls where the size of the benchmark, or other characteristics of it, may lead to incorrect estimates of production impact. One approach to mitigating this is to have multiple benchmarks of increasing fidelity and complexity. Understanding why a particular benchmark does not produce representative results is a critical step in improving benchmark fidelity, and can even produce insights into production behavior.

