Originally posted as Fast TotW #74 on September 29, 2023
By Chris Kennelly and Matt Kulukundis
Updated 2023-11-10
Quicklink: abseil.io/fast/74
While the issues go by multiple names (e.g., the streetlight effect, the McNamara fallacy), proxy metrics present a seductive danger. Simply put, people often focus on improving the things they can measure and neglect the things they cannot, even when those other things are important. In this episode, we explore several stories of how this can go wrong, to help ground ourselves in the real-world failure modes that proxy metrics present.
Most folks are familiar with the streetlight effect’s titular story:
A police officer sees a drunkard searching for something under a streetlight and asks what they have lost. The drunkard says they lost their keys, and the two look under the streetlight together. After a few minutes the officer asks if they are sure they lost the keys here, and the drunkard replies no, they lost them in the park. The officer asks why they are searching here, and the drunkard replies, “this is where the light is”.
But there is an analog that is equally dangerous.
A trash collector assigned to keep the streets free of litter may simply sweep the litter away from the street lights, thereby leaving the street visibly clean.
In “Profiling a Warehouse-Scale Computer”, Svilen Kanev et al. coined the term “data center tax”: while only a small part of any single application, commonly used libraries add up at scale when viewed horizontally. At the time, in 2015, TCMalloc alone was as large as the biggest single application. This view of the data helps make these horizontal overheads “legible,” bringing order to the chaos of understanding where fleet costs go. The analysis motivated a strategy of reducing these overheads as a relative portion of the fleet.
Nevertheless, there are places where externalities, both positive and negative, get overlooked. Focusing on the easily observed costs can cause us to miss opportunities for greater impact. We’ll start with several examples of where trying to reduce easily understood costs would have led to a worse outcome.
Starting in 2016, work commenced to reduce TCMalloc’s cost. Much of this early work involved making things generally faster: removing instructions, avoiding cache misses, and shortening lock critical sections.
During this process, a prefetch was added on its fast path. Our fleet-wide profiling even indicates that 70%+ of cycles in the `malloc` fast path are spent on that prefetch! Guided by the costs we could easily understand, we might be tempted to remove it. TCMalloc’s fast path would appear cheaper, but other code somewhere else would experience a cache miss and application productivity would decline.
To make matters worse, the cost is partly a profiling artifact. The TLB miss blocks instruction retirement, but our processors are superscalar, out-of-order behemoths. The processor can continue to execute further instructions in the meantime, but this execution is not visible to a sampling profiler like Google-Wide Profiling. IPC in the application may be improved, but not in a way immediately associated with TCMalloc.
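To make the shape of the issue concrete, here is a minimal sketch, not TCMalloc’s actual code, of a freelist-style fast path that prefetches the object it is about to return. The names (`FreeList`, `AllocateFast`) are hypothetical; the point is only that a sampling profiler pins the stall on the prefetch line inside the allocator, rather than on the caller’s first touch of the object.

```cpp
#include <cstddef>

// Hypothetical, simplified freelist; assumes the list is non-empty.
struct FreeList {
  void* head = nullptr;
  void* Pop() {
    void* object = head;
    head = *static_cast<void**>(object);  // next pointer stored in the free object
    return object;
  }
};

void* AllocateFast(FreeList& list) {
  void* object = list.Pop();
  // Warm the cache/TLB for memory the caller is about to use. Sampling
  // profilers attribute the resulting stall to this instruction, on the
  // allocator's fast path, even though it saves the caller a miss later.
  __builtin_prefetch(object, /*rw=*/1);
  return object;
}
```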
Context switches are another place where the easily measured cost is misleading. We can observe their apparent cost by running `perf` and collecting PMU data from the processor, but we run into two problems: running `perf` to collect PMU events perturbs the system we are measuring, and collecting the events increases the overhead of each context switch, because of the bookkeeping the kernel must maintain for them.
The kernel’s cost is legible and has been subject to substantial optimization, but much of the true cost is externalized: changing from one process to another invalidates caches and the TLB, and those invalidations manifest as misses and stalls in ordinary user code, completely disconnected from the context switch itself.
Consider an extreme example. When our hashtable profiler for Abseil’s hashtables indicates a problematic hashtable, a user could switch the offending table from `absl::flat_hash_map` to `std::unordered_map`. Since the profiler doesn’t collect information about `std` containers, the offending table would no longer show up, although the fleet itself would be dramatically worse.
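A hedged sketch of that “fix,” assuming only what is stated above, that the profiler instruments Abseil containers and not `std` ones (the variable names are illustrative):

```cpp
#include <string>
#include <unordered_map>

#include "absl/container/flat_hash_map.h"

// Before: visible to the hashtable profiler, flat, and cache-friendly.
absl::flat_hash_map<std::string, int> request_counts;

// "After": invisible to the profiler, but node-based, with a heap allocation
// per element and extra pointer chasing. The cost did not go away; it just
// moved out from under the streetlight.
std::unordered_map<std::string, int> request_counts_hidden;
```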
While the above example may seem contrived, an almost entirely analogous recommendation comes up with some regularity: migrate users from protos to structs. Due to its ubiquity, Protobuf has a large cost across the fleet.
It is true that structs have benefits in terms of simplicity: they have fewer features, are generally lower cost for the compiler, and lack hasbits. But they also have drawbacks: they lack support for arena allocation strategies, debugging affordances, and centralized optimizations.
Rewriting protobufs as hand-written `struct`s completely eliminates their contribution to our horizontal data, but it merely shifts the costs elsewhere.
Migrations from internal protos to structs can produce very large performance wins, but these usually come from simplifying the surrounding code and dropping unused things. In other words, these migrations are usually catalysts for identifying room-at-the-middle of the stack: changing how APIs are used to make code more efficient.
Using protos alone or using structs alone will likely produce better wins than
mixing and matching as it exposes the code to fewer copies and a smaller set of
vocabulary types.
For example, converting back and forth between `absl::Cord` and `std::string` can prove costly. While hand-rolled data structures might afford better optimizations up front, those optimizations can often be brought back directly into the proto implementation, or, where that is not practical, they suggest that a copy might be needed to interconvert.
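As a small illustration of the copies that mixing vocabulary types can introduce, here is a sketch with a hypothetical hand-rolled struct sitting next to a `Cord`-based API:

```cpp
#include <string>

#include "absl/strings/cord.h"

// Hypothetical hand-written replacement for a proto message.
struct HandRolledRequest {
  std::string payload;
};

HandRolledRequest FromCord(const absl::Cord& cord) {
  HandRolledRequest req;
  // Flattening the Cord into a contiguous std::string copies every chunk.
  req.payload = std::string(cord);
  return req;
}

absl::Cord ToCord(const HandRolledRequest& req) {
  // Converting back copies the bytes again.
  return absl::Cord(req.payload);
}
```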
We can use the problems and pitfalls of legible indicators to find opportunities to surface more data and uncover new optimizations. Thinking about the next metric to improve can help us understand the full scope of a single optimization and help us see the overall picture.
TCMalloc’s heap profiling feature tracks where memory is allocated. For long-lived, reference-counted objects, this view of the data is misleading. The code responsible for holding the last reference to the object may be completely unrelated to where it is allocated.
For example, our RPC framework, Stubby, is one of the largest sources of memory allocations in the fleet. It allocates memory that is held in `absl::Cord`s. Since the framework does the allocation, it looks like it is to blame for the memory usage. In reality, it is the application that holds onto the memory, and identifying why and how the application holds it is the key to reducing the memory footprint.
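A minimal sketch of the misattribution, with hypothetical `framework` and `application` stand-ins: the heap profiler’s allocation stack points into the framework, but the application’s cache is what actually keeps the bytes alive.

```cpp
#include <vector>

#include "absl/strings/cord.h"

namespace framework {
// The framework performs the allocation, so an allocation-site profile
// points here...
absl::Cord ReadResponsePayload() { return absl::Cord("...payload bytes..."); }
}  // namespace framework

namespace application {
// ...but this cache holds the last reference, so it is what determines the
// application's memory footprint.
std::vector<absl::Cord> response_cache;

void HandleResponse() {
  response_cache.push_back(framework::ReadResponsePayload());
}
}  // namespace application
```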
This problem motivated building a profiler for `absl::Cord`s. Rather than solely track where data is allocated, we track where references occur. This allows us to ignore the false positive from Stubby and instead focus on code that holds long-lived `Cord`s.
Similarly, consider what happens when we embed type `A` as a data member in type `B`. Changing `sizeof(A)` indirectly changes `sizeof(B)` and the memory we allocate when we write `new B()` or `std::vector<B>`. Small types and their padding are peanut-buttered across the code base, but in aggregate they can consume large amounts of memory for commonly used types.
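A small sketch of the amplification; the sizes assume a typical LP64 ABI with 8-byte alignment for `int64_t`:

```cpp
#include <cstdint>
#include <vector>

struct A {
  int64_t id;
  bool flag;  // 1 byte of data plus 7 bytes of padding to keep A 8-byte aligned
};
static_assert(sizeof(A) == 16);  // on a typical LP64 ABI

struct B {
  A a;            // B inherits all of A's size and padding...
  int32_t count;  // ...and adds 4 bytes of tail padding of its own.
};
static_assert(sizeof(B) == 24);  // on a typical LP64 ABI

// Every element pays for that padding; shrinking A shrinks every B allocated
// anywhere in the program, not just in this vector.
std::vector<B> MakeBs() { return std::vector<B>(1'000'000); }
```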
The cost of memory allocation is not simply the cost of the memory allocator itself, but also the TLB and cache misses that happen elsewhere in the program. With roughly 40% of cycles backend bound, waiting on caches or memory, the apparent scope for allocator changes is much larger.
With the Temeraire optimization, TCMalloc changed how it placed allocations onto pages in memory. By reducing TLB misses and stalls, this sped up the other parts of the application, broadly improving application throughput, latency, and RAM usage thanks to the better placement decisions. Even though this was an across-the-board win, applications tended to spend more time in TCMalloc in relative terms, making it look like a regression: if an application had spent, say, 5% of its cycles in TCMalloc and the rest of the program got faster, TCMalloc’s relative share would rise even though its absolute cost was unchanged.
libc functions like `memcpy`, `memset`, and `memcmp` are frequently used: every core in Google’s fleet makes several calls to these functions every second. During the development of the implementations now in llvm-libc, we saw that implementations that were “slower” in microbenchmarks produced better application performance. Smaller implementations have less impact on instruction cache misses, both for the `memcpy` itself and for the other useful code they would otherwise evict.
With the advent of optimized instructions like `rep movsb`, we can inline the implementation as a single, two-byte instruction. This instruction uses three registers containing the source, destination, and size, akin to the arguments to `memcpy`, whereas x86-64 callers must otherwise maintain the stack and preserve several registers over calls. Even though `rep movsb` can be outperformed by hand-optimized implementations in microbenchmarks, this strategy can reduce code cache pressure and external overheads.
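For illustration, a hedged sketch of what such an inlined copy could look like with GCC/Clang extended asm on x86-64; a real deployment would gate this on CPU support (e.g., ERMS) and size heuristics rather than using it unconditionally:

```cpp
#include <cstddef>

// Copies `size` bytes from `src` to `dst` with `rep movsb`, which takes its
// destination, source, and count in rdi, rsi, and rcx, mirroring memcpy's
// arguments.
inline void* InlineMemcpy(void* dst, const void* src, std::size_t size) {
  void* ret = dst;
  asm volatile("rep movsb"
               : "+D"(dst), "+S"(src), "+c"(size)
               :
               : "memory");
  return ret;
}
```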
When improving anything (performance, quality, even reliability), do not mistake the measures you have for the actual thing you want to improve. All metrics are proxy metrics for some deeper goal, and we as engineers must always be careful to avoid externalizing costs into areas where we simply don’t measure them. Metrics are not a replacement for good judgment; they are a tool to help us sharpen our judgment.