Originally posted as Fast TotW #74 on September 29, 2023
By Chris Kennelly and Matt Kulukundis
Updated 2023-11-10
Quicklink: abseil.io/fast/74
While the issues go by multiple names (e.g., the streetlight effect, the McNamara fallacy), proxy metrics present a seductive danger. Simply put, people often focus on improving the things they can measure and neglect the things they cannot, even when those other things are important. In this episode, we explore several stories of how this can go wrong, to help ground ourselves in the real-world failure modes that proxy metrics present.
Most folks are familiar with the streetlight effect’s titular story:
A police officer sees a drunkard searching for something under a streetlight and asks what they have lost. The drunkard says they lost their keys, and the two look under the streetlight together. After a few minutes the officer asks if they are sure they lost the keys here, and the drunkard replies no, they lost them in the park. The officer asks why they are searching here, and the drunkard replies, “this is where the light is”.
But there is an analog that is equally dangerous.
A trash collector assigned to keep the streets free of litter may simply sweep the litter away from the street lights, thereby leaving the street visibly clean.
In “Profiling a Warehouse-Scale Computer”, Svilen Kanev et al. coined the term “data center tax”: while only a small part of any single application, commonly used libraries add up at scale when viewed horizontally. At the time, in 2015, TCMalloc alone was as large as the biggest single application. This view of the data helps make these horizontal overheads “legible,” bringing order to the chaos of understanding where fleet costs go. The analysis motivated a strategy of reducing these overheads as a relative portion of the fleet.
Nevertheless, there are places where externalities, both positive and negative, get overlooked. Focusing on the easily observed costs can cause us to miss opportunities for greater impact. We’ll start with several examples of where trying to reduce easily understood costs would have led to a worse outcome.
Starting in 2016, work commenced to reduce TCMalloc’s cost. Much of this early work involved making things generally faster: removing instructions, avoiding cache misses, and shortening lock critical sections.
During this process, a prefetch was added on its fast path. Our fleet-wide profiling even indicates that 70%+ of cycles in the `malloc` fast path are spent on that prefetch! Guided by the costs we could easily understand, we might be tempted to remove it. TCMalloc’s fast path would appear cheaper, but other code somewhere else would experience a cache miss and application productivity would decline.
To make matters worse, the cost is partly a profiling artifact. The TLB miss blocks instruction retirement, but our processors are superscalar, out-of-order behemoths. The processor can continue to execute further instructions in the meantime, but this execution is not visible to a sampling profiler like Google-Wide Profiling. IPC in the application may be improved, but not in a way immediately associated with TCMalloc.
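To make the shape of the issue concrete, here is a minimal sketch, not TCMalloc’s actual code, of a freelist-style fast path that prefetches the object it is about to return. The names (`FreeList`, `AllocateFast`) are hypothetical; the point is only that a sampling profiler pins the stall on the prefetch line inside the allocator, rather than on the caller’s first touch of the object.

```cpp
#include <cstddef>

// Hypothetical, simplified freelist; assumes the list is non-empty.
struct FreeList {
  void* head = nullptr;
  void* Pop() {
    void* object = head;
    head = *static_cast<void**>(object);  // next pointer stored in the free object
    return object;
  }
};

void* AllocateFast(FreeList& list) {
  void* object = list.Pop();
  // Warm the cache/TLB for memory the caller is about to use. Sampling
  // profilers attribute the resulting stall to this instruction, on the
  // allocator's fast path, even though it saves the caller a miss later.
  __builtin_prefetch(object, /*rw=*/1);
  return object;
}
```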
Context switches are another place where the easily measured cost is misleading. We can observe their apparent cost by running `perf` and collecting PMU data from the processor, but we run into two problems: running `perf` to collect PMU events perturbs the system we are measuring, and collecting the events increases the overhead of each context switch, because of the bookkeeping the kernel must maintain for them.
The kernel’s cost is legible and has been subject to substantial optimization, but much of the true cost is externalized: changing from one process to another invalidates caches and the TLB, and those invalidations manifest as misses and stalls in ordinary user code, completely disconnected from the context switch itself.
Consider an extreme example. When our hashtable profiler for Abseil’s hashtables indicates a problematic hashtable, a user could switch the offending table from `absl::flat_hash_map` to `std::unordered_map`. Since the profiler doesn’t collect information about `std` containers, the offending table would no longer show up, although the fleet itself would be dramatically worse.
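A hedged sketch of that “fix,” assuming only what is stated above, that the profiler instruments Abseil containers and not `std` ones (the variable names are illustrative):

```cpp
#include <string>
#include <unordered_map>

#include "absl/container/flat_hash_map.h"

// Before: visible to the hashtable profiler, flat, and cache-friendly.
absl::flat_hash_map<std::string, int> request_counts;

// "After": invisible to the profiler, but node-based, with a heap allocation
// per element and extra pointer chasing. The cost did not go away; it just
// moved out from under the streetlight.
std::unordered_map<std::string, int> request_counts_hidden;
```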
While the above example may seem contrived, an almost entirely analogous recommendation comes up with some regularity: migrate users from protos to structs. Due to its ubiquity, Protobuf has a large cost across the fleet.
It is true that structs have benefits in terms of simplicity: they have fewer features, are generally lower cost for the compiler, and lack hasbits. But they also have drawbacks: they lack support for arena allocation strategies, debugging affordances, and centralized optimizations.
Rewriting protobufs as hand-written `struct`s completely eliminates their contribution to our horizontal data, but it merely shifts the costs elsewhere.
Migrations from internal protos to structs can produce very large performance wins, but these usually come from simplifying the surrounding code and dropping unused things. In other words, these migrations are usually catalysts for identifying room-at-the-middle of the stack: changing how APIs are used to make code more efficient.
Using protos alone or using structs alone will likely produce better wins than
mixing and matching as it exposes the code to fewer copies and a smaller set of
vocabulary types.
For example, converting back and forth between `absl::Cord` and `std::string` can prove costly. While hand-rolled data structures might afford better optimizations up front, those optimizations can often be brought back directly into the proto implementation, or, where that is not practical, they suggest that a copy might be needed to interconvert.
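As a small illustration of the copies that mixing vocabulary types can introduce, here is a sketch with a hypothetical hand-rolled struct sitting next to a `Cord`-based API:

```cpp
#include <string>

#include "absl/strings/cord.h"

// Hypothetical hand-written replacement for a proto message.
struct HandRolledRequest {
  std::string payload;
};

HandRolledRequest FromCord(const absl::Cord& cord) {
  HandRolledRequest req;
  // Flattening the Cord into a contiguous std::string copies every chunk.
  req.payload = std::string(cord);
  return req;
}

absl::Cord ToCord(const HandRolledRequest& req) {
  // Converting back copies the bytes again.
  return absl::Cord(req.payload);
}
```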
We can use the problems and pitfalls of legible indicators to find opportunities to surface more data and uncover new optimizations. Thinking about the next metric to improve can help us understand the full scope of a single optimization and help us see the overall picture.
TCMalloc’s heap profiling feature tracks where memory is allocated. For long-lived, reference-counted objects, this view of the data is misleading. The code responsible for holding the last reference to the object may be completely unrelated to where it is allocated.
For example, our RPC framework, Stubby, is one of the largest sources of memory allocations in the fleet. It allocates memory that is held in `absl::Cord`s. Since the framework does the allocation, it looks like it is to blame for the memory usage. In reality, it is the application that holds onto the memory, and identifying why and how the application holds it is the key to reducing the memory footprint.
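A minimal sketch of the misattribution, with hypothetical `framework` and `application` stand-ins: the heap profiler’s allocation stack points into the framework, but the application’s cache is what actually keeps the bytes alive.

```cpp
#include <vector>

#include "absl/strings/cord.h"

namespace framework {
// The framework performs the allocation, so an allocation-site profile
// points here...
absl::Cord ReadResponsePayload() { return absl::Cord("...payload bytes..."); }
}  // namespace framework

namespace application {
// ...but this cache holds the last reference, so it is what determines the
// application's memory footprint.
std::vector<absl::Cord> response_cache;

void HandleResponse() {
  response_cache.push_back(framework::ReadResponsePayload());
}
}  // namespace application
```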
This problem motivated building a profiler for `absl::Cord`s. Rather than solely track where data is allocated, we track where references occur. This allows us to ignore the false positive from Stubby and instead focus on code that holds long-lived `Cord`s.
Similarly, consider what happens when we embed type `A` as a data member in type `B`. Changing `sizeof(A)` indirectly changes `sizeof(B)` and the memory we allocate when we write `new B()` or `std::vector<B>`. Small types and their padding are peanut-buttered across the code base, but in aggregate they can consume large amounts of memory for commonly used types.
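A small sketch of the amplification; the sizes assume a typical LP64 ABI with 8-byte alignment for `int64_t`:

```cpp
#include <cstdint>
#include <vector>

struct A {
  int64_t id;
  bool flag;  // 1 byte of data plus 7 bytes of padding to keep A 8-byte aligned
};
static_assert(sizeof(A) == 16);  // on a typical LP64 ABI

struct B {
  A a;            // B inherits all of A's size and padding...
  int32_t count;  // ...and adds 4 bytes of tail padding of its own.
};
static_assert(sizeof(B) == 24);  // on a typical LP64 ABI

// Every element pays for that padding; shrinking A shrinks every B allocated
// anywhere in the program, not just in this vector.
std::vector<B> MakeBs() { return std::vector<B>(1'000'000); }
```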
The cost of memory allocation is not simply the cost of the memory allocator itself, but also the TLB and cache misses that happen elsewhere in the program. With roughly 40% of cycles backend bound, waiting on caches or memory, the apparent scope for allocator changes is much larger.
With the Temeraire optimization, TCMalloc changed how it placed allocations onto pages in memory. By reducing TLB misses and stalls, this sped up the other parts of the application, broadly improving application throughput, latency, and RAM usage thanks to the better placement decisions. Even though this was an across-the-board win, applications tended to spend more time in TCMalloc in relative terms, making it look like a regression: if an application had spent, say, 5% of its cycles in TCMalloc and the rest of the program got faster, TCMalloc’s relative share would rise even though its absolute cost was unchanged.
libc functions like `memcpy`, `memset`, and `memcmp` are frequently used: every core in Google’s fleet makes several calls to these functions every second. During the development of the implementations now in llvm-libc, we saw that implementations that were “slower” in microbenchmarks produced better application performance. Smaller implementations have less impact on instruction cache misses, both for the `memcpy` itself and for the other useful code they would otherwise evict.
With the advent of optimized instructions like `rep movsb`, we can inline the implementation as a single, two-byte instruction. This instruction uses three registers containing the source, destination, and size, akin to the arguments to `memcpy`, whereas x86-64 callers must otherwise maintain the stack and preserve several registers over calls. Even though `rep movsb` can be outperformed by hand-optimized implementations in microbenchmarks, this strategy can reduce code cache pressure and external overheads.
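For illustration, a hedged sketch of what such an inlined copy could look like with GCC/Clang extended asm on x86-64; a real deployment would gate this on CPU support (e.g., ERMS) and size heuristics rather than using it unconditionally:

```cpp
#include <cstddef>

// Copies `size` bytes from `src` to `dst` with `rep movsb`, which takes its
// destination, source, and count in rdi, rsi, and rcx, mirroring memcpy's
// arguments.
inline void* InlineMemcpy(void* dst, const void* src, std::size_t size) {
  void* ret = dst;
  asm volatile("rep movsb"
               : "+D"(dst), "+S"(src), "+c"(size)
               :
               : "memory");
  return ret;
}
```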
When improving anything (performance, quality, even reliability), do not mistake the measures you have for the actual thing you want to improve. All metrics are proxy metrics for some deeper goal, and we as engineers must always be careful to avoid externalizing costs into areas where we simply don’t measure them. Metrics are not a replacement for good judgment; they are a tool to help us sharpen our judgment.