Originally posted as Fast TotW #62 on July 7, 2022
By Chris Kennelly, Luis Otero and Carlos Villavieja
Updated 2024-09-04
Quicklink: abseil.io/fast/62
To accomplish useful work with processors, programs need to access main memory. The rate at which programs can transfer data to and from main memory has not grown as fast as processors have gained per-core performance and core counts. The Memory Wall, long predicted, is here. Memory bandwidth is increasingly the bottleneck for how much useful work modern data centers can accomplish with their CPUs. Optimizing for memory bandwidth productivity can make code faster–by reducing memory stalls–and RAM footprints smaller. In this episode, we discuss techniques for identifying hotspots and reducing their impact.
Reads and writes while a program is running trigger accesses to memory. Accessing RAM is slow–approximately 100 nanoseconds–so processors have added layers of caches between the execution units and the physical modules of RAM. On x86, this started with a single kilobyte-sized cache in the 386 era, but modern processors have several layers of caches (L1/L2/L3). The closer a cache is to the processor, the smaller but faster it typically is.
The processor’s caches work at a granularity of a cache line, typically a 64-byte chunk of data. This simplifies the hardware–which can track whether a cache line is dirty and needs to be written to main memory at the cache-line level–rather than, say, tracking byte-by-byte. This abstraction is also useful in practice: common primitive data types like int and T* all span several bytes. Accessing the cache line from main memory is an all-or-nothing affair: loading any one byte means the whole line is read from memory.
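As a small concrete illustration (the 64-byte line size and type sizes below are typical assumptions for 64-bit x86, not guarantees):

#include <cstddef>
#include <cstdint>

// Assumed, typical values: 64-byte cache lines, 4-byte ints, 8-byte pointers.
inline constexpr std::size_t kAssumedCacheLineBytes = 64;

// Loading any one byte of a line brings in all of it, so a single line
// transfer delivers a handful of primitive values at once.
inline constexpr std::size_t kIntsPerLine =
    kAssumedCacheLineBytes / sizeof(std::int32_t);  // 16
inline constexpr std::size_t kPointersPerLine =
    kAssumedCacheLineBytes / sizeof(void*);  // 8 on typical 64-bit targets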
In addition to a program’s loads and stores, the processor can issue prefetches for data at the hardware level. These prefetches need to be timely: the data needs to be ready before the program actually needs it, but not so early that it is evicted from the cache before it can be used. Prefetches also need to be useful: pulling in a cache line that ends up being unaccessed consumes memory bandwidth without the benefit of avoiding a cache miss on the critical path.
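For intuition about what hardware prefetchers can and cannot help with, here is a minimal sketch (hypothetical functions; the actual benefit depends on the specific CPU’s prefetchers):

#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential traversal: the address stream is predictable, so hardware
// prefetchers can usually have the next cache lines ready in time.
std::int64_t SumSequential(const std::vector<std::int64_t>& data) {
  std::int64_t sum = 0;
  for (std::int64_t v : data) sum += v;
  return sum;
}

// Traversal in a random order: each access lands on an unpredictable cache
// line, so prefetches are rarely timely or useful, and many accesses pay the
// full miss latency once the data no longer fits in cache.
std::int64_t SumShuffled(const std::vector<std::int64_t>& data,
                         const std::vector<std::size_t>& shuffled_indices) {
  std::int64_t sum = 0;
  for (std::size_t i : shuffled_indices) sum += data[i];
  return sum;
}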
The processor’s memory bus can handle billions of transfers per second, limited by technology, power requirements, and physics. While cache sizes and the memory bus differ across architectures, the same general principles apply. At the socket level, this gives the processor on the order of tens to hundreds of GB/second of memory bandwidth. These windows to transfer data are akin to seats on an airplane, though: if the airline oversells a flight and everyone shows up at the airport, several passengers will be forced to wait for the next flight (in the case of CPUs there aren’t any refunds). Bursts in memory accesses, especially across many cores, can saturate the memory bus, even if the average over time is relatively low. Similarly, once a plane takes off, any unsold seats are wasted and cannot aid future bursts.
This perspective suggests two paths to reducing how much data needs to move to and from main memory: 1) access fewer things; 2) increase the reuse of the data the program accesses.
Making a program’s working set more cacheable makes it less bottlenecked on memory bandwidth, and fewer cache misses mean lower memory latency, allowing the program to speed up.
We can use data collected with the perf tool on a locally executed process. While LLC misses are only an approximation–this event only tracks misses on loads–they can be quite effective for finding and fixing bottlenecks. Hardware performance counter limitations largely restrict us to load events if we also want precise attribution. Attribution to functions is what allows us to find source-code-level optimizations. A cache miss on a load is most likely a read transaction on the memory bus and also probably corresponds to a cache eviction at some earlier point in time. The absolute numbers of misses are less important for finding places to look for optimizations than their relative rankings.
A key part of optimization work is measuring the outcomes. While production is ultimately what matters, smaller scale proxies such as microbenchmarks can help with exploring the solution space.
One challenge of optimizing for cache misses with a microbenchmark is that the working set may be too small, so its data fits neatly into the processor’s caches. One strategy used for Abseil’s hashtables is to have "hot" and "cold" microbenchmarks. The "hot" versions stress a single instance, whose data footprint fits into the processor’s L1 cache. The "cold" versions work with many instances, ensuring the working set does not fit into the cache. On modern machines, the cache hierarchy can be several hundred MBs. While an individual service on a multitenant machine may have to contend with neighbors also using cache space, a microbenchmark run on an otherwise idle machine can use the processor’s caches more completely. While neither of these extremes exactly represents production, they show the contrast between the computational (arithmetic) costs and the memory access costs of hashtable operations.
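A minimal sketch of the idea (hypothetical benchmarks, not the actual fleetbench ones; sizes chosen only for illustration):

#include <cstddef>
#include <cstdint>
#include <vector>

#include "absl/container/flat_hash_set.h"
#include "benchmark/benchmark.h"

// "Hot": a single small set is probed repeatedly, so its footprint stays
// resident in the L1 cache.
static void BM_FindMiss_Hot(benchmark::State& state) {
  absl::flat_hash_set<std::int64_t> set;
  for (std::int64_t i = 0; i < 16; ++i) set.insert(i);
  std::int64_t absent_key = 1 << 20;
  for (auto _ : state) {
    benchmark::DoNotOptimize(set.contains(absent_key++));
  }
}
BENCHMARK(BM_FindMiss_Hot);

// "Cold": many sets are probed round-robin, so the combined working set is
// far larger than the caches and most probes miss.
static void BM_FindMiss_Cold(benchmark::State& state) {
  constexpr std::size_t kNumSets = 1 << 16;
  std::vector<absl::flat_hash_set<std::int64_t>> sets(kNumSets);
  for (auto& set : sets) {
    for (std::int64_t i = 0; i < 16; ++i) set.insert(i);
  }
  std::int64_t absent_key = 1 << 20;
  std::size_t index = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(sets[index].contains(absent_key++));
    index = (index + 1) % kNumSets;
  }
}
BENCHMARK(BM_FindMiss_Cold);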
Running perf stat -d path/to/benchmark can provide a quick summary of important performance counters, including cache misses.
$ bazel build -c opt //fleetbench/swissmap:all
$ perf stat -d bazel-bin/fleetbench/swissmap/hot_swissmap_benchmark \
--benchmark_filter=BM_FindMiss_Hot.*flat.*64.*16.*0
-----------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------------------
BM_FindMiss_Hot<::absl::flat_hash_set, 64>/set_size:16/density:0 1.36 ns 1.36 ns 518389760
Performance counter stats for 'bazel-bin/fleetbench/swissmap/hot_swissmap_benchmark...':
1,271.48 msec task-clock # 0.984 CPUs utilized
13 context-switches # 10.224 /sec
0 cpu-migrations # 0.000 /sec
435 page-faults # 342.120 /sec
5,299,599,301 cycles # 4.168 GHz (49.96%)
22,698,862,872 instructions # 4.28 insn per cycle (62.86%)
2,843,135,458 branches # 2.236 G/sec (63.17%)
2,161,178 branch-misses # 0.08% of all branches (63.20%)
5,548,863,804 L1-dcache-loads # 4.364 G/sec (63.19%)
1,304,138 L1-dcache-load-misses # 0.02% of all L1-dcache accesses (62.31%)
153,564 LLC-loads # 120.775 K/sec (49.10%)
90,977 LLC-load-misses # 59.24% of all LL-cache accesses (49.07%)
$ perf stat -d bazel-bin/fleetbench/swissmap/cold_swissmap_benchmark \
--benchmark_filter=BM_FindMiss_Cold.*flat.*64.*16.*0
------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------------------------
BM_FindMiss_Cold<::absl::flat_hash_set, 64>/set_size:16/density:0 22.5 ns 22.4 ns 37748952
Performance counter stats for 'bazel-bin/fleetbench/swissmap/cold_swissmap_benchmark...':
1,502.60 msec task-clock # 0.984 CPUs utilized
17 context-switches # 11.314 /sec
3 cpu-migrations # 1.997 /sec
111,346 page-faults # 74.102 K/sec
4,051,264,336 cycles # 2.696 GHz (50.00%)
4,765,057,295 instructions # 1.18 insn per cycle (62.51%)
567,678,472 branches # 377.799 M/sec (62.51%)
6,199,555 branch-misses # 1.09% of all branches (62.50%)
982,319,844 L1-dcache-loads # 653.749 M/sec (62.78%)
157,288,870 L1-dcache-load-misses # 16.01% of all L1-dcache accesses (62.51%)
19,126,293 LLC-loads # 12.729 M/sec (49.96%)
15,576,980 LLC-load-misses # 81.44% of all LL-cache accesses (49.73%)
The cold benchmark’s large working set does not fit into cache, leading to the corresponding increases in L1 and LLC misses. Due to these cache misses, the same operation–looking up an absent key in a hashtable–went from ~1.36 nanoseconds to ~22.5 nanoseconds. Worse cache locality doesn’t just consume more memory bandwidth; it also hampers the performance and CPU efficiency of programs.
Additionally, Google’s C++ microbenchmark tool (https://github.com/google/benchmark/blob/main/docs/user_guide.md) supports collecting hardware performance counters. This can be helpful for isolating the counters to when the benchmark is actually running and to individual benchmark operations. In contrast, perf stat captures the benchmark harness’s own setup, and its numbers cover the entirety of all microbenchmark cases being run.
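For example (a sketch; it assumes the benchmark library was built with libpfm support, and the flag and available counter names vary by version and platform), a benchmark binary like the following can be run with --benchmark_perf_counters=CYCLES,INSTRUCTIONS to report those counters per benchmark:

#include <cstdint>
#include <vector>

#include "benchmark/benchmark.h"

// A hypothetical benchmark. When the library is built with libpfm support,
// invoking the binary with
//   --benchmark_perf_counters=CYCLES,INSTRUCTIONS
// attributes the requested counters to the measured region of each benchmark,
// excluding the harness's setup code.
static void BM_SumVector(benchmark::State& state) {
  std::vector<std::int64_t> data(state.range(0), 1);
  for (auto _ : state) {
    std::int64_t sum = 0;
    for (std::int64_t v : data) sum += v;
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK(BM_SumVector)->Arg(1 << 20);

BENCHMARK_MAIN();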
Our goal is to maximize the productivity of our accesses to cache lines, making sure our programs accomplish as much useful work as possible for each memory access. This is not just useful for reducing memory bandwidth needs, but it can improve CPU and RAM productivity as well.
Consider a simple linked list containing integers:
struct Node {
  Node* next;
  int value;
};
In memory, this will fill part of a cache line (16 bytes out of 64) and each node is likely to fall on its own cache line, separate from other nodes. Over time, a program’s memory allocations will trend towards being random due to the mix of allocations and deletions that have left gaps.
This is inefficient to access in several ways:

*   The next pointers, while important to the linked list, aren't holding useful data.
*   Unlike a std::vector, where accessing the first element means it likely also accessed the cache line for the second and so on, each list node must be fetched from its own, unpredictable cache line.

While std::list is quite rare, many containers have similar implicit linked-list-like layouts. Matt Kulukundis discussed this in his 2017 CppCon presentation on Abseil’s hashtables: Containers like std::unordered_map and, to a lesser extent, absl::node_hash_map achieve pointer stability by separately allocating each key-value pair. This guarantee comes at a cost, since the data structure requires an extra memory access to get to the logical value the program is accessing.
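For instance, a minimal sketch (the CountWords helper is hypothetical): absl::flat_hash_map stores its key-value pairs inline in the table’s backing array, so each lookup avoids the extra pointer chase that node-based tables pay for pointer stability.

#include <string>
#include <vector>

#include "absl/container/flat_hash_map.h"

// With a flat table, the value lives in the table's own storage rather than
// in a separately allocated node, so incrementing a count touches one fewer
// cache line than it would with std::unordered_map or absl::node_hash_map.
absl::flat_hash_map<std::string, int> CountWords(
    const std::vector<std::string>& words) {
  absl::flat_hash_map<std::string, int> counts;
  for (const std::string& word : words) {
    ++counts[word];
  }
  return counts;
}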
Similarly, one common pattern is std::vector<std::unique_ptr<T>>, where there is an indirection that makes logically adjacent objects not colocated in memory. It is important to look with a critical eye at whether this indirection is necessary. Sometimes it exists because T is not copyable/movable. In that case it may be possible to remove the indirection by fixing the type semantics or by using a container type that allows non-movable content like absl::FixedArray or std::deque. In other cases the indirection may be used to exercise polymorphic behavior, and techniques like tag-based dispatch may help. There is no one-size-fits-all solution for this, but practical options do exist.
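As a sketch (the Connection type is hypothetical), a non-movable type can often be stored directly in a std::deque, which constructs elements in place and never relocates them, instead of behind a layer of std::unique_ptr:

#include <deque>
#include <memory>
#include <vector>

// Hypothetical type that is neither copyable nor movable.
class Connection {
 public:
  explicit Connection(int fd) : fd_(fd) {}
  Connection(const Connection&) = delete;
  Connection& operator=(const Connection&) = delete;

  int fd() const { return fd_; }

 private:
  int fd_;
};

// Indirect: every Connection is a separate heap allocation, so iterating over
// the vector chases a pointer to a scattered cache line per element.
std::vector<std::unique_ptr<Connection>> MakeIndirect(int n) {
  std::vector<std::unique_ptr<Connection>> connections;
  for (int i = 0; i < n; ++i) {
    connections.push_back(std::make_unique<Connection>(i));
  }
  return connections;
}

// Direct: std::deque constructs elements in place and never moves them, so a
// non-movable type is fine, references stay stable, and neighboring elements
// share cache lines within each block.
std::deque<Connection> MakeDirect(int n) {
  std::deque<Connection> connections;
  for (int i = 0; i < n; ++i) {
    connections.emplace_back(i);
  }
  return connections;
}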
Copies are sometimes needed for API or correctness reasons, but extraneous ones require the processor to do more work. For example:
void DoStuff(Protobuf);

void Process(const std::vector<Protobuf>& elements) {
  for (Protobuf element : elements) {  // Creates a copy of each element.
    DoStuff(element);
  }
}
Iterating by const reference avoids the copy:

void DoStuff(Protobuf);

void Process(const std::vector<Protobuf>& elements) {
  for (const Protobuf& element : elements) {  // Avoids a copy.
    DoStuff(element);
  }
}
The first version copies each element from the vector and then processes that copy, when no copy would have sufficed. This increases its working set–it has all of elements plus the copied element–and increases pressure on the processor’s cache, making it more likely that the program will trigger evictions and misses. In addition to the demands on the memory system, the program is also less productive in terms of its CPU usage–copying isn’t helping us process faster–and RAM usage–its extraneous copy isn’t needed.
While this is just an example–and Clang Tidy can recommend fixes for this one automatically–copies of much larger data structures such as protobufs can commonly arise.
void Populate(MyProto* msg, int n) {
  for (int i = 0; i < n; ++i) {
    MySubmessage s;
    s.set_value(i);
    s.mutable_foo()->set_bar(i);
    // Copies s, including its complex heap-allocated structure.
    *msg->add_submessage() = s;
  }
}
Populating data in its final destination can minimize the working set size and improve efficiency:
void Populate(MyProto* msg, int n) {
  for (int i = 0; i < n; ++i) {
    // Places s directly in its final destination.
    MySubmessage& s = *msg->add_submessage();
    s.set_value(i);
    s.mutable_foo()->set_bar(i);
  }
}
Similarly, many containers guarantee amortized growth, growing by a multiplicative factor to achieve this guarantee. With each step in growth, we allocate another buffer, move elements over, and deallocate the old one. Where we know the final size, a call to reserve can avoid these copies and reduce our working set size.
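A small sketch (the Squares function is hypothetical):

#include <vector>

// Without the reserve call, the vector reallocates and moves its contents
// O(log n) times as it grows; reserving up front does one allocation and
// keeps the working set to a single buffer.
std::vector<int> Squares(int n) {
  std::vector<int> result;
  result.reserve(n);
  for (int i = 0; i < n; ++i) {
    result.push_back(i * i);
  }
  return result;
}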
As explained above, allocating objects contiguously can be particularly beneficial for memory bandwidth, especially if the access patterns require us to traverse objects in sequence. However, the benefits only materialize to the extent that the traversed objects are small enough to pack neatly into cache lines.
In general, densifying data in accordance with its access patterns can reduce an application’s memory bandwidth consumption. This lets us use each cache line we access to its fullest. It also helps because prefetchers tend to bring contiguous or easy-to-predict cache lines into the cache.
Based on this principle and its implications, consider the following guidelines when writing an application:
*   Use the narrowest representation that fits the data: an int8_t or int16_t, rather than an int64_t. Representing a small set of values is best done with a narrow enum, rather than a std::string (see the sketch after this list).
*   When traversals only touch a subset of fields, prefer a structure-of-arrays (SoA) layout over an array-of-structures (AoS) layout, so that the fields actually being accessed are packed densely into cache lines.
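A small sketch of the first guideline (the record types are hypothetical):

#include <cstdint>
#include <string>

// Wasteful: 8 bytes for a value that never exceeds 100, and a heap-allocating
// std::string for a field with only a few possible states.
struct WideRecord {
  std::int64_t percentage;  // Always in [0, 100].
  std::string status;       // Always "ok", "retry", or "failed".
};

// Dense: the same information in two bytes, so many more records fit in each
// cache line.
enum class Status : std::uint8_t { kOk, kRetry, kFailed };
struct NarrowRecord {
  std::int8_t percentage;  // [0, 100] fits comfortably.
  Status status;
};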
As a concrete example of SoA vs. AoS, consider the following code snippet (AoS):
struct Point3D {
  float x;
  float y;
  float z;
};

float AddAllY(const std::vector<Point3D>& elements) {
  float result = 0;
  for (const auto& element : elements) {
    result += element.y;
  }
  return result;
}
A more cache-friendly alternative could be the following (SoA):
struct PointList3D {
  std::vector<float> x;
  std::vector<float> y;
  std::vector<float> z;
};

float AddAllY(const PointList3D& elements) {
  float result = 0;
  for (auto element : elements.y) {  // Traverse only the array of "y" coordinates.
    result += element;
  }
  return result;
}
As with CPU and RAM usage, memory bandwidth is another dimension we can optimize for, maximizing the amount of useful work we do with the minimum amount of memory bandwidth. Reducing cache misses and improving data locality can also improve CPU performance and reduce required RAM footprints.