Performance Tip of the Week #62: Identifying and reducing memory bandwidth needs

Originally posted as Fast TotW #62 on July 7, 2022

By Chris Kennelly, Luis Otero and Carlos Villavieja

Updated 2024-09-04

Quicklink: abseil.io/fast/62

To accomplish useful work with processors, programs need to access main memory. The rate at which programs can transfer data to and from main memory has not kept pace with growth in per-core performance and core counts. The Memory Wall, long predicted, is here. Memory bandwidth is increasingly the bottleneck for how much useful work modern data centers can accomplish with their CPUs. Optimizing for memory bandwidth productivity can make code faster, by reducing memory stalls, and RAM footprints smaller. In this episode, we discuss techniques for identifying hotspots and reducing their impact.

Memory bandwidth basics

Reads and writes while a program is running trigger accesses to memory. Accessing RAM is slow–approximately 100 nanoseconds–so processors have added layers of caches in between the execution units and the physical modules of RAM. On x86, this started with a single kilobyte-sized cache on the 386, but modern processors have several layers of caches (L1/L2/L3). The closer to the processor, the smaller but faster the caches typically are.

The processor's caches work at the granularity of a cache line, typically a 64-byte chunk of data. This simplifies the hardware, which can track whether a line is dirty and needs to be written back to main memory at the cache-line level rather than, say, byte by byte. The abstraction is also useful in practice: common primitive data types like int and T* all span several bytes. Accessing a cache line from main memory is an all-or-nothing affair: loading any one byte means the whole line is read from memory.
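A back-of-the-envelope sketch of what that granularity means for a common type (assuming 64-byte lines and 4-byte ints):

#include <cstdint>

// One 64-byte line holds 16 4-byte ints: loading any one of them transfers
// all 16 from memory, and a sequential scan of an int array only touches a
// new line every 16 elements.
constexpr int kCacheLineBytes = 64;
constexpr int kIntsPerLine = kCacheLineBytes / sizeof(int32_t);
static_assert(kIntsPerLine == 16);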

In addition to a program's loads and stores, the processor can issue prefetches for data at the hardware level. These prefetches need to be timely: the data needs to be ready before the program actually needs it, but not fetched so early that it is evicted from the cache before it can be used. Prefetches also need to be useful: pulling in a cache line that ends up being unaccessed consumes memory bandwidth without the benefit of avoiding a cache miss on the critical path.
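Software can make the same tradeoff explicit with compiler builtins. As a hedged sketch (the loop and prefetch distance are illustrative, and a simple sequential scan like this is normally covered by the hardware prefetcher anyway), GCC and Clang expose __builtin_prefetch, and the same timeliness and usefulness constraints apply:

#include <cstdint>

int64_t Sum(const int32_t* data, int n) {
  // Illustrative prefetch distance: too small and the data isn't ready in
  // time; too large and the line may be evicted before it is used, wasting
  // bandwidth.
  constexpr int kPrefetchDistance = 16;
  int64_t total = 0;
  for (int i = 0; i < n; ++i) {
    if (i + kPrefetchDistance < n) {
      __builtin_prefetch(&data[i + kPrefetchDistance]);
    }
    total += data[i];
  }
  return total;
}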

The processor's memory bus can handle billions of transfers per second, limited by technology, power requirements, and physics. While cache sizes and memory buses differ across architectures, the same general principles apply. At the socket level, this gives the processor tens to hundreds of GB/second of memory bandwidth. These windows to transfer data are akin to seats on an airplane, though: if the airline oversells a flight and everyone shows up at the airport, several passengers will be forced to wait for the next flight (in the case of CPUs, there aren't any refunds). Bursts in memory accesses, especially across many cores, can saturate the memory bus, even if the average over time is relatively low. Similarly, once a plane takes off, any unsold seats are wasted and cannot absorb future bursts.

This perspective suggests two paths to reducing how much data needs to move to and from main memory: 1) access fewer things; 2) increase the reuse of the data the program accesses.

Making a program's working set more cacheable leaves it less bottlenecked on memory bandwidth, and fewer cache misses mean lower memory latency, allowing the program to speed up.

Finding bottlenecks

We can use data collected with the perf tool on a locally executed process.

While LLC load misses are only an approximation (the event only tracks misses on loads), they can be quite effective for finding and fixing bottlenecks. Limitations of hardware performance counters largely restrict us to load events if we also want precise attribution, and attribution to functions is what allows us to find source-code-level optimizations. A cache miss on a load most likely corresponds to a read transaction on the memory bus, and usually to a cache eviction at some earlier point in time as well. The absolute numbers of misses are less important for finding places to look for optimizations than their relative rankings.
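To get that per-function attribution on a local run, one option is to sample the same event with perf record and view the ranking with perf report. A sketch (the binary path and sampling period are placeholders, and exact event names vary by CPU):

$ perf record -e LLC-load-misses -c 1009 -- path/to/binary
$ perf report --sort=symbol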

How to measure

A key part of optimization work is measuring the outcomes. While production is ultimately what matters, smaller scale proxies such as microbenchmarks can help with exploring the solution space.

One challenge when optimizing for cache misses with a microbenchmark is that the working set may be too small, so its data fits neatly into the processor's caches. One strategy, used for Abseil's hashtables, is to have "hot" and "cold" microbenchmarks. The "hot" versions stress a single instance whose data footprint fits into the processor's L1 cache. The "cold" versions work with many instances, ensuring the working set does not fit into the cache. On modern machines, the cache hierarchy can be several hundred MBs, and while an individual service on a multitenant machine may have to contend with neighbors also using cache space, a microbenchmark run on an otherwise idle machine can use the processor's caches more completely. Neither of these extremes exactly represents production, but together they show the contrast between the computational (arithmetic) cost and the memory access cost of hashtable operations.
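The same pattern can be sketched with the benchmark library. This is a simplified stand-in for illustration (the instance counts are arbitrary), not the actual Fleetbench swissmap benchmarks:

#include <cstddef>
#include <vector>

#include "absl/container/flat_hash_set.h"
#include "benchmark/benchmark.h"

// "Hot": one small table that stays cache-resident across iterations.
void BM_FindMiss_Hot(benchmark::State& state) {
  absl::flat_hash_set<int> set;
  for (int i = 0; i < 16; ++i) set.insert(i);
  for (auto _ : state) {
    benchmark::DoNotOptimize(set.contains(-1));
  }
}
BENCHMARK(BM_FindMiss_Hot);

// "Cold": many tables touched round-robin, so the working set is far larger
// than the fast levels of the cache hierarchy.
void BM_FindMiss_Cold(benchmark::State& state) {
  std::vector<absl::flat_hash_set<int>> sets(1 << 16);
  for (auto& set : sets) {
    for (int i = 0; i < 16; ++i) set.insert(i);
  }
  std::size_t next = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(sets[next].contains(-1));
    next = (next + 1) % sets.size();
  }
}
BENCHMARK(BM_FindMiss_Cold);

BENCHMARK_MAIN();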

perf stat -d path/to/benchmark can provide a quick summary of important performance counters, including cache misses.

$ bazel build -c opt //fleetbench/swissmap:all

$ perf stat -d bazel-bin/fleetbench/swissmap/hot_swissmap_benchmark \
    --benchmark_filter=BM_FindMiss_Hot.*flat.*64.*16.*0

-----------------------------------------------------------------------------------------------------------
Benchmark                                                                 Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------

BM_FindMiss_Hot<::absl::flat_hash_set, 64>/set_size:16/density:0       1.36 ns         1.36 ns    518389760

 Performance counter stats for 'bazel-bin/fleetbench/swissmap/hot_swissmap_benchmark...':

          1,271.48 msec task-clock                #    0.984 CPUs utilized
                13      context-switches          #   10.224 /sec
                 0      cpu-migrations            #    0.000 /sec
               435      page-faults               #  342.120 /sec
     5,299,599,301      cycles                    #    4.168 GHz                      (49.96%)
    22,698,862,872      instructions              #    4.28  insn per cycle           (62.86%)
     2,843,135,458      branches                  #    2.236 G/sec                    (63.17%)
         2,161,178      branch-misses             #    0.08% of all branches          (63.20%)
     5,548,863,804      L1-dcache-loads           #    4.364 G/sec                    (63.19%)
         1,304,138      L1-dcache-load-misses     #    0.02% of all L1-dcache accesses  (62.31%)
           153,564      LLC-loads                 #  120.775 K/sec                    (49.10%)
            90,977      LLC-load-misses           #   59.24% of all LL-cache accesses  (49.07%)

$ perf stat -d bazel-bin/fleetbench/swissmap/cold_swissmap_benchmark \
    --benchmark_filter=BM_FindMiss_Cold.*flat.*64.*16.*0

------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------

BM_FindMiss_Cold<::absl::flat_hash_set, 64>/set_size:16/density:0       22.5 ns         22.4 ns     37748952

 Performance counter stats for 'bazel-bin/fleetbench/swissmap/cold_swissmap_benchmark...':

          1,502.60 msec task-clock                #    0.984 CPUs utilized
                17      context-switches          #   11.314 /sec
                 3      cpu-migrations            #    1.997 /sec
           111,346      page-faults               #   74.102 K/sec
     4,051,264,336      cycles                    #    2.696 GHz                      (50.00%)
     4,765,057,295      instructions              #    1.18  insn per cycle           (62.51%)
       567,678,472      branches                  #  377.799 M/sec                    (62.51%)
         6,199,555      branch-misses             #    1.09% of all branches          (62.50%)
       982,319,844      L1-dcache-loads           #  653.749 M/sec                    (62.78%)
       157,288,870      L1-dcache-load-misses     #   16.01% of all L1-dcache accesses  (62.51%)
        19,126,293      LLC-loads                 #   12.729 M/sec                    (49.96%)
        15,576,980      LLC-load-misses           #   81.44% of all LL-cache accesses  (49.73%)

The cold benchmark's much larger working set does not fit into the caches, leading to corresponding increases in L1 and LLC misses. Due to those cache misses, the same operation (looking up an absent key in a hashtable) went from ~1.36 nanoseconds to ~22.5 nanoseconds. Worse cache locality doesn't just consume more memory bandwidth; it also hampers the performance and CPU efficiency of programs.

Additionally, Google’s C++ microbenchmark tool (https://github.com/google/benchmark/blob/main/docs/user_guide.md) supports collecting hardware performance counters. This can be helpful for isolating the counters to when the benchmark is actually running and to individual benchmark operations. In contrast, perf stat captures the benchmark harness’ own setup and its numbers cover the entirety of all microbenchmark cases being run.
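For example, if the benchmark binary was built with libpfm support, the library accepts a --benchmark_perf_counters flag; the counter names available depend on libpfm and the host CPU:

$ bazel-bin/fleetbench/swissmap/cold_swissmap_benchmark \
    --benchmark_filter=BM_FindMiss_Cold.*flat.*64.*16.*0 \
    --benchmark_perf_counters=CYCLES,INSTRUCTIONS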

Optimization strategies

Our goal is to maximize the productivity of our accesses to cache lines, making sure our programs accomplish as much useful work as possible for each memory access. This is not just useful for reducing memory bandwidth needs, but it can improve CPU and RAM productivity as well.

Avoid unnecessary indirections

Consider a simple linked list containing integers:

struct Node { Node* next; int value; };

In memory, this will fill part of a cache line (16 bytes out of 64) and each node is likely to fall on its own cache line, separate from other nodes. Over time, a program’s memory allocations will trend towards being random due to the mix of allocations and deletions that have left gaps.

This is inefficient to access in several ways:

  • The next pointers, while important to the linked list, aren’t holding useful data.
  • Accessing a cache line pulls in only a single 4-byte integer of payload at a time. If the program used a contiguous data structure like std::vector, accessing the first element would likely also pull in the cache line holding the second, and so on.
  • Due to alignment requirements, each node has 4 bytes of padding going unused (the sketch below makes this concrete).
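A minimal sketch making those numbers concrete (assuming a typical 64-bit target):

#include <vector>

struct Node {        // the Node from above
  Node* next;        // 8 bytes
  int value;         // 4 bytes
};                   // + 4 bytes of tail padding to preserve 8-byte alignment
static_assert(sizeof(Node) == 16 && alignof(Node) == 8);

// A contiguous alternative for the same payload: 16 values per 64-byte cache
// line, with no next pointers and no padding.
std::vector<int> values;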

While std::list is quite rare, many containers have similar, implicitly linked-list-like layouts. Matt Kulukundis discussed this in his 2017 CppCon presentation on Abseil's hashtables: containers like std::unordered_map and, to a lesser extent, absl::node_hash_map achieve pointer stability by separately allocating each key-value pair. This guarantee comes at a cost, since the data structure requires an extra memory access to reach the logical value the program is accessing.
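A sketch of the tradeoff (the key and value types here are arbitrary):

#include <string>

#include "absl/container/flat_hash_map.h"
#include "absl/container/node_hash_map.h"

// Each key-value pair lives in its own heap allocation: lookups follow an
// extra pointer (often an extra cache miss), but references to elements stay
// valid across rehashes.
absl::node_hash_map<int, std::string> stable;

// Pairs are stored directly in the table's backing array: one less
// indirection per lookup, at the cost of pointer stability.
absl::flat_hash_map<int, std::string> dense;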

Similarly, one common pattern is std::vector<std::unique_ptr<T>>, where the indirection means logically adjacent objects are not colocated in memory. It is important to look with a critical eye at whether this indirection is necessary. Sometimes it exists because T is not copyable or movable; in that case it may be possible to remove the indirection by fixing the type's semantics or by using a container that allows non-movable contents, like absl::FixedArray or std::deque. In other cases the indirection is used for polymorphic behavior, and techniques like tag-based dispatch may help. There is no one-size-fits-all solution, but practical options do exist.
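As a sketch of removing that indirection when the element type merely needs to stay put (the type below is hypothetical):

#include <deque>
#include <memory>
#include <vector>

struct Sample {
  explicit Sample(int v) : value(v) {}
  Sample(const Sample&) = delete;             // neither copyable nor movable
  Sample& operator=(const Sample&) = delete;
  int value;
};

// Indirect: each Sample is a separate heap allocation, so a traversal chases
// a pointer (and likely takes a cache miss) per element.
std::vector<std::unique_ptr<Sample>> indirect;

// Direct: std::deque stores elements in larger contiguous blocks and never
// relocates them, so it can hold non-movable types with no per-element
// indirection. Elements are constructed in place, e.g. direct.emplace_back(42).
std::deque<Sample> direct;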

Minimize copies

Copies are sometimes needed for API or correctness reasons, but extraneous ones require the processor to do more work. For example:

void DoStuff(const Protobuf& proto);

void Process(const std::vector<Protobuf>& elements) {
  for (Protobuf element : elements) {  // Creates a copy of each element.
    DoStuff(element);
  }
}

Taking the loop variable by const reference avoids the copy:

void DoStuff(const Protobuf& proto);

void Process(const std::vector<Protobuf>& elements) {
  for (const Protobuf& element : elements) {  // Avoids a copy.
    DoStuff(element);
  }
}

The first version copies each element out of the vector and then processes the copy, when no copy was needed. This increases the working set (both elements and each element copy are live), increasing pressure on the processor's caches and making it more likely that the program will trigger evictions and misses. In addition to the demands on the memory system, the program is also less productive in terms of CPU usage (copying isn't helping us process faster) and RAM usage (the extraneous copy isn't needed).

While this is just an example (and clang-tidy can recommend a fix for this one automatically), copies of much larger data structures, such as protobufs, commonly arise:

void Populate(MyProto* msg, int n) {
  for (int i = 0; i < n; ++i) {
    MySubmessage s;
    s.set_value(i);
    s.mutable_foo()->set_bar(i);
    *msg->add_submessage() = s;  // Copies s, including its heap-allocated submessage.
  }
}

Populating data in its final destination can minimize the working set size and improve efficiency:

void Populate(MyProto* msg, int n) {
  for (int i = 0; i < n; ++i) {
    MySubmessage& s = *msg->add_submessage();  // Places s in its destination.
    s.set_value(i);
    s.mutable_foo()->set_bar(i);
  }
}

Similarly, many containers provide amortized constant-time growth, growing by a multiplicative factor to achieve this guarantee. With each growth step, we allocate a new buffer, move the elements over, and deallocate the old one. When we know the final size, a call to reserve can avoid these copies and reduce our working set size.
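For instance, with std::vector (a minimal sketch):

#include <vector>

std::vector<int> MakeSquares(int n) {
  std::vector<int> squares;
  squares.reserve(n);  // One allocation up front; no growth-triggered moves.
  for (int i = 0; i < n; ++i) {
    squares.push_back(i * i);
  }
  return squares;
}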

Densify data

As explained above, allocating objects contiguously can be particularly beneficial for memory bandwidth, especially if the access patterns require us to traverse the objects in sequence. However, the benefits only materialize to the extent that the traversed objects are small enough to pack neatly into cache lines.

In general, densifying data in a way that matches its access patterns can reduce an application's memory bandwidth consumption. This lets us use each cache line we access to its fullest, and it also helps because prefetchers tend to bring contiguous or easily predicted cache lines into the cache.

Based on this principle and its implications, consider the following guidelines when writing an application:

  • Compact data structures as much as possible by reducing padding and eliminating fields that are never accessed.
  • When processing large amounts of data, narrower data types can increase data density when values have constrained ranges. Integers over a small range might fit in an int8_t or int16_t rather than an int64_t, and a small set of values is better represented by a narrow enum than by a std::string.
  • If some fields are accessed much more frequently than others, consider splitting the object into two distinct classes (e.g. "hot" and "cold") so that arrays of the hotter fields/objects sit closer together in memory (see the sketch after this list).
  • In the extreme of this technique, prefer a Structure-of-Arrays (SoA) layout over an Array-of-Structures (AoS) layout for your program's data structures.
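A hedged sketch of the hot/cold split from the third guideline (the types and fields below are hypothetical):

#include <cstdint>
#include <string>
#include <vector>

// Per-item counters are read on every request; descriptive metadata is only
// consulted when rendering a debug page.
struct HotCounters {
  uint32_t hits = 0;
  uint32_t misses = 0;
};
struct ColdMetadata {
  std::string description;
  std::string owner;
};

// Parallel vectors indexed by the same id keep the hot fields densely packed,
// so a scan over the counters touches far fewer cache lines than a scan over
// a combined struct would.
std::vector<HotCounters> counters;
std::vector<ColdMetadata> metadata;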

As a concrete example of SoA vs. AoS, consider the following code snippet (AoS):

struct Point3D {
  float x;
  float y;
  float z;
};

float AddAllY(const std::vector<Point3D>& elements) {
  float result = 0;
  for (const auto& element : elements) {
    result += element.y;
  }
  return result;
}

A more cache-friendly alternative could be the following (SoA):

struct PointList3D {
  std::vector<float> x;
  std::vector<float> y;
  std::vector<float> z;
};

float AddAllY(const PointList3D& elements) {
  float result = 0;
  for (auto element : elements.y) {  // Traverse the array of "y" coordinates.
    result += element;
  }
  return result;
}

Summary

As with CPU and RAM usage, memory bandwidth is another dimension that we can optimize for, maximizing the amount of useful work done for a given amount of memory bandwidth. Reducing cache misses and improving data locality can also improve CPU performance and reduce required RAM footprints.
