Performance Tip of the Week #83: Reducing memory indirections

Originally posted as Fast TotW #83 on June 17, 2024

By Chris Kennelly

Updated 2024-08-15

Quicklink: abseil.io/fast/83

Memory indirections are a frequent cause of latency and memory bandwidth bottlenecks. By forcing the processor to chase pointers to reach the data it actually needs, our programs incur slowdowns and consume more memory bandwidth than a more efficient layout would require. In this episode, we discuss tools for identifying inefficient data structures and improving them.

Latency and throughput

In Google’s fleet, our processors spend 40-50% of their time waiting for data coming from their caches or main memory. Memory access latency and throughput are inextricably linked:

  • As the software makes more accesses, each access is another roll of the dice that may come up as a cache miss. Further, as the rate of accesses increases, the core sees higher access latencies due to memory bandwidth saturation.
  • For every access, the program incurs more latency if the processor needs to fetch the data from an outer-level cache or from main memory.

We can tackle both of these factors as we think about how to lay out data in our programs. We can work to reduce the number of accesses, or choose layouts that increase the likelihood that the data we need is already in cache. Improving this allows us to do more useful work with fewer CPUs.

Helping the hardware help us

PMU events are proxies that we use for understanding accesses and for identifying changes that make code friendlier to the cache hierarchy. Consider a linked list: The individual nodes are unlikely to be physically contiguous, since they are allocated separately on the heap. Without that spatial locality, the hardware prefetcher can't help us hide access latency (at best) and might even issue accesses to (logically) unrelated, neighboring cachelines that our program won't need, wasting bandwidth.
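
As a minimal sketch of that effect (hypothetical Node type, unrelated to the benchmark below), a list traversal is a chain of dependent loads whose addresses the hardware cannot predict, while a contiguous array gives the prefetcher a regular pattern to run ahead of:

  #include <cstdint>
  #include <vector>

  struct Node {
    uint64_t value;
    Node* next;  // heap-allocated nodes rarely sit next to one another
  };

  // Each iteration must load p->next before it knows where the following read
  // will go: the loads are serially dependent and hard to prefetch.
  uint64_t SumList(const Node* head) {
    uint64_t sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next) sum += p->value;
    return sum;
  }

  // Contiguous storage: addresses advance by a fixed stride, so the hardware
  // prefetcher can bring data into cache before it is needed.
  uint64_t SumVector(const std::vector<uint64_t>& v) {
    uint64_t sum = 0;
    for (uint64_t x : v) sum += x;
    return sum;
  }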

While Arenas are an incredibly useful tool, it's best to apply them after optimizing the data structure, not before. A more convoluted data structure that happens to be Arena-allocated is likely worse than using the best data structure in the first place.

Surveying data structures

Consider the problem of finding an element in a std::list, a std::map, an absl::node_hash_map, or an absl::flat_hash_map. We can construct the benchmark in two ways: doing lookups one after another with no dependency between them, or chaining the result of one lookup into the key for the next to suss out latency characteristics. The benchmark below implements the latter.

#include <algorithm>
#include <cstdint>
#include <limits>
#include <list>
#include <map>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

#include "absl/container/flat_hash_map.h"
#include "absl/container/node_hash_map.h"
#include "absl/log/absl_check.h"
#include "absl/random/random.h"
#include "benchmark/benchmark.h"

template<typename T>
void BM_LookupSpeed(benchmark::State& state) {
  using ValueType = typename T::value_type;

  const int size = std::max<int>(1, state.range(0) / sizeof(ValueType));

  // Choose `size` random integers, then populate our container with them.
  absl::BitGen rng;
  std::vector<std::remove_cv_t<typename ValueType::first_type>> v;
  v.reserve(size);
  for (int i = 0; i < size; ++i) {
    v.push_back(absl::Uniform(rng, std::numeric_limits<typename ValueType::first_type>::min(),
                              std::numeric_limits<typename ValueType::first_type>::max()));
  }

  T container;
  for (auto u : v) {
    if constexpr (std::is_same<T, std::list<ValueType>>::value) {
      container.emplace_back(u, u);
    } else {
      container.emplace(u, u);
    }
  }

  const auto begin = container.begin();
  const auto end = container.end();

  // Use a small state PRNG for selecting indices to avoid confounding access
  // counts that might occur with larger state RNGs (like Mersenne Twister).
  std::minstd_rand lcg;
  uint32_t last = 0;
  uint64_t sum = 0;
  for (auto _ : state) {
    const size_t index = (lcg() + last) % size;
    const auto value = v[index];

    typename T::const_iterator it;
    if constexpr (std::is_same<T, std::list<ValueType>>::value) {
      it = std::find(begin, end, std::make_pair(value, value));
    } else {
      it = container.find(value);
    }
    ABSL_DCHECK(it != end);
    auto found = it->second;
    sum += found;

    // Introduce a deliberate dependency from this lookup on the key for the
    // next value.
    last = found;
  }

  ABSL_CHECK_NE(sum, 0);
}

// std::list is O(N), so reduce the data set size 100x for reasonable running
// time.  The reported numbers are per operation, so they are still comparable.
BENCHMARK_TEMPLATE(BM_LookupSpeed, std::list<std::pair<uint32_t, uint32_t>>)
   ->Range(1, benchmark::CPUInfo::Get().caches.back().size / 100);
BENCHMARK_TEMPLATE(BM_LookupSpeed, std::map<uint32_t, uint32_t>)
   ->Range(1, benchmark::CPUInfo::Get().caches.back().size);
BENCHMARK_TEMPLATE(BM_LookupSpeed, absl::node_hash_map<uint32_t, uint32_t>)
   ->Range(1, benchmark::CPUInfo::Get().caches.back().size);
BENCHMARK_TEMPLATE(BM_LookupSpeed, absl::flat_hash_map<uint32_t, uint32_t>)
   ->Range(1, benchmark::CPUInfo::Get().caches.back().size);

We can run this benchmark while concurrently gathering PMU counters.

The full output from the benchmark:

name                                                              CYCLES/op
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/1          44.8 ±27%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/8          39.5 ±22%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/64         71.4 ± 6%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/512         181 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/4096      1.08k ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/32768     13.2k ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/262144     102k ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/403701     164k ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/1                      44.2 ±12%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/8                      47.1 ±21%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/64                     84.1 ± 8%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/512                     126 ± 3%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/4096                    162 ± 2%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/32768                   247 ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/262144                  426 ± 2%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/2097152                 780 ± 2%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/16777216              1.93k ± 2%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/40370176              2.53k ± 3%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/1           0.25 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/8           0.25 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/64          81.0 ± 9%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/512         76.8 ± 4%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/4096        77.0 ± 1%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/32768        103 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/262144       118 ± 1%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/2097152      289 ± 3%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/16777216     572 ± 2%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/40370176     871 ± 2%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/1           0.25 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/8           0.25 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/64          71.3 ± 9%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/512         72.1 ± 2%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/4096        71.7 ± 1%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/32768       86.5 ± 1%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/262144       102 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/2097152      210 ± 3%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/16777216     331 ± 3%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/40370176     546 ± 9%

name                                                              MEM_INST_RETIRED:ALL_LOADS/op
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/1          3.00 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/8          3.00 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/64         10.0 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/512        66.0 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/4096        514 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/32768     4.10k ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/262144    32.7k ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/403701    50.5k ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/1                      5.00 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/8                      5.00 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/64                     9.60 ± 7%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/512                    15.5 ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/4096                   21.5 ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/32768                  27.7 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/262144                 33.7 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/2097152                39.8 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/16777216               45.9 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/40370176               48.5 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/1           0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/8           0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/64          8.20 ± 4%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/512         8.00 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/4096        8.01 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/32768       8.01 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/262144      8.01 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/2097152     8.01 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/16777216    8.01 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/40370176    8.01 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/1           0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/8           0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/64          7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/512         7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/4096        7.01 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/32768       7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/262144      7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/2097152     7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/16777216    7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/40370176    7.01 ± 0%

name                                                              MEM_LOAD_RETIRED.L1_MISS/op
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/1          0.00 ±NaN%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/8          0.00 ±NaN%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/64         0.00 ±NaN%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/512        0.00 ±NaN%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/4096       0.01 ±22%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/32768       109 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/262144      705 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/403701    1.04k ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/1                      0.00 ±NaN%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/8                      0.00 ±NaN%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/64                     0.00 ±NaN%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/512                    0.00 ±NaN%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/4096                   0.23 ± 4%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/32768                  5.52 ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/262144                 9.45 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/2097152                12.9 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/16777216               16.3 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/40370176               17.7 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/1           0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/8           0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/64          0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/512         0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/4096        0.00 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/32768       2.30 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/262144      3.58 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/2097152     3.75 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/16777216    3.77 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/40370176    3.77 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/1           0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/8           0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/64          0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/512         0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/4096        0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/32768       1.20 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/262144      2.55 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/2097152     2.74 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/16777216    2.76 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/40370176    2.77 ± 0%

While there are clear differences in these containers due to their asymptotic complexity, the memory subsystem adds an appreciable scaling factor to their performance as well.

  • The microbenchmark ALL_LOADS data matches our expectations. At the 256KB data point, the list holds 32K std::pair<uint32_t, uint32_t> elements. The benchmark needs to scan half of those on average to do a lookup, and visiting each node requires loading both its value and the pointer to the next node, doubling the number of accesses to roughly 32K, in line with the measured ~32.7k loads. This control data (the next pointers) occupies cache and bandwidth but is not intrinsically useful to our program. The number of memory accesses (MEM_INST_RETIRED:ALL_LOADS) per lookup differs by data structure: for the hashtables, each lookup in our microbenchmark makes 4 accesses to our data, 1 for the vector of values (part of our harness), 1 for the SwissMap's control bytes, 1 for checking the key match, and 1 for finally retrieving the value. absl::node_hash_map pays for an additional indirection here, as sketched after this list.
  • L1_MISS lets us differentiate the efficiency of accesses in terms of how many distinct cachelines the program touches. Many of the program's loads are to the same cacheline: a list node's next pointer and its value sit together, and a table entry's key and mapped value sit together when the program finally looks up the value.
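
The extra indirection in the node-based table can be pictured with a simplified slot layout. This is only a sketch of the access pattern, not the actual SwissMap implementation:

  #include <cstdint>
  #include <utility>

  // Simplified sketch, not the real container internals.
  struct FlatSlot {
    // absl::flat_hash_map stores the pair directly in the table's slot array,
    // so checking the key and reading the value usually touch the same
    // cacheline.
    std::pair<uint32_t, uint32_t> kv;
  };

  struct NodeSlot {
    // absl::node_hash_map stores a pointer to a separately allocated node, so
    // reaching the key or value costs one more dependent load, typically to a
    // different cacheline.
    std::pair<const uint32_t, uint32_t>* node;
  };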

Finding unneeded indirections

To find unneeded indirections, we focus on heap allocations and their access patterns, since many accesses are to these data structures. Globals tend to have less complexity: A command line flag holding a value is fairly common, but a complex, allocation-free data structure is comparatively rare.

We are also looking for allocations that are accessed frequently. This is a qualitative, rather than quantitative, threshold that depends on the circumstances. Flattening a data structure that is too cold might prove too costly in terms of RAM, or even harmful to cache efficiency.

Combining these two factors, we are often looking for places where one object's lifetime is a superset of another's. For example, a std::unique_ptr member variable might be allocated in its owner's constructor and freed in its destructor, so the pointee can live inline in the owner instead. In contrast, something held in a std::shared_ptr might have an indeterminate lifetime, making it less suitable for coupling the two objects' storage.
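
As a hedged sketch of this kind of flattening (hypothetical Widget and Config types, not from a real codebase), the member's lifetime exactly matches its owner's, so the extra allocation and pointer chase buy us nothing:

  #include <memory>

  struct Config { int limit = 0; };  // hypothetical payload

  // Before: a separate heap allocation, and every access chases config_.
  class Widget {
   public:
    Widget() : config_(std::make_unique<Config>()) {}
    int limit() const { return config_->limit; }

   private:
    std::unique_ptr<Config> config_;  // lives exactly as long as the Widget
  };

  // After: the Config is stored inline, so reading limit() needs no extra load
  // and the data shares the Widget's cachelines.
  class FlatWidget {
   public:
    int limit() const { return config_.limit; }

   private:
    Config config_;
  };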

Allocations

Small, frequently made allocations are good candidates for optimization.

  • A small allocation is cacheline-inefficient, since we need storage for both the pointer to the allocation and the object itself (and potentially unrelated, neighboring data too).
  • Short lifetimes bound the waste if we don't end up using all of the data. They also give us a rough proxy for access frequency: the bytes of every allocation should generally be touched at least once, otherwise we could have made the allocation smaller. Containers with amortized growth (like std::vector) might be larger than needed by a small constant factor (typically 2x), but a large constant factor would only occur if we called reserve with too large an argument.

We can find these by using TCMalloc’s deallocation profiler to explicitly filter on lifetime, or combine its allocation profile with heap profiles.

Examining frequent allocations can also help find opportunities for eliminating the allocation altogether.

  // Before: heap-allocate the stats object, then free it after the call.
  rpc::StreamStatsInfo* stats_info = new rpc::StreamStatsInfo;
  // ...populate stats_info...
  data.stats_info = stats_info;
  // ...populate data...
  GetStreamMultiplexer()->HandleBytes(&data);
  delete stats_info;

  // After: place the stats object on the stack for the duration of the call.
  rpc::StreamStatsInfo stats_info;
  // ...populate stats_info...
  data.stats_info = &stats_info;
  // ...populate data...
  GetStreamMultiplexer()->HandleBytes(&data);
  // stats_info destroyed when it goes out of scope

stats_info was only 120 bytes, so accessing its fields through data.stats_info might entail up to 4 cachelines: 1 for data.stats_info itself, and 2-3 for *stats_info depending on how it was aligned. Using the stack doesn't zero out the object's cache footprint, but those cachelines are far more likely to already be resident, and their contiguity with the rest of the stack frame makes them far easier to prefetch if they are not.

In another case, we replaced a std::vector with an absl::InlinedVector:

  // Before: new_tables is always heap-allocated, even when it is tiny.
  std::vector<const Table*> new_tables(count_of_new_objects);
  const Table* new_table = to;
  for (int i = count_of_new_objects - 1; i >= 0; i--) {
    new_tables[i] = new_table;
    new_table = new_table->parent();
  }

  // After: up to 4 entries are stored inline, avoiding the allocation.
  absl::InlinedVector<const Table*, 4> new_tables(count_of_new_objects);
  const Table* new_table = to;
  for (int i = count_of_new_objects - 1; i >= 0; i--) {
    new_tables[i] = new_table;
    new_table = new_table->parent();
  }

The vector was only used for the duration of the function call, and allocation profiles showed that all allocations were 32 bytes or smaller, that is, at most 4 pointers. While we can't statically guarantee that bound, absl::InlinedVector gracefully falls back to a heap allocation when we need more elements.

Summary

Data indirections are a frequent source of cache misses that can hurt performance. Using profiling tools, we can identify allocations that cause these misses and choose better data structures to reduce them.

