Originally posted as Fast TotW #83 on June 17, 2024
Updated 2024-08-15
Quicklink: abseil.io/fast/83
Memory indirections are a frequent cause of latency and memory bandwidth bottlenecks. By forcing the processor to follow pointers to get to the useful data it needs, our programs incur slowdowns and consume more memory bandwidth than they would with a more efficient layout. In this episode, we discuss tools for identifying inefficient data structures and improving them.
In Google's fleet, our processors spend 40-50% of their time waiting for data coming from their caches or main memory. Memory access latency and throughput are inextricably linked.
We can tackle both of these factors as we think about how to lay out data in our programs: we can reduce the number of accesses outright, or choose layouts that increase the likelihood that the data we need is already in cache. Improving data layout allows us to do more useful work with fewer CPUs.
PMU events are proxies that we use for understanding accesses and for identifying code changes that make a program friendlier to the cache hierarchy. Consider a linked list: the individual nodes are unlikely to be physically contiguous, since each is allocated separately on the heap. Without that spatial correlation, the hardware prefetcher can't help us hide access latency (at best), and might even issue accesses to (logically) unrelated, neighboring cachelines that our program won't need, wasting bandwidth (at worst).
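To make the contrast concrete, here is a minimal sketch; the `Node` type is hypothetical and not part of the benchmark below. In the list traversal, each load address depends on the previous load, while the vector's addresses are predictable regardless of the data:

```cpp
#include <vector>

// Hypothetical singly-linked list node: the address of the next element is
// only known once the current node's cacheline has arrived.
struct Node {
  int value;
  Node* next;
};

int SumList(const Node* head) {
  int sum = 0;
  // Pointer chasing: the prefetcher cannot run ahead of the traversal.
  for (const Node* n = head; n != nullptr; n = n->next) sum += n->value;
  return sum;
}

int SumVector(const std::vector<int>& v) {
  int sum = 0;
  // Sequential addresses: the hardware prefetcher can stream cachelines in
  // before the loop needs them.
  for (int x : v) sum += x;
  return sum;
}
```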
While Arenas are an incredibly useful tool, it's best to apply them after optimizing the data structure, not before: a more convoluted but Arena-allocated data structure is likely worse than simply using the best data structure in the first place.
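As a minimal sketch of what arena allocation looks like with protocol buffers (`MyMessage` stands in for any generated message type, and `set_id` is a hypothetical field):

```cpp
#include "google/protobuf/arena.h"

void HandleRequest() {
  google::protobuf::Arena arena;
  // Messages created on the arena are packed into its blocks, improving
  // locality, and are all freed together when the arena is destroyed.
  MyMessage* request = google::protobuf::Arena::Create<MyMessage>(&arena);
  request->set_id(42);  // Hypothetical field, for illustration only.
  // ... use *request ...
}  // No per-message delete: the arena reclaims everything at once.
```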
Consider the problem of finding an element in a `std::list`, a `std::map`, an `absl::node_hash_map`, or an `absl::flat_hash_map`. We can construct the benchmark in two ways: doing lookups one after another with no dependency between them, or chaining the output of one lookup to influence the next to suss out latency characteristics.
```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <list>
#include <map>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

#include "absl/container/flat_hash_map.h"
#include "absl/container/node_hash_map.h"
#include "absl/log/absl_check.h"
#include "absl/random/random.h"
#include "benchmark/benchmark.h"

template <typename T>
void BM_LookupSpeed(benchmark::State& state) {
  using ValueType = typename T::value_type;
  const int size = std::max<int>(1, state.range(0) / sizeof(ValueType));
  // Choose `size` random integers, then populate our container with them.
  absl::BitGen rng;
  std::vector<std::remove_cv_t<typename ValueType::first_type>> v;
  v.reserve(size);
  for (int i = 0; i < size; ++i) {
    v.push_back(absl::Uniform(
        rng, std::numeric_limits<typename ValueType::first_type>::min(),
        std::numeric_limits<typename ValueType::first_type>::max()));
  }
  T container;
  for (auto u : v) {
    if constexpr (std::is_same<T, std::list<ValueType>>::value) {
      container.emplace_back(u, u);
    } else {
      container.emplace(u, u);
    }
  }
  const auto begin = container.begin();
  const auto end = container.end();

  // Use a small state PRNG for selecting indices to avoid confounding access
  // counts that might occur with larger state RNGs (like Mersenne Twister).
  std::minstd_rand lcg;
  uint32_t last = 0;
  uint64_t sum = 0;
  for (auto _ : state) {
    const size_t index = (lcg() + last) % size;
    const auto value = v[index];

    typename T::const_iterator it;
    if constexpr (std::is_same<T, std::list<ValueType>>::value) {
      it = std::find(begin, end, std::make_pair(value, value));
    } else {
      it = container.find(value);
    }
    ABSL_DCHECK(it != end);
    auto found = it->second;
    sum += found;
    // Introduce a deliberate dependency from this lookup on the key for the
    // next value.
    last = found;
  }

  ABSL_CHECK_NE(sum, 0);
}

// std::list is O(N), so reduce the data set size 100x for reasonable running
// time. The reported numbers are per operation, so they are still comparable.
BENCHMARK_TEMPLATE(BM_LookupSpeed, std::list<std::pair<uint32_t, uint32_t>>)
    ->Range(1, benchmark::CPUInfo::Get().caches.back().size / 100);
BENCHMARK_TEMPLATE(BM_LookupSpeed, std::map<uint32_t, uint32_t>)
    ->Range(1, benchmark::CPUInfo::Get().caches.back().size);
BENCHMARK_TEMPLATE(BM_LookupSpeed, absl::node_hash_map<uint32_t, uint32_t>)
    ->Range(1, benchmark::CPUInfo::Get().caches.back().size);
BENCHMARK_TEMPLATE(BM_LookupSpeed, absl::flat_hash_map<uint32_t, uint32_t>)
    ->Range(1, benchmark::CPUInfo::Get().caches.back().size);
```
We can run this benchmark while gathering PMU counters concurrently. (With the open-source benchmark library, similar counters can be collected via its libpfm-based `--benchmark_perf_counters` flag.)
The full output from the benchmark:
```
name CYCLES/op
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/1 44.8 ±27%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/8 39.5 ±22%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/64 71.4 ± 6%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/512 181 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/4096 1.08k ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/32768 13.2k ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/262144 102k ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/403701 164k ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/1 44.2 ±12%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/8 47.1 ±21%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/64 84.1 ± 8%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/512 126 ± 3%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/4096 162 ± 2%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/32768 247 ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/262144 426 ± 2%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/2097152 780 ± 2%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/16777216 1.93k ± 2%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/40370176 2.53k ± 3%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/1 0.25 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/8 0.25 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/64 81.0 ± 9%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/512 76.8 ± 4%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/4096 77.0 ± 1%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/32768 103 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/262144 118 ± 1%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/2097152 289 ± 3%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/16777216 572 ± 2%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/40370176 871 ± 2%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/1 0.25 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/8 0.25 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/64 71.3 ± 9%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/512 72.1 ± 2%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/4096 71.7 ± 1%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/32768 86.5 ± 1%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/262144 102 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/2097152 210 ± 3%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/16777216 331 ± 3%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/40370176 546 ± 9%
name MEM_INST_RETIRED:ALL_LOADS/op
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/1 3.00 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/8 3.00 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/64 10.0 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/512 66.0 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/4096 514 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/32768 4.10k ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/262144 32.7k ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/403701 50.5k ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/1 5.00 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/8 5.00 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/64 9.60 ± 7%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/512 15.5 ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/4096 21.5 ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/32768 27.7 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/262144 33.7 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/2097152 39.8 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/16777216 45.9 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/40370176 48.5 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/1 0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/8 0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/64 8.20 ± 4%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/512 8.00 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/4096 8.01 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/32768 8.01 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/262144 8.01 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/2097152 8.01 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/16777216 8.01 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/40370176 8.01 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/1 0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/8 0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/64 7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/512 7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/4096 7.01 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/32768 7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/262144 7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/2097152 7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/16777216 7.00 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/40370176 7.01 ± 0%
name MEM_LOAD_RETIRED.L1_MISS/op
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/1 0.00 ±NaN%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/8 0.00 ±NaN%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/64 0.00 ±NaN%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/512 0.00 ±NaN%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/4096 0.01 ±22%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/32768 109 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/262144 705 ± 0%
BM_LookupSpeed<std::list<std::pair<uint32_t, uint32_t>>>/403701 1.04k ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/1 0.00 ±NaN%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/8 0.00 ±NaN%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/64 0.00 ±NaN%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/512 0.00 ±NaN%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/4096 0.23 ± 4%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/32768 5.52 ± 1%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/262144 9.45 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/2097152 12.9 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/16777216 16.3 ± 0%
BM_LookupSpeed<std::map<uint32_t, uint32_t>>/40370176 17.7 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/1 0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/8 0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/64 0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/512 0.00 ±NaN%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/4096 0.00 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/32768 2.30 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/262144 3.58 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/2097152 3.75 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/16777216 3.77 ± 0%
BM_LookupSpeed<absl::node_hash_map<uint32_t, uint32_t>>/40370176 3.77 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/1 0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/8 0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/64 0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/512 0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/4096 0.00 ±NaN%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/32768 1.20 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/262144 2.55 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/2097152 2.74 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/16777216 2.76 ± 0%
BM_LookupSpeed<absl::flat_hash_map<uint32_t, uint32_t>>/40370176 2.77 ± 0%
```
While there are clear differences in these containers due to their asymptotic complexity, the memory subsystem adds an appreciable scaling factor to their performance as well.
The `ALL_LOADS` data correlates with our expectations. A 256KB list of `uint32_t` pairs has 32K elements. The benchmark needs to scan half of those (16K) on average to do a lookup, and each step of the scan reads both the value and the pointer to the next node, doubling the number of accesses to roughly 32K, which matches the measured 32.7k loads per operation. This control data is cached but not intrinsically useful to our program.

We need a different number of memory accesses (`MEM_INST_RETIRED:ALL_LOADS`) per lookup, depending on our data structure. For the hashtables, each lookup in our microbenchmark makes 4 accesses: 1 for the vector of values (part of our harness), 1 for the SwissMap's control bytes, 1 for checking the key match, and 1 for finally retrieving the value. `absl::node_hash_map` has an additional indirection here.

`L1_MISS` lets us differentiate the efficiency of these accesses in terms of how many distinct cachelines the program touches. Many of the program's loads are to the same cacheline: the list needs to access both the pointer to the next node and a value, and the maps need to access both the key and the mapped value when the program finally looks up the value.

To find unneeded indirections, we focus on heap allocations and their access patterns, since many accesses are to these data structures. Globals tend to have less complexity: a command line flag holding a value is fairly common, but a complex, allocation-free data structure is comparatively rare.
We are also looking for allocations that are accessed frequently. This is a qualitative, rather than quantitative, threshold that depends on the circumstances: flattening a data structure that is too cold might prove too costly in terms of RAM, or even harmful to cache efficiency.
Combining these two factors, we are often looking for places where the lifetime of one allocation is a superset of another's, allowing the two to be merged. For example, a `std::unique_ptr` member variable might be initialized at construction and freed in the destructor, so its pointee lives exactly as long as the containing object. In contrast, something held in a `std::shared_ptr` might have an indeterminate lifetime, making it less suitable for coupling the two lifetimes.
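As a minimal sketch of this kind of flattening (`Widget` and `Config` are hypothetical types):

```cpp
#include <memory>

struct Config {
  int max_requests = 0;
};

// Before: config_ is a separate heap allocation whose lifetime exactly
// matches the Widget's, so every access pays an extra pointer chase.
class Widget {
 public:
  Widget() : config_(std::make_unique<Config>()) {}
  int max_requests() const { return config_->max_requests; }

 private:
  std::unique_ptr<Config> config_;
};

// After: storing the Config inline puts it in the same allocation as the
// Widget, removing the indirection and the separate new/delete.
class FlatWidget {
 public:
  int max_requests() const { return config_.max_requests; }

 private:
  Config config_;
};
```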
Small, frequently made allocations are good candidates for optimization. A geometrically growing container (like `std::vector`) might be larger than needed by a small constant factor (typically 2x), but a large constant factor would only occur if we called `reserve` with too large an argument.

We can find these by using TCMalloc's deallocation profiler to explicitly filter on lifetime, or by combining its allocation profile with heap profiles.
Examining frequent allocations can also help find opportunities for eliminating the allocation altogether.
```cpp
rpc::StreamStatsInfo* stats_info = new rpc::StreamStatsInfo;
// ...populate stats_info...
data.stats_info = stats_info;
// ...populate data...
GetStreamMultiplexer()->HandleBytes(&data);
delete stats_info;
```
```cpp
rpc::StreamStatsInfo stats_info;
// ...populate stats_info...
data.stats_info = &stats_info;
// ...populate data...
GetStreamMultiplexer()->HandleBytes(&data);
// stats_info is destroyed when it goes out of scope.
```
`stats_info` was only 120 bytes, so accessing its fields through `data.stats_info` might entail 4 cachelines: 1 for `data.stats_info` itself, and 2-3 for `*stats_info` depending on how it was aligned. Using the stack doesn't zero out the cache footprint of the object, but it is far more likely that those cachelines will already be resident, and their contiguity with the rest of the stack makes them far more prefetchable if they are not.
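To see where the 2-3 cacheline figure comes from, here is a quick calculation assuming 64-byte cachelines (the helper below is ours, purely for illustration):

```cpp
#include <cstddef>

// Number of 64-byte cachelines an object of `size` bytes touches when it
// starts `offset` bytes past a cacheline boundary.
constexpr size_t CachelinesTouched(size_t offset, size_t size) {
  return (offset + size + 63) / 64;
}

static_assert(CachelinesTouched(0, 120) == 2);   // perfectly aligned
static_assert(CachelinesTouched(16, 120) == 3);  // straddles a third line
```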
In another case, we replaced a `std::vector` with an `absl::InlinedVector`:
```cpp
std::vector<const Table*> new_tables(count_of_new_objects);
const Table* new_table = to;
for (int i = count_of_new_objects - 1; i >= 0; i--) {
  new_tables[i] = new_table;
  new_table = new_table->parent();
}
```
```cpp
absl::InlinedVector<const Table*, 4> new_tables(count_of_new_objects);
const Table* new_table = to;
for (int i = count_of_new_objects - 1; i >= 0; i--) {
  new_tables[i] = new_table;
  new_table = new_table->parent();
}
```
The vector was only used for the duration of the function call, and allocation profiles showed that all of its allocations were 32 bytes or smaller. While we can't statically guarantee that bound, `absl::InlinedVector` gracefully handles the situation where we need more elements by falling back to a heap allocation.
Data indirections are a frequent source of cache misses that can hurt performance. Using profiling tools, we can identify allocations that cause these misses and choose better data structures to reduce them.