Performance Tip of the Week #26: Fixing things with hashtable profiling

Originally posted as Fast TotW #26 on May 20, 2020

By Chris Kennelly and Zie Weaver

Updated 2025-07-16

Quicklink: abseil.io/fast/26

As discussed by Matt Kulukundis in his 2019 CppCon talk, identifying a “slow” hashtable from a pure CPU profile can be challenging. Abseil’s C++ hashtables have a built-in profiler. In this episode we describe the insights it can provide into hash function quality and hash collisions, making these problems discernible at scale. We also look at a couple of case studies where this information was used to improve the efficiency of Google’s production fleet.

Starting the analysis

We can gather hashtable profiles from a server in production and load them into pprof.

This data can be used to identify problematic hashtables. Bad hashtables usually stem from a low-entropy hash function or from using the same hash function both for the table and for sharding across instances. These problems can be addressed by:

  1. Using absl::Hash to provide a well-mixed hash function.
  2. Using a distinct hash function by salting the hash.
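
Both techniques appear in the case studies below. As a quick illustration, here is a minimal sketch (the keys and the salt constant are arbitrary):

#include <cstdio>
#include <utility>
#include "absl/hash/hash.h"

int main() {
  // 1. absl::Hash mixes all input bits, so nearby keys get unrelated hashes.
  absl::Hash<int> h;
  printf("%zx %zx\n", h(1), h(2));

  // 2. Salting the key with a constant yields a distinct hash function,
  // e.g. for choosing a shard independently of the table's own hash.
  absl::Hash<std::pair<int, int>> salted;
  printf("%zu\n", salted({1, 42}) % 256);
}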

Finding stuck bits

Stuck bits are hash bits that take the same value for every element in the table. Each stuck bit halves the number of distinct hash values available, so the probability of a collision grows exponentially with the number of stuck bits.
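
The detection itself is cheap. Here is a minimal sketch of the idea (the hash values are made up), using the running-AND/running-OR formulation described under “Hashtable statistics” below:

#include <cstdint>
#include <cstdio>

int main() {
  // Three hypothetical hash values whose low byte is always 0xf1.
  const uint64_t hashes[] = {0x12f1, 0x40f1, 0x9cf1};
  uint64_t running_and = ~uint64_t{0};  // bits seen as 1 in every hash
  uint64_t running_or = 0;              // bits seen as 1 in any hash
  for (uint64_t h : hashes) {
    running_and &= h;
    running_or |= h;
  }
  // stuck = (bits always 1) | (bits never 1). With such a tiny sample the
  // high bits also read as stuck, so we report only the low 16 bits.
  const uint64_t stuck = running_and | ~running_or;
  printf("stuck bits (low 16): 0x%04x\n", unsigned(stuck & 0xffff));
}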

We can look at a few examples to demonstrate what causes stuck bits.

Case study: InternalSubscriber::MessageReceived

This function showed up multiple times when searching for stuck bits. It’s in a client library used in many servers.

We see an insertion into an indexed array called pending_ack_callbacks_:

pending_ack_callbacks_[i].shard.insert({ack_cb, ack_callback});

Its definition is an array of hashtables:

struct {
  ABSL_CACHELINE_ALIGNED absl::flat_hash_map<Closure*, Closure*> shard;
} pending_ack_callbacks_[kNumPendingAckShards];

where kNumPendingAckShards = 256. Suspiciously, our stuck bits were 0xff = kNumPendingAckShards - 1. We choose our shard based on the hash:
/* static */
int InternalSubscriber::ClosureAddressToShard(Closure *address) {
  absl::flat_hash_set<Closure*>::hasher h;
  return h(address) % kNumPendingAckShards;
}

Within any given shard’s hashtable, the bottom eight bits of every element’s hash are identical by construction. The hashtable uses these same bits to determine where to insert new values, so we see worse performance due to collisions.

This was fixed by salting the hash that determines sharding:

absl::Hash<std::pair<Closure*, int>> h;
return h(std::pair(address, 42)) % kNumPendingAckShards;

This salting technique provides distinct hash functions for shard lookup and table lookup. This halved the cost of insertions in this library.
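
Putting the pattern together, here is a minimal sketch of a sharded map whose shard selector uses a salted hash distinct from the hash the per-shard tables use internally (all names are hypothetical):

#include <utility>
#include "absl/container/flat_hash_map.h"
#include "absl/hash/hash.h"

constexpr int kNumShards = 256;

class ShardedMap {
 public:
  void Insert(void* key, int value) {
    shards_[ShardIndex(key)].map.insert({key, value});
  }

 private:
  static int ShardIndex(void* key) {
    // Salting with a constant yields a hash function distinct from the one
    // the per-shard tables use, so no hash bits are stuck within a shard.
    return static_cast<int>(
        absl::Hash<std::pair<void*, int>>{}({key, 42}) % kNumShards);
  }

  struct Shard {
    absl::flat_hash_map<void*, int> map;  // hashes with absl::Hash<void*>
  };
  Shard shards_[kNumShards];
};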

Case study: ProcessWatchDone

In a different library, we found an unusual pattern of stuck bits: 0x4030802800000000. Before the fix, we found these declarations:

typedef absl::node_hash_map<NamespaceKey, int64, NamespaceKeyHash>
    NsKeyRequestIdMap;

Since the code previously used std::unordered_map and NamespaceKey wasn’t hashable with std::hash, it specified a custom hasher, NamespaceKeyHash, implemented using GoodFastHash:

size_t NamespaceKeyHash::operator()(const NamespaceKey& ns_key) const {
  return GoodFastHash<NamespaceKey>()(ns_key);
}

NamespaceKey is a pair of Cords. But GoodFastHash does not properly mix bits for std::pair:

typedef std::pair<absl::Cord, absl::Cord> NamespaceKey;

Migrating this to absl::Hash fixed the problem by properly mixing both components of the std::pair.
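
As a sketch, the migrated declaration might look like this (the int64 mapped type comes from the original typedef; the rest is assumed):

#include <cstdint>
#include <utility>
#include "absl/container/node_hash_map.h"
#include "absl/strings/cord.h"

using NamespaceKey = std::pair<absl::Cord, absl::Cord>;

// absl::node_hash_map defaults its hasher to absl::Hash, which natively
// supports both std::pair and absl::Cord and mixes all of their bits, so
// no custom hasher is needed.
using NsKeyRequestIdMap = absl::node_hash_map<NamespaceKey, int64_t>;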

Finding tables with many collisions

When a hashtable has many collisions, it must probe more slots to find a particular element. In the worst case, lookup performance degrades from O(1) toward O(n), and each extra probe may require an uncached memory load, hurting performance further.

A perfect probe length is 0, which means we found the object right where we first looked. If the average probe length is greater than 0.1, then the table is probing more often than should be necessary as ~10% of keys are encountering collisions. We can calculate average probe length by dividing the total_probe_length by the sum of the num_erases and size (these two numbers capture the total number of elements that have been inserted into the table).
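
As a concrete check, that computation looks like this (a sketch; the field names mirror the statistics tags described below):

#include <cstdio>

// num_erases + size is the number of elements inserted since the last
// rehash, so this yields the mean probe length per inserted element.
double AverageProbeLength(double total_probe_length, double num_erases,
                          double size) {
  return total_probe_length / (num_erases + size);
}

int main() {
  // A table with 840 total probes over 100 inserted elements and no erases
  // averages 8.4 probes, matching the case study below.
  printf("%.1f\n", AverageProbeLength(840.0, 0.0, 100.0));
}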

Case study: PortMapperImpl::UpdateActiveMap

This table showed a max probe length of 133 and an average probe length of 8.4: many lookups were effectively performing a linear search.

We can look at where we are mutating the table:

// Allocate new ConnLabel element and update counters
iter = active_map_->find_or_insert(*flow);
iter->second = new ConnLabel;

active_map_’s definition points us to:

typedef priority_hash_map<IpFlow,
                          ConnLabel*,
                          IpFlowPrioritizer> ClientPortMap;

priority_hash_map is a custom type wrapping a SwissMap, but with less-than-ideal defaults:

template<class _Key,
         class _Val, // NOLINT - "Failed to find complete declaration of
                     //           class _Val  [build/class] [5]"
         class _PriorityFunc,
         class _PriorityKey = int,
         class _PriorityHash = hash<_PriorityKey>,
         class _PriorityEqual = std::equal_to<_PriorityKey>,
         class _PriorityCompare = std::less<_PriorityKey>,
         class _KeyHash = hash<_Key>,
         class _KeyEqual = std::equal_to<_Key> >
class priority_hash_map {
 public:
  typedef absl::flat_hash_map<_Key, _Val, _KeyHash, _KeyEqual> hash_map_type;
  ...

_PriorityHash and _KeyHash default to hash<>, for which the code provides a custom specialization:

template<> struct hash<IpFlow> {
  size_t operator()(const IpFlow& flow) const {
    return flow.remote_ip() ^
           (flow.proto() << 24) ^ flow.local_port() ^
           (flow.remote_port() << 16);
  }
};

The xor’d fields provide poor mixing. For example, all you’d need for a collision is an IP 10.0.0.33 on port 2038 and another IP 10.0.0.32 on port 2039: under xor, the differing low bit of the IP cancels the differing low bit of the port. To fix this, we implemented AbslHashValue for IpFlow so priority_hash_map could switch to absl::Hash:

// Allow this to be used as a hash map key.
template <typename H>
friend H AbslHashValue(H h, const IpFlow& flow) {
  return H::combine(std::move(h), flow.remote_ip(), flow.proto(),
                    flow.local_port(), flow.remote_port());
}
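
To see the original collision concretely, here is a standalone check of the example above (a sketch: IpFlow fields are modeled as plain integers, and the proto and remote-port values are made up):

#include <cstdint>
#include <cstdio>

// The old hash, with IPs as 32-bit integers: 10.0.0.33 is 0x0a000021 and
// 10.0.0.32 is 0x0a000020.
uint32_t OldHash(uint32_t remote_ip, uint32_t proto, uint32_t local_port,
                 uint32_t remote_port) {
  return remote_ip ^ (proto << 24) ^ local_port ^ (remote_port << 16);
}

int main() {
  // Flipping the low bit of the IP and the low bit of the port cancel out
  // under xor, so the two flows hash identically.
  printf("0x%08x\n", OldHash(0x0a000021, 6, 2038, 80));
  printf("0x%08x\n", OldHash(0x0a000020, 6, 2039, 80));
}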

Hashtable statistics

For each profiled hashtable, we record statistics in several tags:

  • capacity tells the exact capacity of the hashtable.
  • size is the current number of elements in the hashtable.
  • The probe length statistics (max_probe_length, total_probe_length) tell us how many extra lookups were needed to find an element. In an ideal hashtable, we find the element we are looking for on the first lookup (probe length = 0). If there are collisions, we’ll have a higher probe length. max_probe_length tells us the worst case probe length for any element in the hashtable. total_probe_length is the sum of probe lengths for all elements of the hashtable.
  • stuck_bits is a bitmask that will indicate any bits for which the hash codes in the table have only seen one value (either zero or one). For a good hash function this should be 0 for any table >10 elements. This is implemented as a (running & of all hashes) | ~(running | of all hashes). A stuck bit indicates that our hash function may not be providing sufficient entropy.
  • num_rehashes records the number of times the table has been rehashed. Calling reserve with an appropriate size (close to the true size of the table) can avoid rehashes; see the sketch after this list.
  • max_reserve records the maximum size passed to a call to reserve for the instance, or 0 if no such call has occurred. This can be useful for identifying tables that are too large (size << capacity) because reserve was called with too large a size. Similarly, tables with many rehashes can save on rehashing costs by calling reserve with a sufficient size.
  • num_erases is the number of elements that have been erased from the hashtable (since the last rehash). The sum of num_erases and size indicates the number of elements added to the table since the last rehash.
  • inline_element_size is the size of each element stored in the table’s backing array. For flat hashtables, this is the size of the key-value pair. For node hashtables, this is the size of the pointer to the key-value pair.
  • key_size is equal to sizeof(key_type) for the table.
  • value_size is equal to sizeof(value_type) for the table. On sets, sizeof(key_type) == sizeof(value_type). On maps, value_type holds both key_type and mapped_type with appropriate padding for alignment.
  • soo_capacity is the number of elements in the table’s small-object optimization (SOO).
  • table_age reports the age of the table in microseconds.
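
As a sketch of the reserve advice above (the element count is arbitrary):

#include "absl/container/flat_hash_map.h"

void BuildMap() {
  absl::flat_hash_map<int, int> m;
  // A single up-front reserve: max_reserve records 1000, and the insert
  // loop triggers no rehashes, so num_rehashes stays 0.
  m.reserve(1000);
  for (int i = 0; i < 1000; ++i) m.insert({i, i * i});
}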

Detecting bad hashtables in unit tests

One additional safeguard: we can run the profiler while running our tests by setting the environment variable ABSL_CONTAINER_FORCE_HASHTABLEZ_SAMPLE=1.

Since a detection triggers a test failure, the alerting threshold is fairly high. Note that unit tests may not insert enough elements into their hashtables to produce collisions.

Closing words

Finding issues in hashtables from a CPU profile alone is challenging since opportunities may not appear prominently in the profile. SwissMap’s built-in hashtable profiling lets us peer into a number of hashtable properties to find low-quality hash functions, collisions, or excessive rehashing.

