Originally posted as Fast TotW #26 on May 20, 2020
By Chris Kennelly and Zie Weaver
Updated 2025-07-16
Quicklink: abseil.io/fast/26
As discussed by Matt Kulukundis in his 2019 CppCon talk, identifying a “slow” hashtable from a pure CPU profile can be challenging. Abseil’s C++ hashtables have a built-in profiler. In this episode we describe what insights about the hash function quality and hash collisions it can provide, making them discernible at scale. We also look at a couple of case studies where this information was used to improve Google’s production fleet efficiency.
We can gather hashtable profiles by collecting them from a server in production and loading them into pprof.
The data can be used to identify problematic hashtables. Bad hashtables usually come from a low-entropy hash function or from using the same hash function both for the table and for sharding across instances. These can be addressed by using absl::Hash to provide a well-mixed hash function.

Stuck bits are bits that never change across every hash in a table. Naturally, this increases the probability of a collision: each stuck bit halves the number of distinct hash values, so the effect grows exponentially in the number of stuck bits.
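The stuck-bit formula described later in this post can be sketched in a few lines. StuckBits below is an illustrative standalone function, not the profiler's actual implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stuck-bit detector: a bit is "stuck" if it was 1 in every
// hash (it survives the running AND) or 0 in every hash (it survives the
// complement of the running OR).
uint64_t StuckBits(const std::vector<uint64_t>& hashes) {
  uint64_t all_and = ~uint64_t{0};  // bits seen as 1 in every hash
  uint64_t all_or = 0;             // bits seen as 1 in any hash
  for (uint64_t h : hashes) {
    all_and &= h;
    all_or |= h;
  }
  return all_and | ~all_or;
}
```

For example, a set of hashes whose low byte is always 0x12 reports all eight low bits as stuck (some stuck at 1, the rest stuck at 0), while hashes with well-mixed low bytes report none.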
We can look at a few examples to demonstrate causes of stuck hash functions.
InternalSubscriber::MessageReceived

This function showed up multiple times when searching for stuck bits. It’s in a client library used in many servers.
We see an insertion into an indexed array called pending_ack_callbacks_:
pending_ack_callbacks_[i].shard.insert({ack_cb, ack_callback});
Its definition is an array of hashtables:
struct {
ABSL_CACHELINE_ALIGNED absl::flat_hash_map<Closure*, Closure*> shard;
} pending_ack_callbacks_[kNumPendingAckShards];
where kNumPendingAckShards = 256. Suspiciously, our stuck bits were 0xff = kNumPendingAckShards - 1, and the sharding function shows why:
/* static */
int InternalSubscriber::ClosureAddressToShard(Closure *address) {
absl::flat_hash_set<Closure*>::hasher h;
return h(address) % kNumPendingAckShards;
}
For the hashtable of any given shard, all of the hashes share the same bottom eight bits by definition. The hashtable uses those same bits to determine where to insert new values, so we see worse performance due to collisions.
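To make the interaction concrete, here is a minimal standalone sketch (ShardOf and BucketOf are hypothetical names, not the real sharding or SwissMap code). When the shard index and the bucket index both reduce the same hash modulo a power of two, every key routed to a shard lands in the same bucket of a small table:

```cpp
#include <cassert>
#include <cstddef>

constexpr size_t kNumShards = 256;

// Hypothetical sketch: the sharding scheme and a power-of-two-sized table
// both derive their index from the same hash's low bits.
size_t ShardOf(size_t hash) { return hash % kNumShards; }
size_t BucketOf(size_t hash, size_t capacity /* power of two */) {
  return hash & (capacity - 1);
}
```

Every hash routed to shard 7 (7, 263, 519, ...) also maps to bucket 7 of a 256-slot table, so lookups within that shard degenerate toward a linear probe sequence.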
This was fixed by salting the hash that determines sharding:
absl::Hash<std::pair<Closure*, int>> h;
return h(std::pair(address, 42)) % kNumPendingAckShards;
This salting technique provides distinct hash functions for shard lookup and table lookup. This halved the cost of insertions in this library.
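The effect of salting can be sketched without Abseil. Mix below is a splitmix64-style finalizer standing in for a well-mixed hash, and the function names are illustrative, not the library's:

```cpp
#include <cassert>
#include <cstdint>

// splitmix64 finalizer, standing in for a well-mixed hash function.
uint64_t Mix(uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Unsalted: the shard index reuses the same low bits the table will use.
int UnsaltedShard(uint64_t address) {
  return static_cast<int>(Mix(address) % 256);
}

// Salted: mixing in a constant yields an effectively independent hash
// stream, analogous to hashing std::pair(address, 42) with absl::Hash.
int SaltedShard(uint64_t address) {
  return static_cast<int>(Mix(Mix(address) ^ 42) % 256);
}
```

Across many addresses, the salted and unsalted shard assignments agree only about 1 time in 256, i.e. the two hash streams are decorrelated.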
ProcessWatchDone

In a different library, we found an unusual pattern of stuck bits:
0x4030802800000000. Before it was fixed, we found the declarations:
typedef absl::node_hash_map<NamespaceKey, int64, NamespaceKeyHash>
NsKeyRequestIdMap;
Since the code previously used std::unordered_map and NamespaceKey wasn’t
hashable with std::hash, it specified a custom hasher, NamespaceKeyHash,
which used GoodFastHash:
size_t NamespaceKeyHash::operator()(const NamespaceKey& ns_key) const {
return GoodFastHash<NamespaceKey>()(ns_key);
}
NamespaceKey is a pair of Cords. But GoodFastHash does not properly mix
bits for std::pair:
typedef std::pair<absl::Cord, absl::Cord> NamespaceKey;
Migrating this to absl::Hash fixed the problem by properly mixing the bits of
the std::pair.
When a hashtable has more collisions, it has to perform more probes to find a particular element during a lookup. In the worst case, lookup performance degrades from O(1) toward O(n). The extra memory loads may also be uncached, further hurting performance.
A perfect probe length is 0, which means we found the object right where we
first looked. If the average probe length is greater than 0.1, then the table is
probing more often than should be necessary as ~10% of keys are encountering
collisions. We can calculate average probe length by dividing the
total_probe_length by the sum of the num_erases and size (these two
numbers capture the total number of elements that have been inserted into the
table).
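As a worked example with hypothetical numbers, the average probe length is just total_probe_length divided by the sum of num_erases and size:

```cpp
#include <cassert>
#include <cmath>

// Average probe length from the hashtablez tags: total_probe_length over
// the number of elements inserted since the last rehash.
double AverageProbeLength(double total_probe_length, double num_erases,
                          double size) {
  return total_probe_length / (num_erases + size);
}
```

A table with total_probe_length = 50, 400 live elements, and 100 erases averages 50 / 500 = 0.1 probes per element, right at the threshold described above.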
PortMapperImpl::UpdateActiveMap

This table showed a max probe length of 133 and an average probe length of 8.4, effectively performing a linear search on many lookups.
We can look at where we are mutating the table:
// Allocate new ConnLabel element and update counters
iter = active_map_->find_or_insert(*flow);
iter->second = new ConnLabel;
active_map_’s definition points us to:
typedef priority_hash_map<IpFlow,
ConnLabel*,
IpFlowPrioritizer> ClientPortMap;
priority_hash_map is a custom type wrapping a SwissMap, but with
less-than-ideal defaults:
template<class _Key,
class _Val, // NOLINT - "Failed to find complete declaration of
// class _Val [build/class] [5]"
class _PriorityFunc,
class _PriorityKey = int,
class _PriorityHash = hash<_PriorityKey>,
class _PriorityEqual = std::equal_to<_PriorityKey>,
class _PriorityCompare = std::less<_PriorityKey>,
class _KeyHash = hash<_Key>,
class _KeyEqual = std::equal_to<_Key> >
class priority_hash_map {
public:
typedef absl::flat_hash_map<_Key, _Val, _KeyHash, _KeyEqual> hash_map_type;
...
_PriorityHash and _KeyHash are using hash<>, which provides a custom hash
specialization:
template<> struct hash<IpFlow> {
size_t operator()(const IpFlow& flow) const {
return flow.remote_ip() ^
(flow.proto() << 24) ^ flow.local_port() ^
(flow.remote_port() << 16);
}
};
The xor‘d bits can lead to poor mixing. For example, all you’d need for a
collision is an IP 10.0.0.33 on port 2038 and another IP 10.0.0.32 on port
2039. To fix this, we implemented AbslHashValue for IpFlow so
priority_hash_map could switch to absl::Hash:
// Allow this to be used as a hash map key.
template<typename H>
friend H AbslHashValue(H h, const IpFlow& flow) {
return H::combine(
std::move(h),
flow.remote_ip(),
flow.proto(),
flow.local_port(),
flow.remote_port());
}
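As a sanity check on the collision claim above, the weak hash can be reconstructed over plain integers (the field widths are assumed here: a 32-bit IPv4 address, an 8-bit protocol, and 16-bit ports) and evaluated on the two flows:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Standalone reconstruction of the weak IpFlow hash, with assumed widths.
size_t WeakIpFlowHash(uint32_t remote_ip, uint8_t proto, uint16_t local_port,
                      uint16_t remote_port) {
  return remote_ip ^ (static_cast<size_t>(proto) << 24) ^ local_port ^
         (static_cast<size_t>(remote_port) << 16);
}

// Pack dotted-quad octets into a 32-bit address.
constexpr uint32_t IPv4(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
  return (a << 24) | (b << 16) | (c << 8) | d;
}
```

Flows (10.0.0.33, local port 2038) and (10.0.0.32, local port 2039) with the same protocol and remote port hash identically, because the differing low IP byte and low port bits cancel under xor.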
For each profiled hashtable, we record statistics in several tags:

- capacity tells the exact capacity of the hashtable.
- size is the current number of elements in the hashtable.
- The probe length tags (max_probe_length, total_probe_length) tell us how many extra lookups were needed to find an element. In an ideal hashtable, we find the element we are looking for on the first lookup (probe length = 0). If there are collisions, we’ll have a higher probe length. max_probe_length tells us the worst-case probe length for any element in the hashtable. total_probe_length is the sum of probe lengths for all elements of the hashtable.
- stuck_bits is a bitmask that indicates any bits for which the hash codes in the table have only ever seen one value (either zero or one). For a good hash function this should be 0 for any table with more than ~10 elements. This is implemented as (running & of all hashes) | ~(running | of all hashes). A stuck bit indicates that our hash function may not be providing sufficient entropy.
- num_rehashes records the number of times the table has been rehashed. Calling reserve with an appropriate size (close to the true size of the table) can reduce the number of rehashes.
- max_reserve records the maximum size passed to a call to reserve for the instance, or 0 if no such call has occurred. This can be useful for identifying tables that are too large (size << capacity) because reserve was called with too large a size. Conversely, tables with many rehashes can save on rehashing costs by calling reserve with a sufficient size.
- num_erases is the number of elements that have been erased from the hashtable since the last rehash. The sum of num_erases and size indicates the number of elements added to the table since the last rehash.
- inline_element_size is the size of the elements of the flat part of the array. For flat hashtables, this is the size of the key-value pair. For node hashtables, this is the size of the pointer to the key-value pair.
- key_size is equal to sizeof(key_type) for the table.
- value_size is equal to sizeof(value_type) for the table. On sets, sizeof(key_type) == sizeof(value_type). On maps, value_type holds both key_type and mapped_type with appropriate padding for alignment.
- soo_capacity is the number of elements in the table’s small-object optimization (SOO) storage.
- table_age reports the age of the table in microseconds.

One additional safeguard is that we can run the profiler while running our tests. This requires running with the environment variable ABSL_CONTAINER_FORCE_HASHTABLEZ_SAMPLE=1.
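The num_rehashes and max_reserve tags point at a simple remedy for rehash-heavy tables. The effect is easy to demonstrate with std::unordered_map (not SwissMap, but reserve has the same contract):

```cpp
#include <cstddef>
#include <unordered_map>

// Returns how much the bucket count grew while inserting n elements.
// Reserving up front sizes the table for the expected element count, so
// inserting up to that many elements triggers no rehash.
std::size_t BucketGrowthDuringInsert(std::size_t n, bool reserve_first) {
  std::unordered_map<int, int> m;
  if (reserve_first) m.reserve(n);
  const std::size_t before = m.bucket_count();
  for (std::size_t i = 0; i < n; ++i) m[static_cast<int>(i)] = 0;
  return m.bucket_count() - before;  // 0 means no rehash occurred
}
```

With reserve_first set, the bucket count never grows during the inserts; without it, the table rehashes repeatedly as it expands from its small default capacity.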
Since a detected problem triggers a test failure, the threshold for alerting is fairly high. Note also that unit tests may not insert enough elements into their hashtables to produce collisions.
Finding issues in hashtables from a CPU profile alone is challenging since opportunities may not appear prominently in the profile. SwissMap’s built-in hashtable profiling lets us peer into a number of hashtable properties to find low-quality hash functions, collisions, or excessive rehashing.