Originally posted as Fast TotW #93 on May 27, 2025
Updated 2025-06-03
Quicklink: abseil.io/fast/93
Techniques like presubmits are essential tools for effective software engineering. For performance optimization, the freedom to change behaviors that are observable but unspecified is a large source of opportunities. In this episode, we discuss strategies for leaning on automation and additional tools to make it easier to evolve software.
Use technical means, rather than (eventually fallible) humans, to prevent problems. Humans can be a valuable line of defense for unknown unknowns, but they cannot be the only line of defense for everything. While human-executed checklists are great, they must be kept short, which makes them unsuitable for encoding deep knowledge about a problem domain. It can be tempting to add to the checklist after every incident and never subtract from it, but this leads to inevitable toil or skipped steps. Robots never get tired.
When we introduced `GoodFastHash` as a faster alternative to `std::hash`, it was intended that we could change the algorithm from time to time. This property was documented in a comment. Over time, we accreted code depending on its existing behavior, ranging from brittle tests relying on the stable iteration order of `std::unordered_map` paired with `GoodFastHash` to production designs that relied on it for stable cache keys. This stymied performance and hash quality improvements to `GoodFastHash`.
Automated testing is a well-known software technique, but it can be worth bringing automation to bear in new ways on new problems to shift left. Consider an example outside of performance: many outages have been caused by large, unanticipated changes in configurations. For example, a 100.0% reduction in configuration size is probably a bug (and an outage waiting to happen). An empty configuration file can be detected mechanically, leaving humans to worry about the much more subtle gotchas. Presubmit checks that detect and block large diffs without an explicit exception can stop many issues in their tracks.
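As an illustration, a check along these lines is mechanical to write. The sketch below is hypothetical; the function names and the 50% threshold are illustrative, not a real presubmit API:

```cpp
#include <cstddef>
#include <string>

// Result of a hypothetical presubmit check.
struct ConfigCheckResult {
  bool ok;
  std::string message;
};

// Flag suspiciously large shrinkage in a configuration file.
ConfigCheckResult CheckConfigSize(const std::string& old_config,
                                  const std::string& new_config) {
  if (new_config.empty() && !old_config.empty()) {
    return {false, "New configuration is empty; this is almost always a bug."};
  }
  // Block diffs that remove more than half of the configuration unless the
  // change carries an explicit exception (e.g., a tag in its description).
  if (new_config.size() * 2 < old_config.size()) {
    return {false,
            "Configuration shrank by more than 50%; add an explicit "
            "exception if this is intentional."};
  }
  return {true, ""};
}
```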
We can apply automation towards keeping code flexible over the long run, enabling us to make it more efficient while simultaneously ensuring reliability.
As the complexity of our software stack grows, tending to every corner reliably and manually is challenging. An API might promise one thing, but the software built on top of it can come to rely on implementation details from much farther down the stack.
The very first C++ TotW was for `string_view`, which can improve performance by avoiding string copies. This makes it attractive for regular usage, but it is bug-prone due to dangling references.
Reference-like types like `absl::string_view` and `absl::Span` allow us to write code agnostic to the underlying data storage. A string could be backed by constant data in a global or a heap-allocated copy. A container of N elements might be stored as an `absl::InlinedVector`, removing a heap indirection in the common case.
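For illustration, here is a minimal sketch of storage-agnostic code using these types (the function names are ours, not from any particular codebase):

```cpp
#include <string>

#include "absl/container/inlined_vector.h"
#include "absl/strings/string_view.h"
#include "absl/types/span.h"

// The callee neither knows nor cares where the bytes live.
bool StartsWithSlash(absl::string_view path) {
  return !path.empty() && path.front() == '/';
}

int Sum(absl::Span<const int> values) {
  int total = 0;
  for (int v : values) total += v;
  return total;
}

void Demo() {
  // Backed by constant data in the binary's read-only segment.
  StartsWithSlash("/etc/config");

  // Backed by a heap-allocated std::string.
  std::string owned = "/tmp/scratch";
  StartsWithSlash(owned);

  // Up to four elements live inline, with no heap indirection.
  absl::InlinedVector<int, 4> small = {1, 2, 3};
  Sum(small);
}
```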
As with other references, these types come with downsides: the usage of the reference might not coincide with the lifetime of the underlying storage, leading to use-after-free/use-after-return bugs. While sanitizers can detect these bugs, they can have false negatives depending on test coverage. Fortunately, source code annotations like `ABSL_ATTRIBUTE_LIFETIME_BOUND` can find many of them at compile time. These annotations allow us to shift left on finding bugs, allowing the engineer actively writing code to fix the issue right away, rather than waiting for a production outage with its commensurately higher costs.
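A minimal sketch of the annotation in action (the function names are illustrative):

```cpp
#include <string>

#include "absl/base/attributes.h"
#include "absl/strings/string_view.h"

// The annotation tells the compiler that the returned view points into `s`.
absl::string_view FirstLine(
    const std::string& s ABSL_ATTRIBUTE_LIFETIME_BOUND) {
  return absl::string_view(s).substr(0, s.find('\n'));
}

std::string MakeGreeting() { return "hello\nworld"; }

void Demo() {
  // OK: `owned` outlives the view.
  std::string owned = MakeGreeting();
  absl::string_view ok = FirstLine(owned);
  (void)ok;

  // With the annotation, Clang warns at compile time: the temporary returned
  // by MakeGreeting() dies at the end of this statement, leaving `dangling`
  // pointing at freed memory.
  absl::string_view dangling = FirstLine(MakeGreeting());
  (void)dangling;
}
```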
Over time, we’ve developed several techniques for warding off Hyrum’s Law. While there are any number of implementation details we can obfuscate, the following techniques have proven useful:
Memory addresses for low-cost entropy: Taking the address of a global is extremely efficient (a `rip`-relative `lea` on x86) and powers `absl::Hash`'s randomization. Having learned from the lessons of `GoodFastHash`, `absl::Hash` deliberately used a random number to seed its hash computation. This ensured tests were more robust because they could not rely on fragile implementation details. Consequently, it has been possible to land several optimizations to replace its implementation wholesale without being blocked by brittleness.
This introduces just enough entropy from ASLR and linker orderings to make it hard to rely on its values across tests and production tasks. Similarly, heap allocations can provide instance-specific entropy. We use this in SwissMap to ensure different table instances have different iteration orders.
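A minimal sketch of the idea, with hypothetical helper names (this is not `absl::Hash`'s actual implementation):

```cpp
#include <cstddef>
#include <cstdint>

namespace {
// Anchor whose address varies run-to-run with ASLR and build-to-build with
// linker ordering.
const char kSeedAnchor = 0;
}  // namespace

// Per-process seed: compiles to a single rip-relative lea on x86-64.
inline std::size_t PerProcessSeed() {
  return reinterpret_cast<std::uintptr_t>(&kSeedAnchor);
}

// Per-instance seed: two containers at different heap addresses get
// different iteration orders.
inline std::size_t PerInstanceSeed(const void* self) {
  return reinterpret_cast<std::uintptr_t>(self);
}

// Mix the seeds into an otherwise deterministic hash value.
inline std::size_t SeededHash(std::size_t h, const void* self) {
  return h ^ PerProcessSeed() ^ PerInstanceSeed(self);
}
```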
Extra checks in debug builds / sampling: We want to maintain implementation freedom to ensure we can improve performance in the future. However, expensive checks would reduce performance if done in optimized builds, defeating part of the purpose. Instead, we can put most of our checks in debug builds.
Sized delete from C++14 allows our deallocation path to avoid a costly radix-tree walk, but incorrect sizes would lead to data corruption. In debug builds, TCMalloc checks the provided size against its internal metadata ground truth. TCMalloc already samples allocations periodically to drive heap profiling, allowing it to run extra checks on these sampled objects and proactively find further issues in production.
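A minimal sketch of the debug-build cross-check, using glibc's `malloc_usable_size()` as a stand-in for the allocator's internal metadata (this is not TCMalloc's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <new>

#include <malloc.h>  // malloc_usable_size (glibc)

// In debug builds, cross-check the caller-provided size against ground
// truth; in optimized builds (NDEBUG) the assert compiles away and `size`
// lets the fast path skip the radix-tree lookup entirely.
void SizedDeallocate(void* ptr, std::size_t size) {
  // A wrong size would otherwise silently corrupt allocator state.
  assert(size <= malloc_usable_size(ptr));
  ::operator delete(ptr, size);
}
```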
Counterfactual checks: In sanitizer builds, we check that SwissMap’s iterators remain valid after any potential invalidations. Even if a particular insert doesn’t cause a rehash but could have caused one, the iterator is considered invalid. This allows us to prevent dependencies on SwissMap’s growth rate, iterator invalidation behaviors, and so forth, even if the present implementation used in day-to-day builds overdelivers against its guarantees.
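A minimal sketch of a counterfactual check using a generation counter (illustrative only, not SwissMap's actual mechanism):

```cpp
#include <cstddef>

#include "absl/base/config.h"

class Table {
 public:
  struct Iterator {
    const Table* table;
    std::size_t generation;
    // An iterator is valid only if no potentially invalidating operation
    // has happened since it was created.
    bool IsValid() const { return generation == table->generation_; }
  };

  Iterator begin() const { return {this, generation_}; }

  void Insert(int value) {
#ifdef ABSL_HAVE_ADDRESS_SANITIZER
    // Counterfactual: some conforming implementation might rehash on *any*
    // insert, so treat every insert as invalidating, not just those that
    // actually grow the table.
    ++generation_;
#endif
    // ... actual insertion; a real table rehashes only when the load
    // factor demands it ...
    (void)value;
  }

 private:
  std::size_t generation_ = 0;
};
```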
Enabling these defenses widely means we get benefits from everyone’s tests. These approaches ensure that code does not depend on characteristics that are sensitive to implementation details, so we can change implementations quickly without breaking code.
Tools like Clang-Tidy can spot problematic patterns and flag them at code review time. This can shift left on preventing problems, rather than waiting for a failed runtime check in production.
For example, some protobuf optimizations are sensitive to misuse of `const_cast`. This misuse is partly stopped by placing default instances in global, read-only storage, causing the program to crash if the instance is mutated. Since it isn't ideal to crash on potentially under-tested code paths and because this technique doesn't cover all types of misuse, a Clang-Tidy check can flag `const_cast` where the type is a protobuf instance.
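For illustration, here is the kind of pattern such a check would flag, using a hypothetical stand-in for a generated message type:

```cpp
// Stand-in for a generated protobuf message type (hypothetical; real code
// would include the generated *.pb.h header).
class Request {
 public:
  void set_deadline_ms(int ms) { deadline_ms_ = ms; }

 private:
  int deadline_ms_ = 0;
};

// The pattern such a Clang-Tidy check flags: casting away const on a
// protobuf. If `req` aliases the default instance in read-only storage,
// the write crashes; elsewhere it silently violates optimizer assumptions.
void HandleRequest(const Request& req) {
  auto& mutable_req = const_cast<Request&>(req);  // flagged by the check
  mutable_req.set_deadline_ms(1000);              // potential crash
}
```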
Matt Kulukundis popularized the phrase “ratchet-and-pawl” to describe incremental migrations that prevent backsliding. This allows us to make progress where we can, secure in the knowledge that the problem will not get worse.
During the SwissMap migration, it was infeasible to randomize all existing unordered containers due to failing tests and actual production dependencies on the current behavior. Individual instances could be migrated where tests passed (or were made to pass). As the migration progressed, visibility allowlists on lesser-used containers (`dense_hash_map`) and Clang-Tidy on more common ones (`std::unordered_map`) reduced new usages of the legacy containers. An allowlist with a thousand entries might seem inelegant, but it's a powerful tool for preventing backsliding. The list can be ratcheted down over time by hand or by automated cleanups as uses are removed.
Preserving implementation freedom carries costs, whether performance overheads from the defenses themselves, opportunity costs, or simply worse ergonomics.
With hashtables, changing hash algorithms, initial sizes, and growth rates had come up time and time again, stymied by brittle tests, prior to the introduction of SwissMap and Abseil Hash. SwissMap using its heap allocation's address as an entropy source makes copies more expensive for primitive types: rather than simply `memcpy`'ing the raw data, we need to rehash our keys. Making the order more deterministic could let us improve copy performance, but the benefits of randomization outweigh these small costs.
As useful as it is to maintain as much implementation flexibility as we can, it is important to focus on the implementation details most likely to change. Just because we can obscure an implementation detail doesn't necessarily mean it's worth the runtime and engineering cost if the detail is unlikely to ever change, or if changing it would bring little value.
For large, performance-critical optimizations, we may want to carefully test that the desired performance characteristics remain, even where the code would be functionally correct with or without the optimization.
If we like an optimization, we should put a test on it. Automation prevents regressions upfront, rather than waiting for after-the-fact debugging.
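For example, one way to pin a performance characteristic, rather than a functional behavior, is to count heap allocations in a test. A minimal sketch, assuming a replaced global `operator new` (simplified; a production version would also throw `std::bad_alloc` on failure):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

#include "absl/container/inlined_vector.h"
#include "gtest/gtest.h"

// Count global heap allocations so the test can pin the guarantee we rely
// on: small InlinedVectors never touch the heap.
static std::size_t g_allocations = 0;

void* operator new(std::size_t size) {
  ++g_allocations;
  return std::malloc(size);
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

TEST(InlinedVectorTest, SmallSizesDoNotAllocate) {
  const std::size_t before = g_allocations;
  absl::InlinedVector<int, 4> v = {1, 2, 3, 4};
  EXPECT_EQ(g_allocations, before);  // fails if inline storage goes away
}
```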
Shifting problem-finding away from humans and onto automated mechanisms lets us focus on the bigger picture and improve our velocity. When things do break, we save human time in the long run because we can spend less effort figuring out what caused a regression.