Originally posted as Fast TotW #95 on July 1, 2025
Updated 2025-07-14
Quicklink: abseil.io/fast/95
Shared resources can cause surprising performance impacts on seemingly unchanged software. In this episode, we discuss how to anticipate these effects and externalities.
Workload changes can confound longitudinal analysis: If you optimize a library like protocol buffers, does spending more time in that code mean your optimization didn’t work or that the application now serves more load?
A/B tests can control for independent variables. Nevertheless, load balancing can throw a wrench into this. A client-side load balancing algorithm (like Weighted Round Robin) might observe the better performance of some tasks and send more requests to them.
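None of this is specific to any particular load balancer; as a hypothetical sketch, consider a picker that weights each backend by observed efficiency (queries served per CPU-second is one common choice). Backends sped up by the experiment report better efficiency and absorb proportionally more of the offered load, so the two arms no longer see equal work.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical per-backend statistics observed by the client.
struct BackendStats {
  double queries_per_sec = 0;
  double cpu_utilization = 1e-9;  // Guard against division by zero.
};

// Picks the backend for the next request with probability proportional to
// observed efficiency: faster (e.g. experiment-arm) backends get more load.
size_t PickBackend(const std::vector<BackendStats>& backends,
                   std::mt19937& rng) {
  std::vector<double> weights;
  weights.reserve(backends.size());
  for (const BackendStats& b : backends) {
    weights.push_back(b.queries_per_sec / b.cpu_utilization);
  }
  std::discrete_distribution<size_t> pick(weights.begin(), weights.end());
  return pick(rng);
}
```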
For individual jobs, normalizing by useful work done lets us compare throughput-per-CPU or throughput-per-RAM, controlling for workload shifts.
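As a concrete sketch, with hypothetical counter names, the idea is simply to divide useful work by the resource consumed over the same window:

```cpp
// Hypothetical per-job counters sampled over the same time window.
struct JobSample {
  double queries_completed = 0;  // Useful work done in the window.
  double cpu_seconds_used = 0;   // CPU consumed in the window.
  double avg_ram_gib = 0;        // Average RAM footprint in the window.
};

// Work per unit of resource: if offered load doubles, these ratios stay flat,
// whereas raw CPU or RAM usage would not.
double QueriesPerCpuSecond(const JobSample& s) {
  return s.queries_completed / s.cpu_seconds_used;
}
double QueriesPerGiB(const JobSample& s) {
  return s.queries_completed / s.avg_ram_gib;
}
```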
Where we have shared resources at the process, machine, or even data center level, we want to design an experimentation scheme that allows us to treat the shared resources as part of our experiment and control groups too.
Cache misses carry two costs for performance: they cause our code to wait for the miss to resolve, and every miss adds pressure on the memory subsystem shared with other code that we are not closely studying.
In a previous episode, we discussed an optimization to move a hot absl::node_hash_set to absl::flat_hash_set. This change only affected one particular type of query going through the server. While we saw 5-7% improvements for those queries, completely unmodified code paths for other query types also showed an improvement, albeit a smaller one.
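The change itself had roughly this shape; the element type below is a stand-in rather than the type from that episode:

```cpp
#include <cstdint>

#include "absl/container/flat_hash_set.h"
#include "absl/container/node_hash_set.h"

// Stand-in for the hot key type in the original change.
using HotKey = uint64_t;

// Before: node-based storage heap-allocates each element separately, so a
// lookup chases a pointer and often takes an extra cache miss.
using HotKeySetBefore = absl::node_hash_set<HotKey>;

// After: open addressing stores elements inline in the table, so lookups
// touch fewer, more predictable cache lines.
using HotKeySetAfter = absl::flat_hash_set<HotKey>;
```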
Measuring the latter effect required process-level partitioning, which selected
some processes to always use absl::node_hash_set and other processes to always
use absl::flat_hash_set. Without this technique, all requests handled by the
server coexist with a 50:50 mix of modified/unmodified code paths. The modified
code path would have shown a performance improvement from fewer cache misses on
its lookups. For query types unaffected by the change, their data structures
would have seen the same mix of cache pressure from neighboring modified and
unmodified requests, rather than ever seeing a 100% modified or 100% unmodified
neighbor. Per-request randomization would prevent us from measuring the “blast
radius” impact of our change. This happens all the time for any change that uses
a shared resource (aka most changes).
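One way to get that partitioning is to choose the arm once per process from a stable identity, rather than per request. A minimal sketch, with a hypothetical helper:

```cpp
// Hypothetical helper: derive the experiment arm from a stable per-process
// identity such as the task index, decided once at startup. Every request a
// given process handles then sees a 100% modified or 100% unmodified world,
// including the shared caches and memory bandwidth that requests contend on.
bool UseFlatHashSetInThisProcess(int task_index) {
  return task_index % 2 == 0;
}
```

Any stable, roughly balanced assignment works; the key property is that the choice is fixed for the lifetime of the process instead of being re-randomized per request.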
This challenge of experiment design can be especially pronounced with load tests. Unless we exercise the right distribution of request types concurrently, we won't see the full impact of our change, both positive and negative.
Before Temeraire, TCMalloc's hugepage-aware allocator, many teams had opted not to release memory periodically from TCMalloc's page heap, maximizing hugepage availability and CPU performance. Other, less CPU-intensive workloads released memory periodically, striking a different balance between CPU usage and the RAM held as free pages on TCMalloc's page heap.
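One way to express that tradeoff, assuming the open-source tcmalloc::MallocExtension interface, is the background release rate; a sketch with arbitrary example rates:

```cpp
#include "tcmalloc/malloc_extension.h"

// Sketch: a CPU-bound job keeps free pages on the page heap to preserve
// hugepage coverage, while a RAM-sensitive job trades CPU for memory by
// releasing continuously. The rates below are arbitrary examples, not
// recommendations.
void ConfigureTcmallocRelease(bool prioritize_cpu) {
  using tcmalloc::MallocExtension;
  MallocExtension::SetBackgroundReleaseRate(
      prioritize_cpu ? MallocExtension::BytesPerSecond{0}
                     : MallocExtension::BytesPerSecond{32 * 1024 * 1024});
}
```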
Several additional optimizations have since modified the behavior, but the initial rollout sought to maintain the same “release on request” characteristic of the old page heap to minimize tradeoffs. Ensuring that almost all applications used no more memory than they previously did allowed us to focus on CPU performance and a handful of edge cases in allocation patterns.
When we ran A/B experiments prior to the fleetwide rollout of Temeraire, we saw improvements in hugepage coverage, even for applications that did not periodically release memory.
Managing memory in a hugepage-aware fashion everywhere improved performance even where we did not anticipate substantial benefits. The new allocator allowed whole hugepages to be returned, avoiding physical memory fragmentation altogether, and allowed us to iteratively target already fragmented pages to satisfy requests.
These added benefits were only recognized through A/B testing at the machine level. While we had enabled Temeraire with the help of many eager early adopters, during those experiments we could only see the first-order effect (the direct impact on the modified application) and not the second-order effect (better-behaved neighbors).
In larger, distributed serving stacks, we might go further and partition resources to avoid confounding our results.
Managing and reducing contention for shared resources is a large source of performance opportunities. Nevertheless, fully recognizing these opportunities requires careful experiment design, so that confounding factors do not obscure the real benefits.