Originally posted as Fast TotW #95 on July 1, 2025
Updated 2025-07-14
Quicklink: abseil.io/fast/95
Shared resources can cause surprising performance impacts on seemingly unchanged software. In this episode, we discuss how to anticipate these effects and externalities.
Workload changes can confound longitudinal analysis: If you optimize a library like protocol buffers, does spending more time in that code mean your optimization didn’t work or that the application now serves more load?
A/B tests can control for independent variables. Nevertheless, load balancing can throw a wrench into this. A client-side load balancing algorithm (like Weighted Round Robin) might observe the better performance of some tasks and send more requests to them.
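None of this is specific to any particular load balancer; as a hypothetical sketch, consider a picker that weights each backend by observed efficiency (queries served per CPU-second is one common choice). Backends sped up by the experiment report better efficiency and absorb proportionally more of the offered load, so the two arms no longer see equal work.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical per-backend statistics observed by the client.
struct BackendStats {
  double queries_per_sec = 0;
  double cpu_utilization = 1e-9;  // Guard against division by zero.
};

// Picks the backend for the next request with probability proportional to
// observed efficiency: faster (e.g. experiment-arm) backends get more load.
size_t PickBackend(const std::vector<BackendStats>& backends,
                   std::mt19937& rng) {
  std::vector<double> weights;
  weights.reserve(backends.size());
  for (const BackendStats& b : backends) {
    weights.push_back(b.queries_per_sec / b.cpu_utilization);
  }
  std::discrete_distribution<size_t> pick(weights.begin(), weights.end());
  return pick(rng);
}
```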
For individual jobs, normalizing by useful work done lets us compare throughput-per-CPU or throughput-per-RAM, controlling for workload shifts.
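As a concrete sketch, with hypothetical counter names, the idea is simply to divide useful work by the resource consumed over the same window:

```cpp
// Hypothetical per-job counters sampled over the same time window.
struct JobSample {
  double queries_completed = 0;  // Useful work done in the window.
  double cpu_seconds_used = 0;   // CPU consumed in the window.
  double avg_ram_gib = 0;        // Average RAM footprint in the window.
};

// Work per unit of resource: if offered load doubles, these ratios stay flat,
// whereas raw CPU or RAM usage would not.
double QueriesPerCpuSecond(const JobSample& s) {
  return s.queries_completed / s.cpu_seconds_used;
}
double QueriesPerGiB(const JobSample& s) {
  return s.queries_completed / s.avg_ram_gib;
}
```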
Where we have shared resources at the process, machine, or even data center level, we want to design an experimentation scheme that allows us to treat the shared resources as part of our experiment and control groups too.
Cache misses carry two costs for performance: they cause our code to wait for the miss to resolve, and every miss adds pressure on the memory subsystem shared with other code that we are not closely studying.
In a previous episode, we discussed an optimization to move a hot absl::node_hash_set to absl::flat_hash_set. This change only affected one particular type of query going through the server. While we saw 5-7% improvements for those queries, completely unmodified code paths for other query types also showed an improvement, albeit a smaller one.
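The change itself had roughly this shape; the element type below is a stand-in rather than the type from that episode:

```cpp
#include <cstdint>

#include "absl/container/flat_hash_set.h"
#include "absl/container/node_hash_set.h"

// Stand-in for the hot key type in the original change.
using HotKey = uint64_t;

// Before: node-based storage heap-allocates each element separately, so a
// lookup chases a pointer and often takes an extra cache miss.
using HotKeySetBefore = absl::node_hash_set<HotKey>;

// After: open addressing stores elements inline in the table, so lookups
// touch fewer, more predictable cache lines.
using HotKeySetAfter = absl::flat_hash_set<HotKey>;
```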
Measuring the latter effect required process-level partitioning, which selected
some processes to always use absl::node_hash_set and other processes to always
use absl::flat_hash_set. Without this technique, all requests handled by the
server coexist with a 50:50 mix of modified/unmodified code paths. The modified
code path would have shown a performance improvement from fewer cache misses on
its lookups. For query types unaffected by the change, their data structures
would have seen the same mix of cache pressure from neighboring modified and
unmodified requests, rather than ever seeing a 100% modified or 100% unmodified
neighbor. Per-request randomization would prevent us from measuring the “blast
radius” impact of our change. This happens all the time for any change that uses
a shared resource (aka most changes).
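One way to get that partitioning is to choose the arm once per process from a stable identity, rather than per request. A minimal sketch, with a hypothetical helper:

```cpp
// Hypothetical helper: derive the experiment arm from a stable per-process
// identity such as the task index, decided once at startup. Every request a
// given process handles then sees a 100% modified or 100% unmodified world,
// including the shared caches and memory bandwidth that requests contend on.
bool UseFlatHashSetInThisProcess(int task_index) {
  return task_index % 2 == 0;
}
```

Any stable, roughly balanced assignment works; the key property is that the choice is fixed for the lifetime of the process instead of being re-randomized per request.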
This challenge of experiment design can be especially pronounced with load tests. Unless we exercise the right distribution of request types concurrently, we won't see the full impact of our change, both positive and negative.
Before Temeraire, TCMalloc's hugepage-aware allocator, many teams had opted not to release memory periodically from TCMalloc's page heap, maximizing hugepage availability and CPU performance. Other, less CPU-intensive workloads released memory periodically, striking a different balance between CPU usage and the RAM held as free pages on TCMalloc's page heap.
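One way to express that tradeoff, assuming the open-source tcmalloc::MallocExtension interface, is the background release rate; a sketch with arbitrary example rates:

```cpp
#include "tcmalloc/malloc_extension.h"

// Sketch: a CPU-bound job keeps free pages on the page heap to preserve
// hugepage coverage, while a RAM-sensitive job trades CPU for memory by
// releasing continuously. The rates below are arbitrary examples, not
// recommendations.
void ConfigureTcmallocRelease(bool prioritize_cpu) {
  using tcmalloc::MallocExtension;
  MallocExtension::SetBackgroundReleaseRate(
      prioritize_cpu ? MallocExtension::BytesPerSecond{0}
                     : MallocExtension::BytesPerSecond{32 * 1024 * 1024});
}
```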
Several additional optimizations have since modified the behavior, but the initial rollout sought to maintain the same “release on request” characteristic of the old page heap to minimize tradeoffs. Ensuring that almost all applications used no more memory than they previously did allowed us to focus on CPU performance and a handful of edge cases in allocation patterns.
When we ran A/B experiments prior to the fleetwide rollout of Temeraire, we saw improvements in hugepage coverage, even for applications that did not periodically release memory.
Managing memory in a hugepage-aware fashion everywhere improved performance even where we did not anticipate substantial benefits. The new allocator allowed whole hugepages to be returned, avoiding physical memory fragmentation altogether, and allowed us to iteratively target already fragmented pages to satisfy requests.
These added benefits were only recognized through A/B testing at the machine level. While we had enabled Temeraire with the help of many eager early adopters, during those experiments we could only see the first-order effect (the direct impact on the modified application) and not the second-order effect (better-behaved neighbors).
In larger, distributed serving stacks, we might go further and partition resources to avoid confounding our results.
Managing and reducing contention for shared resources is a large source of performance opportunities. Nevertheless, fully recognizing these opportunities requires careful experiment design, so that confounding factors do not obscure the real benefits.