Performance Tip of the Week #60: In-process profiling: lessons learned

Originally posted as Fast TotW #60 on June 6, 2022

Updated 2025-03-24

Google-Wide Profiling collects data not just from our hardware performance counters, but also from in-process profilers.

In-process profilers can give deeper insights about the state of the program that are hard to observe from the outside, such as lock contention, where memory was allocated, and the distribution of collisions on a hashtable. In this tip we discuss how to determine that a new profiler is necessary, and the best practices for producing one.

Overview

“The purpose of computing is insight, not numbers.” – Richard Hamming

Developing a new profiler and augmenting existing ones allows us to have more information to make optimization decisions and aid debugging. The goal isn’t to have perfect information and to make perfect decisions, but to make better decisions faster, shortening our “OODA loop” (Observe Orient Decide Act). The value is in pulling in the area-under-curve and landing in a better spot. An “imperfect” profiler that can help make a decision is better than a “perfect” profiler that is unwieldy to collect for performance or privacy reasons. Extra information or precision is only useful insofar as it helps us make a better decision or changes the outcome.

For example, most new optimizations to TCMalloc start from adding new data points to TCMalloc’s statistics that are collected by visiting malloc profile handlers across the fleet. This information helps with understanding the scope of a particular phenomenon. After landing an optimization, these metrics can help provide indicators that we changed what we set out to change, even if the actual CPU and RAM savings might be measured by other means. These steps didn’t directly save any CPU usage or bytes of RAM, but they enabled better decisions. Capabilities are harder to directly quantify, but they are the motor of progress.

Leveraging existing profilers: the “No build” option

Developing a new profiler takes considerable time, both in terms of implementation and wallclock time to ready the fleet for collection at scale. Before moving to implement one, it is valuable to consider whether we can derive the necessary information from existing profilers and tools we already have.

For example, if the case for hashtable profiling was just reporting the capacity of hashtables, then we could also derive that information from heap profiles, TCMalloc’s heap profiles of the fleet. Even where heap profiles might not be able to provide precise insights–the actual “size” of the hashtable, rather than its capacity–we can make an informed guess from the profile combined with knowledge about the typical load factors due to SwissMap’s design.

It is important to articulate the value of the new profiler over what is already provided. A key driver for hashtable-specific profiling is that the CPU profiles of a hashtable with a bad hash function look similar to those with a good hash function. The added information collected for stuck bits helps us drive optimization decisions we wouldn’t have been able to make. The capacity information collected during hashtable-profiling is incidental to the profiler’s richer, hashtable-specific details, but wouldn’t be a particularly compelling reason to collect it on its own given the redundant information available from ordinary heap profiles.

Sampling strategies

A key design aspect of a profiler is deciding when and how to collect information. Most profilers do some kind of sampling to provide an estimate of the total without the overhead of recording every event. Collecting some data, even if heavily sampled, can be useful for gauging behaviors of a library at scale. There are two aspects to a profiler that need to be decided up front:

Duration versus duration-less: Several of our profilers track sampled events over a period of time. Other profilers capture an instantaneous snapshot of the program’s state.

Duration-less handlers are profiling during the entire program lifetime, which imposes a higher bar on the stability and overhead that can be required. In contrast, a profiler that is only active during collection can be more expensive, as collecting itself is rare.
Sampling strategy: The overheads of capturing data about every instance of a class can be prohibitive, so it’s important to determine a strategy for sampling - and figure out how that sampling strategy can be scaled to provide information representative of the entire fleet.

Sampling operations can make-or-break the feasibility of a profiler: Our compression profiler originally recorded statistics about every compression operation during its collection window, but the high overhead caused major services to turn off the profiler altogether. It was fixed by moving to a sampling strategy to only record statistics on a subset of compression operations during the profiling window. This allowed the profiler to be reenabled.

While knobs are often undesirable, allowing applications to tune their sampling rate (or whether sampling occurs at all) can be helpful for balancing the information gained against the overheads imposed.

Unless there is a justified exception, we require that the profiler applies the sampling factor back to the data to “unsample” it before returning it. This allows consumers to easily use the data without having to deal with the sampling arithmetic themselves. This is especially important as sampling rate can be variable–either automatically adjusted or tunable via a configuration knob such as a flag. This step can also help with validation via cross-checking with other profilers. For example, SwissMap’s total memory allocations seen by TCMalloc’s heap profiles are consistent with the total capacity seen by the hashtable profiler.

Choosing the right sampling approach (and the unsampling counterpart) needs to carefully balance accuracy vs. overhead. For example, with the heap profiler in TCMalloc one might decide to simply pick every Nth allocation. But that would not work well: in a typical app memory profiles are dominated by many small allocations and sampling those with reasonable overhead would require high sampling factor. It is also likely to miss more rare large allocations. Interestingly, the next obvious improvement of sampling an allocation every N bytes would “almost work” but is subject to statistical bias. This was fixed by introducing Poisson sampler which is used to date.

For libraries with significant global process state, the threads running in a process or the state of malloc, we may use a more exhaustive strategy. For example, a profiler could snapshot and summarize the state of the library without further sampling.

What data to record

In addition to choosing a sampling strategy, we need to decide what data to collect. We want to choose data that will influence an optimization decision. Just as different optimization projects have varying returns on investment, we want to strike a balance between the cost of implementing our profiler, of running it, and implementing the optimizations it motivates.

Mutation operations can be an excellent place to record additional statistics on sampled instances. These are frequently heavyweight for unrelated reasons–they trigger copies and reallocations–so checking whether to record statistics has minimal added performance penalty. This is the strategy we use for many of our existing profilers. In contrast, non-mutating operations, such as hashtable lookups, can be prohibitively expensive as we use these operations frequently and rely on them being fast.

There is a cost-benefit tradeoff for having more information. Sampling more frequently or collecting more data with each sample can paint a richer picture, but this increases the runtime cost of profiling. TCMalloc’s heap profiling has low, but non-zero costs, but it more than pays for itself by allowing us to look at where much of our RAM usage goes. Increasing the sampling rate would give us extra precision, but it wouldn’t materially affect the optimizations we can uncover and deploy. The extra overheads would negatively impact performance.

More practically, a minimal set of information can be a good starting point for getting a new profiler up and running to start debugging it. While obvious in hindsight, several new profilers have hit issues with their stack trace collection and filtering. While collecting more data can give additional insights, implementations that compute too many statistics or add contended locks may simply be infeasible. A profiler that is too expensive to leave enabled may be worse than no profiler at all: We spend time implementing it and rolling it out, but we lose the visibility into the library usage that we were after in the first place.

Summary

Profilers are a useful tool for probing the internal state of a program to answer questions during debugging and optimization. The types of questions posed can greatly influence the design and architecture of a profiler.

While a particular design may not be able to answer all questions, all at once, the goal is ultimately to make better decisions faster, shortening our “OODA loop” (Observe Orient Decide Act). Just as optimization projects are framed in terms of return-on-investment, we can frame how additional information influences or changes course of a decision.

Performance Guide

Fast Tips

Fast TotW #7: Optimizing for application productivity

Fast TotW #9: Optimizations past their prime

Fast TotW #21: Improving the efficiency of your regular expressions

Fast TotW #39: Beware microbenchmarks bearing gifts

Fast TotW #52: Configuration knobs considered harmful

Fast TotW #53: Precise C++ benchmark measurements with Hardware Performance Counters

Fast TotW #60: In-process profiling: lessons learned

Fast TotW #62: Identifying and reducing memory bandwidth needs

Fast TotW #64: More Moore with better API design

Fast TotW #70: Defining and measuring optimization success

Fast TotW #72: Optimizing optimization

Fast TotW #74: Avoid sweeping street lights under rugs

Fast TotW #75: How to microbenchmark

Fast TotW #79: Make at most one tradeoff at a time

Fast TotW #83: Reducing memory indirections

Fast TotW #87: Two-way doors

Fast TotW #88: Measurement methodology: Avoid the jelly beans trap

Fast TotW #90: How to estimate