Performance Tip of the Week #53: Precise C++ benchmark measurements with Hardware Performance Counters

Originally posted as Fast TotW #53 on October 14, 2021

Updated 2024-11-19

Use performance benchmarks as the first line of defense in detecting costly regressions, and as a way to guide performance improvement work. Getting to the root cause of a change in performance can be time consuming and full of “false leads”, because on modern architectures program execution is influenced by many factors.

In this episode, we present a productivity tool that helps lower the cost of performance investigations by leveraging Hardware Performance Counters to surface low-level architectural metrics. The tool is available for C++ benchmarks running on Linux, on GitHub.

What are Hardware Performance Counters?

Hardware Performance Counters are a hardware feature where you can request precise counts of events such as: instructions retired, load or store instructions retired, clock cycles, cache misses, branches taken or mispredicted, etc. See https://perf.wiki.kernel.org/index.php/Tutorial for more information.

With performance counters you get less noisy measurements when compared to time-based ones. CPU timer-based measurements are noisier, even on isolated machines, because:

performance counter measurements can be isolated to the benchmark process, and, thus, not account context switching time when the process is preempted - which is otherwise measured by the CPU timer,
we can further isolate counter increments to user mode executed instructions only, which further reduces noise due to context switching
specific counters (e.g. instruction-counting ones) inherently produce measurements that are almost noise-free (variations under 0.01%). This is because the value of such counters is independent of systemic sources of noise like frequency throttling.
finally, depending on the setup of the benchmarking machine, time-based measurements suffer from noise introduced by thermal effects, hyperthreading, or shared resource (like memory bus) access. Some counters will also suffer from noise due to these, but others - like instructions retired counters - won’t.

By selecting appropriate performance counters you can get nuanced insight into the execution of a benchmark. For instance, a measurement using CPU time that points to a regression may be caused by subtle changes in executable layout, which increases branch mispredictions. This is generally not actionable and considered acceptable. Identifying this is the case, when only looking at time measurements, is not very productive and not scalable over a large benchmark suite corpus. With performance counter-based measurements, it is immediately apparent by observing branch mispredict variations and instruction count variations, and the detection is easily scriptable.

How-to gather performance counter data

The Google Benchmark project simplifies the process of writing a benchmark. An example of its use may be seen here

The benchmark harness support for performance counters consists of allowing the user to specify up to 3 counters in a comma-separated list, via the --benchmark_perf_counters flag, to be measured alongside the time measurement. Just like time measurement, each counter value is captured right before the benchmarked code is run, and right after. The difference is reported to the user as per-iteration values (similar to the time measurement).

Basic usage

Note: counter names are hardware vendor and version specific. The example here assumes Intel Skylake. Check how this maps to other versions of Intel CPUs, other vendors (e.g. AMD), or other architectures (e.g. ARM); also refer to perfmon2 which we use for counter name resolution, and/or perf list.

Build a benchmark executable - for example, let’s use “swissmap” from fleetbench:

bazel build -c opt //fleetbench/swissmap:swissmap_benchmark

Run the benchmark; let’s ask for instructions, cycles, and loads:

bazel-bin/fleetbench/swissmap/swissmap_benchmark \
  --benchmark_filter='.*Cold.*::absl::flat_hash_set.*64.*set_size:64.*density:0' \
  --benchmark_perf_counters=INSTRUCTIONS,CYCLES,MEM_UOPS_RETIRED:ALL_LOADS

The output looks like:

Running ./swissmap_benchmark
Run on (8 X 4667.91 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 2.31, 2.08, 1.95
---------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------------------
BM_FindMiss_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0                  18.4 ns         18.4 ns     39048136 CYCLES=82.9019 INSTRUCTIONS=35.7284 MEM_UOPS_RETIRED:ALL_LOADS=6.05507
BM_FindHit_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0                   33.3 ns         33.3 ns     20600490 CYCLES=152.156 INSTRUCTIONS=55.0354 MEM_UOPS_RETIRED:ALL_LOADS=15.0034
BM_InsertHit_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0                 34.8 ns         34.8 ns     19004416 CYCLES=157.956 INSTRUCTIONS=59.0354 MEM_UOPS_RETIRED:ALL_LOADS=16.0013
BM_Iterate_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0                   33.5 ns         33.5 ns     25444389 CYCLES=152.431 INSTRUCTIONS=57.9225 MEM_UOPS_RETIRED:ALL_LOADS=13.3892
BM_InsertManyOrdered_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0         54.9 ns         54.8 ns     14141958 CYCLES=242.373 INSTRUCTIONS=111.455 MEM_UOPS_RETIRED:ALL_LOADS=33.1838
BM_InsertManyUnordered_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0       50.0 ns         50.0 ns     14234753 CYCLES=227.516 INSTRUCTIONS=111.415 MEM_UOPS_RETIRED:ALL_LOADS=33.1781

So we can see that BM_FindMiss_Cold took approximately 83 cycles, 36 instructions, and 6 memory ops per iteration.

Limitations

Number of counters: At most 32 events may be requested for simultaneous collection. Note however, that the number of hardware counters available is much lower (usually 4-8 on modern CPUs) – requesting more events than the hardware counters will cause multiplexing and decreased accuracy.
Visualization: There is no visualization available, so the user needs to rely on collecting JSON result files and summarizing the results.
Counting vs. Sampling: The framework only collects counters in “counting” mode – it answers how many cycles/cache misses/etc. happened, but not does not associate them to the code location causing the events. For that, you’d need a sampling profiler like Linux perf.

Summary

Use the --benchmark_perf_counters flag in https://github.com/google/benchmark benchmarks to quickly drill into the root cause of a performance regression, or to guide performance optimization work.

Performance Guide

Fast Tips

Fast TotW #7: Optimizing for application productivity

Fast TotW #9: Optimizations past their prime

Fast TotW #21: Improving the efficiency of your regular expressions

Fast TotW #39: Beware microbenchmarks bearing gifts

Fast TotW #52: Configuration knobs considered harmful

Fast TotW #53: Precise C++ benchmark measurements with Hardware Performance Counters

Fast TotW #60: In-process profiling: lessons learned

Fast TotW #62: Identifying and reducing memory bandwidth needs

Fast TotW #64: More Moore with better API design

Fast TotW #70: Defining and measuring optimization success

Fast TotW #72: Optimizing optimization

Fast TotW #74: Avoid sweeping street lights under rugs

Fast TotW #75: How to microbenchmark

Fast TotW #79: Make at most one tradeoff at a time

Fast TotW #83: Reducing memory indirections

Fast TotW #87: Two-way doors

Fast TotW #88: Measurement methodology: Avoid the jelly beans trap

Fast TotW #90: How to estimate