Originally posted as Fast TotW #53 on October 14, 2021
Updated 2023-09-04
Quicklink: abseil.io/fast/53
Use performance benchmarks as the first line of defense in detecting costly regressions, and as a way to guide performance improvement work. Getting to the root cause of a change in performance can be time consuming and full of “false leads”, because on modern architectures program execution is influenced by many factors.
In this episode, we present a productivity tool that helps lower the cost of performance investigations by leveraging Hardware Performance Counters to surface low-level architectural metrics. The tool is available for C++ benchmarks running on Linux, on GitHub.
Hardware Performance Counters are a hardware feature where you can request precise counts of events such as: instructions retired, load or store instructions retired, clock cycles, cache misses, branches taken or mispredicted, etc. See https://perf.wiki.kernel.org/index.php/Tutorial for more information.
With performance counters you get less noisy measurements when compared to time-based ones. CPU timer-based measurements are noisier, even on isolated machines, because:
By selecting appropriate performance counters you can get nuanced insight into the execution of a benchmark. For instance, a measurement using CPU time that points to a regression may be caused by subtle changes in executable layout, which increases branch mispredictions. This is generally not actionable and considered acceptable. Identifying this is the case, when only looking at time measurements, is not very productive and not scalable over a large benchmark suite corpus. With performance counter-based measurements, it is immediately apparent by observing branch mispredict variations and instruction count variations, and the detection is easily scriptable.
The Google Benchmark project simplifies the process of writing a benchmark. An example of its use may be seen here
The benchmark harness support for performance counters consists of allowing the
user to specify up to 3 counters in a comma-separated list, via the
--benchmark_perf_counters
flag, to be measured alongside the time measurement.
Just like time measurement, each counter value is captured right before the
benchmarked code is run, and right after. The difference is reported to the user
as per-iteration values (similar to the time measurement).
Note: counter names are hardware vendor and version specific. The example
here assumes Intel Skylake. Check how this maps to other versions of Intel CPUs,
other vendors (e.g. AMD), or other architectures (e.g. ARM); also refer to
perfmon2 which we use for counter name
resolution, and/or perf list
.
Build a benchmark executable - for example, let’s use “swissmap” from fleetbench:
bazel build -c opt //fleetbench/swissmap:swissmap_benchmark
Run the benchmark; let’s ask for instructions, cycles, and loads:
bazel-bin/fleetbench/swissmap/swissmap_benchmark \ --benchmark_filter='.*Cold.*::absl::flat_hash_set.*64.*set_size:64.*density:0' \ --benchmark_perf_counters=INSTRUCTIONS,CYCLES,MEM_UOPS_RETIRED:ALL_LOADS
The output looks like:
Running ./swissmap_benchmark
Run on (8 X 4667.91 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 2.31, 2.08, 1.95
---------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------------------
BM_FindMiss_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 18.4 ns 18.4 ns 39048136 CYCLES=82.9019 INSTRUCTIONS=35.7284 MEM_UOPS_RETIRED:ALL_LOADS=6.05507
BM_FindHit_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 33.3 ns 33.3 ns 20600490 CYCLES=152.156 INSTRUCTIONS=55.0354 MEM_UOPS_RETIRED:ALL_LOADS=15.0034
BM_InsertHit_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 34.8 ns 34.8 ns 19004416 CYCLES=157.956 INSTRUCTIONS=59.0354 MEM_UOPS_RETIRED:ALL_LOADS=16.0013
BM_Iterate_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 33.5 ns 33.5 ns 25444389 CYCLES=152.431 INSTRUCTIONS=57.9225 MEM_UOPS_RETIRED:ALL_LOADS=13.3892
BM_InsertManyOrdered_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 54.9 ns 54.8 ns 14141958 CYCLES=242.373 INSTRUCTIONS=111.455 MEM_UOPS_RETIRED:ALL_LOADS=33.1838
BM_InsertManyUnordered_Cold<::absl::flat_hash_set, 64>/set_size:64/density:0 50.0 ns 50.0 ns 14234753 CYCLES=227.516 INSTRUCTIONS=111.415 MEM_UOPS_RETIRED:ALL_LOADS=33.1781
So we can see that BM_FindMiss_Cold
took approximately 83 cycles, 36
instructions, and 6 memory ops per iteration.
Number of counters: At most 32 events may be requested for simultaneous collection. Note however, that the number of hardware counters available is much lower (usually 4-8 on modern CPUs) – requesting more events than the hardware counters will cause multiplexing and decreased accuracy.
Visualization: There is no visualization available, so the user needs to rely on collecting JSON result files and summarizing the results.
Counting vs. Sampling: The framework only collects counters in “counting” mode – it answers how many cycles/cache misses/etc. happened, but not does not associate them to the code location causing the events. For that, you’d need a sampling profiler like Linux perf.
Use the --benchmark_perf_counters
flag in https://github.com/google/benchmark
benchmarks to quickly drill into the root cause of a performance regression, or
to guide performance optimization work.