Originally posted as Fast TotW #53 on October 14, 2021
Updated 2023-03-02
Quicklink: abseil.io/fast/53
Use performance benchmarks as the first line of defense in detecting costly regressions, and as a way to guide performance improvement work. Getting to the root cause of a change in performance can be time-consuming and full of “false leads”, because on modern architectures program execution is influenced by many factors.
In this episode, we present a productivity tool that helps lower the cost of performance investigations by leveraging Hardware Performance Counters to surface low-level architectural metrics. The tool is available on GitHub for C++ benchmarks running on Linux.
Hardware Performance Counters are a hardware feature that lets you request precise counts of events such as instructions retired, load or store instructions retired, clock cycles, cache misses, and branches taken or mispredicted. See https://perf.wiki.kernel.org/index.php/Tutorial for more information.
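For illustration, here is a minimal sketch of what requesting one such counter looks like when done by hand through Linux's perf_event_open(2) interface, following the pattern from its man page (error handling elided). The benchmark support described below wraps this plumbing for you.

```c++
// Count retired user-space instructions for a region of code via
// perf_event_open(2). Minimal sketch; error handling elided.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // instructions retired
  attr.disabled = 1;        // start disabled; enable explicitly below
  attr.exclude_kernel = 1;  // count user-space events only

  // Measure the calling thread, on any CPU.
  const int fd = static_cast<int>(
      syscall(SYS_perf_event_open, &attr, /*pid=*/0, /*cpu=*/-1,
              /*group_fd=*/-1, /*flags=*/0));

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

  // ... code under measurement goes here ...

  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  int64_t count = 0;
  read(fd, &count, sizeof(count));
  printf("instructions retired: %lld\n", static_cast<long long>(count));
  close(fd);
  return 0;
}
```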
With performance counters you get less noisy measurements than with time-based ones. CPU timer-based measurements are noisier, even on isolated machines, because program execution is perturbed by factors such as operating system scheduling, interrupts, and CPU frequency scaling.
By selecting appropriate performance counters you can get nuanced insight into the execution of a benchmark. For instance, a regression that shows up in a CPU-time measurement may be caused by subtle changes in executable layout that increase branch mispredictions. This is generally not actionable and is considered acceptable. Establishing that this is the cause when looking only at time measurements is unproductive and does not scale across a large benchmark corpus. With performance counter-based measurements it is immediately apparent from the variations in branch mispredictions and instruction counts, and the detection is easily scriptable.
The Google Benchmark project simplifies the process of writing a benchmark. A minimal example of its use is sketched below.
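This is a lightly adapted version of the canonical example from the google/benchmark README; BM_StringCreation is just an illustrative name.

```c++
#include <benchmark/benchmark.h>

#include <string>

// Measures the cost of constructing an empty std::string per iteration.
static void BM_StringCreation(benchmark::State& state) {
  for (auto _ : state) {
    std::string empty_string;
    benchmark::DoNotOptimize(empty_string);
  }
}
// Register the function as a benchmark.
BENCHMARK(BM_StringCreation);

BENCHMARK_MAIN();
```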
The benchmark harness supports performance counters by letting the user specify up to three counters, as a comma-separated list, via the --benchmark_perf_counters flag; they are measured alongside the time measurement.
Just like the time measurement, each counter value is captured right before the benchmarked code is run, and right after. The difference is reported to the user as a per-iteration value (similar to the time measurement). The report is only available in the JSON output (--benchmark_format=json).
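Conceptually, the reported per-iteration value is derived as in the sketch below; this is illustrative only, and the names are hypothetical rather than the benchmark library's actual internals.

```c++
#include <cstdint>

// Illustrative only: how a per-iteration counter value is derived from
// readings taken right before and right after the benchmarked code runs.
// These names are hypothetical, not the benchmark library's internals.
double PerIterationValue(uint64_t counter_before, uint64_t counter_after,
                         uint64_t iterations) {
  return static_cast<double>(counter_after - counter_before) /
         static_cast<double>(iterations);
}
```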
Note: counter names are hardware vendor and version specific. The example here assumes Intel Skylake. Check how this maps to other versions of Intel CPUs, other vendors (e.g. AMD), or other architectures (e.g. ARM); also refer to perfmon2, which we use for counter name resolution, and/or perf list.
Build a benchmark executable. For example, let’s use “swissmap” from fleetbench:
bazel build -c opt //fleetbench/swissmap:swissmap_benchmark
Run the benchmark; let’s ask for instructions, cycles, and loads:
bazel-bin/fleetbench/swissmap/swissmap_benchmark --benchmark_filter=all --benchmark_perf_counters=INSTRUCTIONS,CYCLES,MEM_UOPS_RETIRED:ALL_LOADS --benchmark_format=json
The JSON output is organized as follows:
{ "benchmarks": [ { "CYCLES": 183357.29158733244, "INSTRUCTIONS": 603772.790402176, "MEM_UOPS_RETIRED:ALL_LOADS": 121.63652613172722, "bytes_per_second": 1804401396.9863303, "cpu_time_ns": 56750.122323683696, "iterations": 25735, "label": "html", "name": "BM_UDataBuffer/0", "real_time_ns": 56900.075383718671 }, { "CYCLES": 183782.38686892079, "INSTRUCTIONS": 603772.91427358345, "MEM_UOPS_RETIRED:ALL_LOADS": 119.59456538520921, "bytes_per_second": 1825391775.0291102, "cpu_time_ns": 56097.546510730273, "iterations": 25908, "label": "html", "name": "BM_UDataBuffer/0", "real_time_ns": 56245.906090782773 }, [...] }
For each run of the benchmark, the requested counters and their values are captured in a JSON dictionary. The values are per-iteration (note the iterations field). In the first run the benchmark completed 25735 iterations, so the total value for CYCLES measured by the benchmark was 183357.29158733244 * 25735, or roughly 4.7 billion cycles.
Use the --benchmark_perf_counters flag in https://github.com/google/benchmark benchmarks to quickly drill into the root cause of a performance regression, or to guide performance optimization work.