Performance Tip of the Week #53: Precise C++ benchmark measurements with Hardware Performance Counters

Originally posted as Fast TotW #53 on October 14, 2021

By Mircea Trofin

Updated 2023-03-02

Quicklink: abseil.io/fast/53

Use performance benchmarks as the first line of defense in detecting costly regressions, and as a way to guide performance improvement work. Getting to the root cause of a change in performance can be time-consuming and full of “false leads”, because on modern architectures program execution is influenced by many factors.

In this episode, we present a productivity tool that helps lower the cost of performance investigations by leveraging Hardware Performance Counters to surface low-level architectural metrics. The tool is available on GitHub for C++ benchmarks running on Linux.

What are Hardware Performance Counters?

Hardware Performance Counters are a CPU feature that lets you request precise counts of events such as instructions retired, load or store instructions retired, clock cycles, cache misses, and branches taken or mispredicted. See https://perf.wiki.kernel.org/index.php/Tutorial for more information.

With performance counters you get less noisy measurements than with time-based ones. CPU timer-based measurements are noisier, even on isolated machines, because:

  • performance counter measurements can be isolated to the benchmark process and thus do not account for context-switching time when the process is preempted, which the CPU timer otherwise measures;
  • counter increments can be further restricted to instructions executed in user mode, which reduces noise due to context switching even more (see the sketch after this list);
  • specific counters (e.g. instruction-counting ones) inherently produce measurements that are almost noise-free (variations under 0.01%), because the value of such counters is independent of systemic sources of noise like frequency throttling;
  • finally, depending on the setup of the benchmarking machine, time-based measurements suffer from noise introduced by thermal effects, hyperthreading, or shared resource (e.g. memory bus) access. Some counters also suffer from this kind of noise, but others, like instructions retired counters, do not.
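
To make the process and user-mode isolation concrete, here is a minimal sketch of how such a counter can be configured with the Linux perf_event_open interface. This illustrates the underlying mechanism only; it is not the benchmark library's implementation, and the choice of event (instructions retired) and the measured loop are arbitrary.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>
#include <cstring>

// Thin wrapper: glibc does not provide a perf_event_open() function.
static long PerfEventOpen(perf_event_attr* attr, pid_t pid, int cpu,
                          int group_fd, unsigned long flags) {
  return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_HARDWARE;
  attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // instructions retired
  attr.disabled = 1;        // start stopped; enabled explicitly below
  attr.exclude_kernel = 1;  // count user-mode instructions only
  attr.exclude_hv = 1;

  // pid=0, cpu=-1: follow this process on whichever CPU it runs on, so time
  // spent preempted by other processes does not increment the counter.
  int fd = static_cast<int>(
      PerfEventOpen(&attr, /*pid=*/0, /*cpu=*/-1, /*group_fd=*/-1, /*flags=*/0));
  if (fd < 0) {
    std::perror("perf_event_open");
    return 1;
  }

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  volatile std::uint64_t sink = 0;
  for (int i = 0; i < 1000000; ++i) sink += i;  // code under measurement
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

  std::uint64_t count = 0;
  if (read(fd, &count, sizeof(count)) == sizeof(count)) {
    std::printf("user-mode instructions retired: %llu\n",
                static_cast<unsigned long long>(count));
  }
  close(fd);
  return 0;
}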

By selecting appropriate performance counters you can get nuanced insight into the execution of a benchmark. For instance, a regression detected with a CPU time measurement may be caused by subtle changes in executable layout, which increase branch mispredictions. This is generally not actionable and considered acceptable. Identifying that this is the case from time measurements alone is not very productive and does not scale over a large benchmark corpus. With performance counter-based measurements, it is immediately apparent from the variation in branch mispredictions combined with the lack of variation in instruction counts, and the detection is easily scriptable.
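
As an illustration of how such a check could be scripted, the sketch below classifies a regression from two sets of per-iteration measurements. The Measurement struct, the field selection, and the thresholds are assumptions invented for this example; a real tool would read the values from the JSON report described below and tune the thresholds to the noise levels of its benchmarks.

#include <cmath>
#include <cstdio>

// Per-iteration measurements for one benchmark, e.g. as read from the
// JSON report produced with --benchmark_perf_counters.
struct Measurement {
  double cpu_time_ns;
  double instructions;
  double branch_misses;
};

// Relative change of `b` versus baseline `a`.
static double RelativeChange(double a, double b) {
  return std::fabs(b - a) / a;
}

// If time moved but the instruction count is flat while branch mispredictions
// moved, the regression is likely a code-layout effect rather than a change
// in the work the benchmark performs. Thresholds are illustrative.
static const char* Classify(const Measurement& base, const Measurement& test) {
  if (RelativeChange(base.cpu_time_ns, test.cpu_time_ns) < 0.01)
    return "no significant time change";
  if (RelativeChange(base.instructions, test.instructions) < 0.001 &&
      RelativeChange(base.branch_misses, test.branch_misses) > 0.05)
    return "likely executable-layout effect (branch mispredictions)";
  return "likely a real change in the executed work";
}

int main() {
  Measurement base{56750.1, 603772.8, 210.0};  // made-up baseline values
  Measurement test{59100.4, 603772.9, 305.0};  // made-up candidate values
  std::printf("%s\n", Classify(base, test));
  return 0;
}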

How to gather performance counter data

The Google Benchmark project simplifies the process of writing a benchmark; examples of its use can be found at https://github.com/google/benchmark.
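
For context, a minimal benchmark written against the library looks roughly like the sketch below (an illustrative example, not one of the benchmarks discussed in this episode):

#include <string>

#include <benchmark/benchmark.h>

// Measures the cost of constructing a short std::string once per iteration.
static void BM_StringCreation(benchmark::State& state) {
  for (auto _ : state) {
    std::string s("hello");
    benchmark::DoNotOptimize(s);
  }
}
BENCHMARK(BM_StringCreation);

BENCHMARK_MAIN();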

The benchmark harness supports performance counters by letting the user specify up to 3 counters, as a comma-separated list, via the --benchmark_perf_counters flag; these are measured alongside the time measurement. Just like the time measurement, each counter value is captured right before the benchmarked code is run and right after, and the difference is reported to the user as a per-iteration value (similar to the time measurement). The report is only available in the JSON output (--benchmark_format=json).

Simple example

Note: counter names are hardware vendor- and version-specific. The example here assumes Intel Skylake. Check how these names map to other generations of Intel CPUs, other vendors (e.g. AMD), or other architectures (e.g. ARM); also refer to perfmon2, which we use for counter name resolution, and/or perf list.

Build a benchmark executable - for example, let’s use “swissmap” from fleetbench:

bazel build -c opt //fleetbench/swissmap:swissmap_benchmark

Run the benchmark; let’s ask for instructions, cycles, and loads:

bazel-bin/fleetbench/swissmap/swissmap_benchmark --benchmark_filter=all --benchmark_perf_counters=INSTRUCTIONS,CYCLES,MEM_UOPS_RETIRED:ALL_LOADS --benchmark_format=json

The JSON output is organized as follows:

{
  "benchmarks": [
    {
      "CYCLES": 183357.29158733244,
      "INSTRUCTIONS": 603772.790402176,
      "MEM_UOPS_RETIRED:ALL_LOADS": 121.63652613172722,
      "bytes_per_second": 1804401396.9863303,
      "cpu_time_ns": 56750.122323683696,
      "iterations": 25735,
      "label": "html",
      "name": "BM_UDataBuffer/0",
      "real_time_ns": 56900.075383718671
    },
    {
      "CYCLES": 183782.38686892079,
      "INSTRUCTIONS": 603772.91427358345,
      "MEM_UOPS_RETIRED:ALL_LOADS": 119.59456538520921,
      "bytes_per_second": 1825391775.0291102,
      "cpu_time_ns": 56097.546510730273,
      "iterations": 25908,
      "label": "html",
      "name": "BM_UDataBuffer/0",
      "real_time_ns": 56245.906090782773
    },
    [...]
  ]
}

For each run of the benchmark, the requested counters and their values are captured in a JSON dictionary. The values are per-iteration (note the iterations field). In the first run the benchmark completed 25735 iterations, so the total value for CYCLES measured by the benchmark was 183357.29158733244 * 25735.
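
For example, recovering the aggregate count for the first run above is a single multiplication; the sketch below just copies the values from the JSON report:

#include <cstdio>

int main() {
  // Per-iteration value and iteration count for the first run in the report.
  const double cycles_per_iteration = 183357.29158733244;
  const long long iterations = 25735;
  const double total_cycles = cycles_per_iteration * iterations;  // ~4.72e9
  std::printf("total CYCLES for the run: %.0f\n", total_cycles);
  return 0;
}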

Summary

Use the --benchmark_perf_counters flag in https://github.com/google/benchmark benchmarks to quickly drill into the root cause of a performance regression, or to guide performance optimization work.

