Performance Tip of the Week #75: How to microbenchmark

Originally posted as Fast TotW #75 on September 29, 2023

By Chris Kennelly

Updated 2023-11-10

Quicklink: abseil.io/fast/75

Imagine having access to two oracles: one gives 90%-accurate predictions and responds in 10 minutes; the other gives 99%-accurate predictions and responds in a month. With access to these secret seers, which one best helps us make good decisions?

“Production is ultimately what matters” might seem to obviate the need for microbenchmarks altogether, but they remain a useful tool in the performance optimization toolbox. While reduced fidelity doesn’t sound desirable at face value, it buys us something important in exchange: speed.

In this episode, we discuss techniques for using microbenchmarks as an important tool for decision-making. Full documentation for the microbenchmark framework is available at https://github.com/google/benchmark/blob/main/docs/user_guide.md.

Preliminaries

Eli Bendersky has written about a number of pitfalls that can come up when using microbenchmarks. In writing our own series of microbenchmarks to evaluate a varint parser, we’ll apply several of these techniques to get useful answers.

template<bool DoBMI = false>
const uint8_t* ParseVarint32(const uint8_t* buf, uint32_t* value) {
  uint8_t byte;
  int count = 0;
  uint32_t tmp = 0;
  if (DoBMI) {
#ifdef __BMI2__
#ifndef ABSL_IS_LITTLE_ENDIAN
#error "Only supports little endian architectures"
#endif
    // Read 8 bytes. This requires that our input buffer be appropriately
    // padded to permit an overread.
    uint64_t r;
    memcpy(&r, buf, sizeof(r));

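    // Example: 150 is encoded as the two bytes 0x96 0x01.  Loaded
    // little-endian, ~r & 0x8080808080 has its lowest set bit at bit 15, so
    // countr_zero returns 15 and length = (15 >> 3) + 1 = 2.  _pext_u64 then
    // packs the low 7 bits of each byte together, (0x01 << 7) | 0x16 = 150,
    // and _bzhi_u64 clears everything above bit 7 * length - 1.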
    int length = (absl::countr_zero(~r & uint64_t{0x8080808080}) >> 3) + 1;
    buf = buf + length;
    *value = _bzhi_u64(_pext_u64(r, uint64_t{0x7F7F7F7F7F}), 7 * length);
    return buf;
#endif  // __BMI2__
  }
  do {
    byte = buf[0];
    tmp |= (byte & 0x7F) << (7 * count);
    ++buf;
    ++count;
  } while (byte & 0x80);
  *value = tmp;
  return buf;
}

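The benchmarks below build their inputs with a WriteVarint32 helper whose implementation isn’t shown in this tip. As a point of reference, a minimal sketch of the standard encoding (seven payload bits per byte, with the high bit marking that another byte follows) might look like:

// Hypothetical encoder, used here only to produce benchmark inputs.
uint8_t* WriteVarint32(uint8_t* buf, uint32_t value) {
  // Emit the low seven bits of the remaining value, setting the continuation
  // bit while more significant bits remain.
  while (value >= 0x80) {
    *buf++ = static_cast<uint8_t>(value) | 0x80;
    value >>= 7;
  }
  *buf++ = static_cast<uint8_t>(value);
  return buf;
}
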
void BM_OneshotVarintParsing(benchmark::State& state) {
  absl::BitGen rng;
  const uint32_t target = absl::Uniform(rng, 0u, std::numeric_limits<uint32_t>::max());
  uint8_t buf[10];
  const uint8_t* end_buf = WriteVarint32(buf, target);
  ABSL_CHECK(end_buf <= buf + sizeof(buf));

  for (auto s : state) {
    uint32_t read;
    ParseVarint32(buf, &read);
  }
}

BENCHMARK(BM_OneshotVarintParsing);

Running this benchmark shows us spending 0 nanoseconds parsing. This result is implausible and should give cause for concern.

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_OneshotVarintParsing      0.000 ns        0.000 ns   1000000000

Annotating read with benchmark::DoNotOptimize produces more realistic results: it prevents the compiler from recognizing that the value is unused and eliminating the loop body. We also add an ABSL_DCHECK_EQ to verify that the result is correct. While this is a no-op in optimized builds, it is a lightweight way to double-check the calculation and make sure our microbenchmark is functionally correct.
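
With these two changes, the loop body in BM_OneshotVarintParsing becomes:

  for (auto s : state) {
    uint32_t read;
    ParseVarint32(buf, &read);
    benchmark::DoNotOptimize(read);
    ABSL_DCHECK_EQ(read, target);
  }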

Nevertheless, the benchmark varies from run to run:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_OneshotVarintParsing       5.31 ns         5.29 ns    131846276
...
BM_OneshotVarintParsing       4.37 ns         4.36 ns    132147859

Since we are targeting a variable-length integer chosen via a random number generator, we aren’t doing a consistent amount of work. The vast majority of integers in the range [0, UINT_MAX] will be encoded as several bytes, but sometimes we’ll choose a small integer and skew our benchmark results. Benchmarking each varint size separately lets us differentiate these cases.

void BM_OneshotVarintParsingSingleWidth(benchmark::State& state) {
  const int shift = state.range(0);
  absl::BitGen rng;
  const uint64_t lo = 1u << (7 * shift);
  const uint64_t hi = std::min<uint64_t>((lo << 7) - 1u, UINT_MAX);
  const uint32_t target = absl::Uniform<uint32_t>(rng, lo, hi);

  uint8_t buf[10];
  const uint8_t* end_buf = WriteVarint32(buf, target);
  ABSL_CHECK(end_buf <= buf + sizeof(buf));

  for (auto s : state) {
    uint32_t read;
    ParseVarint32(buf, &read);
    benchmark::DoNotOptimize(read);
    ABSL_DCHECK_EQ(read, target);
  }
}

BENCHMARK(BM_OneshotVarintParsingSingleWidth)->DenseRange(0, 4);

BM_OneshotVarintParsingSingleWidth/0              0.955 ns        0.955 ns    749769481
BM_OneshotVarintParsingSingleWidth/1               1.61 ns         1.61 ns    434548050
BM_OneshotVarintParsingSingleWidth/2               2.12 ns         2.12 ns    333038803
BM_OneshotVarintParsingSingleWidth/3               2.80 ns         2.80 ns    250300587
BM_OneshotVarintParsingSingleWidth/4               3.27 ns         3.27 ns    215507384

Gauging latency

Real-world parsing does not parse a single varint at a time like the previous benchmark does, and that can have profound effects on what we measure. In that benchmark, since we’re iterating over the same buffer and there’s no dependency on the previous value, the processor is very likely able to start the next iteration speculatively and will never need to undo that work. This turns what we’d like to measure as a chain of dependencies into a measurement of how many parallel pipelines the processor has (or the duration of the dependency chain divided by the number of parallel executions).

To make the benchmark more realistic, we can instead parse from a larger buffer of varints serialized end-on-end:

void BM_VarintParsingEndOnEnd(benchmark::State& state) {
  const int shift = state.range(0);
  absl::BitGen rng;
  const uint64_t lo = 1u << (7 * shift);
  const uint64_t hi = std::min<uint64_t>((lo << 7) - 1u, UINT_MAX);
  const uint32_t target = absl::Uniform<uint32_t>(rng, lo, hi);

  const int kNumVarints = 10000;
  std::vector<uint8_t> v(5 * kNumVarints, 0);
  uint8_t* buf = &v[0];
  for (int i = 0; i < kNumVarints; ++i) {
    buf = WriteVarint32(buf, target);
  }
  const uint8_t* const end_buf = buf;

  // Use KeepRunningBatch to get an accurate, per-item timing.
  while (state.KeepRunningBatch(kNumVarints)) {
    // Start from the top, so we don't just run a single meaningful
    // iteration and stop.
    const uint8_t* read_buf = &v[0];
    while (read_buf < end_buf) {
      uint32_t read;
      read_buf = ParseVarint32(read_buf, &read);
      benchmark::DoNotOptimize(read);
      ABSL_DCHECK_EQ(read, target);
    }
  }

  state.SetItemsProcessed(state.iterations());
  state.SetBytesProcessed(state.iterations() * (end_buf - &v[0]) / kNumVarints);
}

We use KeepRunningBatch since each iteration of the outer loop processes kNumVarints elements at a time, rather than a single element.
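
As with the single-width benchmark, the registration sweeps the same range of widths:

BENCHMARK(BM_VarintParsingEndOnEnd)->DenseRange(0, 4);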

BM_VarintParsingEndOnEnd/0                        0.933 ns        0.933 ns    749030000 bytes_per_second=1022.04Mi/s items_per_second=1.07169G/s
BM_VarintParsingEndOnEnd/1                         1.64 ns         1.64 ns    432610000 bytes_per_second=1.13844Gi/s items_per_second=611.197M/s
BM_VarintParsingEndOnEnd/2                         2.16 ns         2.16 ns    331480000 bytes_per_second=1.2962Gi/s items_per_second=463.928M/s
BM_VarintParsingEndOnEnd/3                         3.11 ns         3.11 ns    249780000 bytes_per_second=1.19976Gi/s items_per_second=322.058M/s
BM_VarintParsingEndOnEnd/4                         3.27 ns         3.27 ns    216500000 bytes_per_second=1.42589Gi/s items_per_second=306.208M/s

This approach lets us gauge how quickly we figure out where the next varint starts, since there is now a dependency between the end of one parse and the start of the next. Note, however, that because the varint sizes are invariant, the processor can likely still speculate into the next iteration thanks to successful branch prediction, so this is by no means perfect.

This technique can also be employed for other algorithms, such as hashing, hashtable lookups, and SIMD problems. It can also be helpful for understanding when the processor’s execution backend is port-bound or otherwise limited by a long critical path.
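
As an illustration of the dependency-chaining idea applied to hashtable lookups, the sketch below (hypothetical code, using absl::flat_hash_map) feeds each lookup’s value into the choice of the next key, so iterations cannot overlap:

// A sketch of a dependency-chained lookup benchmark: the value returned by one
// lookup selects the next key, so the processor cannot overlap iterations.
void BM_ChainedHashLookups(benchmark::State& state) {
  constexpr size_t kNumKeys = 1 << 12;
  absl::BitGen rng;
  std::vector<uint64_t> keys(kNumKeys);
  absl::flat_hash_map<uint64_t, uint64_t> table;
  for (uint64_t& key : keys) {
    key = absl::Uniform<uint64_t>(rng);
    // Each key maps to a random value; the value picks the next key's index.
    table[key] = absl::Uniform<uint64_t>(rng);
  }

  uint64_t dep = 0;
  for (auto s : state) {
    dep = table.find(keys[dep % kNumKeys])->second;
    benchmark::DoNotOptimize(dep);
  }
}

BENCHMARK(BM_ChainedHashLookups);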

Understanding the speed of light

Before going too far in optimizing the ParseVarint32 routine, we might want to identify the “speed of light” of the hardware. For varint parsing, this is approximately memcpy, since we are reading serialized bytes and writing the (mostly expanded) bytes into the parsed data structure. While this is not quite the operation we’re interested in, it’s readily available off the shelf without much effort.

void BM_Memcpy(benchmark::State& state) {
  const int shift = state.range(0);
  const int kNumVarints = 10000;
  // Simulate the skeleton of BM_VarintParsingEndOnEnd by memcpy'ing, to
  // understand the approximate "speed of light" of the operation.
  std::vector<uint8_t> src(5 * kNumVarints, 0);
  std::vector<uint8_t> dst(5 * kNumVarints, 0);

  while (state.KeepRunningBatch(kNumVarints)) {
    memcpy(&dst[0], &src[0], (shift + 1) * kNumVarints);
    benchmark::DoNotOptimize(&src[0]);
    benchmark::DoNotOptimize(&dst[0]);
  }

  state.SetBytesProcessed(state.iterations() * (shift + 1));
}

BENCHMARK(BM_Memcpy)->DenseRange(0, 4);

BM_Memcpy/0                               0.027 ns        0.027 ns   1000000000 bytes_per_second=34.9201Gi/s
BM_Memcpy/1                               0.053 ns        0.053 ns   1000000000 bytes_per_second=35.3181Gi/s
BM_Memcpy/2                               0.080 ns        0.079 ns   1000000000 bytes_per_second=35.1449Gi/s
BM_Memcpy/3                               0.106 ns        0.106 ns   1000000000 bytes_per_second=35.308Gi/s
BM_Memcpy/4                               0.134 ns        0.133 ns   1000000000 bytes_per_second=34.9284Gi/s

It’s also important to do a rough double-check of these numbers. 0.027 ns per “operation” should seem implausibly fast and set off alarm bells, but on a 2 GHz processor we might be issuing a 32-byte load and store every clock cycle. Each “varint” here is a single byte in the fastest case, and 0.027 ns per byte is roughly 37 bytes per nanosecond, or about 18 bytes per clock, which aligns with the processor’s abilities.

Increasing representativeness

The benchmarks so far parse a single varint width at a time. Now that we’ve shifted to the end-on-end buffer style, we can instead choose a random distribution of varint sizes. Varints shine when values are small, so we can roughly assume the distribution is skewed toward small values. Production profiles largely validate this assumption.

Using a synthetic distribution (like absl::LogUniform) can help us simulate this. It addresses one pitfall of microbenchmarks by using more realistic data and avoiding “training” the branch predictor too well.

void BM_VarintParsingSkewed(benchmark::State& state) {
  absl::BitGen rng;

  const int kNumVarints = 10000;
  std::vector<uint8_t> v(5 * kNumVarints, 0);
  uint8_t* buf = &v[0];
  for (int i = 0; i < kNumVarints; ++i) {
    // Initialize using a small-skewed value to create a synthetic distribution
    // of non-uniform varint sizes that roughly resemble real world ones.
    buf = WriteVarint32(
        buf, absl::LogUniform(rng, 0u, std::numeric_limits<uint32_t>::max()));
  }
  const uint8_t* const end_buf = buf;

  // Use KeepRunningBatch to get an accurate, per-item estimate of cost,
  // invariant from kNumVarints value.
  while (state.KeepRunningBatch(kNumVarints)) {
    // Start from the top every time, otherwise we run the loop once and stop,
    // so per-item speeds are infinitely fast.
    const uint8_t* read_buf = &v[0];
    while (read_buf < end_buf) {
      uint32_t read;
      read_buf = ParseVarint32(read_buf, &read);
      benchmark::DoNotOptimize(read);
    }
  }

  state.SetItemsProcessed(state.iterations());
  state.SetBytesProcessed(state.iterations() * (end_buf - &v[0]) / kNumVarints);
}
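
This benchmark takes no width parameter, so we register it without a range:

BENCHMARK(BM_VarintParsingSkewed);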

We can use the benchmark library’s built-in performance counter feature to capture per-benchmark statistics around branch mispredictions (--benchmark_perf_counters=BR_MISP_RETIRED:ALL_BRANCHES). This confirms that we processed enough varints (kNumVarints = 10000) to overwhelm the branch predictor’s ability to memorize the input pattern.

BM_VarintParsingEndOnEnd/0                        0.933 ns        0.933 ns    749030000 BR_MISP_RETIRED:ALL_BRANCHES=100.032u bytes_per_second=1022.04Mi/s items_per_second=1.07169G/s
BM_VarintParsingEndOnEnd/1                         1.64 ns         1.64 ns    432610000 BR_MISP_RETIRED:ALL_BRANCHES=100.999u bytes_per_second=1.13844Gi/s items_per_second=611.197M/s
BM_VarintParsingEndOnEnd/2                         2.16 ns         2.16 ns    331480000 BR_MISP_RETIRED:ALL_BRANCHES=8.9521m bytes_per_second=1.2962Gi/s items_per_second=463.928M/s
BM_VarintParsingEndOnEnd/3                         3.11 ns         3.11 ns    249780000 BR_MISP_RETIRED:ALL_BRANCHES=328.881u bytes_per_second=1.19976Gi/s items_per_second=322.058M/s
BM_VarintParsingEndOnEnd/4                         3.27 ns         3.27 ns    216500000 BR_MISP_RETIRED:ALL_BRANCHES=112.993u bytes_per_second=1.42589Gi/s items_per_second=306.208M/s
BM_VarintParsingSkewed                             5.66 ns         5.66 ns    123440000 BR_MISP_RETIRED:ALL_BRANCHES=0.762816 bytes_per_second=465.11Mi/s items_per_second=176.563M/s

We can back out the approximate item size from these throughputs as roughly 465/176, or ~2.6 bytes/varint on average. Because we incur more branch mispredictions, performance is worse than any of the end-on-end inputs.

Intentionally leaning into non-representativeness

With our current microbenchmark, the working set uniformly fits inside the processor’s L1 cache. The processor used for the benchmark results in this tip has a 32KB L1 data cache, while we’re parsing a series of 10,000 varints which, at the ~2.6 bytes/varint average above, occupy roughly 26KB, so the data always stays resident in cache.

The benchmark framework includes helpers (like benchmark::CPUInfo) to access information about the hardware we’re running on. This lets us sweep through the parameter space of cache sizes, from a single varint (at most 5 bytes) to many multiples of the last-level cache’s size. We’ll also evaluate the branchless optimization using BMI2 shown in ParseVarint32 above.

template<bool DoBMI>
void BM_VarintParsingSkewedCaches(benchmark::State& state) {
  const int kNumVarints = state.range(0);
  absl::BitGen rng;

  // Pad input vector to permit an overread with the DoBMI implementation.
  std::vector<uint8_t> v(5 * kNumVarints + (DoBMI ? 8 : 0), 0);
  uint8_t* buf = &v[0];
  for (int i = 0; i < kNumVarints; ++i) {
    buf = WriteVarint32(buf, absl::LogUniform(rng, 1u, std::numeric_limits<uint32_t>::max()));
  }
  const uint8_t* const end_buf = buf;

  // Use KeepRunningBatch to get an accurate, per-item estimate of cost,
  // invariant from kNumVarints value.
  const uint8_t* read_buf;
  while (state.KeepRunningBatch(kNumVarints)) {
    // Start from the top every time, otherwise we run the loop once and stop,
    // so per-item speeds are infinitely fast.
    read_buf = &v[0];
    while (read_buf < end_buf) {
      uint32_t read;
      read_buf = ParseVarint32<DoBMI>(read_buf, &read);
      benchmark::DoNotOptimize(read);
    }
  }

  state.SetItemsProcessed(state.iterations());
  state.SetBytesProcessed(state.iterations() * (end_buf - &v[0]) / kNumVarints);
}

BENCHMARK_TEMPLATE(BM_VarintParsingSkewedCaches, false)
    ->Range(1, 4 * benchmark::CPUInfo::Get().caches.back().size);
BENCHMARK_TEMPLATE(BM_VarintParsingSkewedCaches, true)
    ->Range(1, 4 * benchmark::CPUInfo::Get().caches.back().size);

BM_VarintParsingSkewedCaches<false>/1               3.35 ns         3.34 ns    457036460 BR_MISP_RETIRED=6.03891u CYCLES=11.0191 INSTRUCTIONS=50 bytes_per_second=1.11618Gi/s items_per_second=299.622M/s
BM_VarintParsingSkewedCaches<false>/8               2.13 ns         2.12 ns    268979016 BR_MISP_RETIRED=9.79259u CYCLES=7.01295 INSTRUCTIONS=31.75 bytes_per_second=1.20814Gi/s items_per_second=471.718M/s
BM_VarintParsingSkewedCaches<false>/64              2.05 ns         2.05 ns    321649792 BR_MISP_RETIRED=22.9504u CYCLES=6.73112 INSTRUCTIONS=30.5938 bytes_per_second=1.23747Gi/s items_per_second=488.727M/s
BM_VarintParsingSkewedCaches<false>/512             2.05 ns         2.05 ns    313550336 BR_MISP_RETIRED=1.31374m CYCLES=6.75802 INSTRUCTIONS=30.6074 bytes_per_second=1.24299Gi/s items_per_second=488.45M/s
BM_VarintParsingSkewedCaches<false>/4096            6.37 ns         6.36 ns    106352640 BR_MISP_RETIRED=0.652793 CYCLES=20.8777 INSTRUCTIONS=31.0552 bytes_per_second=417.736Mi/s items_per_second=157.355M/s
BM_VarintParsingSkewedCaches<false>/32768           7.97 ns         7.95 ns     89980928 BR_MISP_RETIRED=0.902907 CYCLES=26.1535 INSTRUCTIONS=30.8263 bytes_per_second=330.981Mi/s items_per_second=125.816M/s
BM_VarintParsingSkewedCaches<false>/262144          8.66 ns         8.63 ns     83361792 BR_MISP_RETIRED=0.886097 CYCLES=25.8864 INSTRUCTIONS=30.7883 bytes_per_second=304.299Mi/s items_per_second=115.85M/s
BM_VarintParsingSkewedCaches<false>/2097152         7.88 ns         7.86 ns     90177536 BR_MISP_RETIRED=0.886966 CYCLES=25.9592 INSTRUCTIONS=30.8173 bytes_per_second=334.775Mi/s items_per_second=127.303M/s
BM_VarintParsingSkewedCaches<false>/16777216        7.91 ns         7.89 ns    100663296 BR_MISP_RETIRED=0.885126 CYCLES=26.0696 INSTRUCTIONS=30.8228 bytes_per_second=333.415Mi/s items_per_second=126.759M/s
BM_VarintParsingSkewedCaches<false>/134217728       8.52 ns         8.50 ns    134217728 BR_MISP_RETIRED=0.884259 CYCLES=26.0602 INSTRUCTIONS=30.8234 bytes_per_second=309.466Mi/s items_per_second=117.651M/s
BM_VarintParsingSkewedCaches<false>/161480704       8.06 ns         8.04 ns    161480704 BR_MISP_RETIRED=0.885522 CYCLES=26.1663 INSTRUCTIONS=30.8221 bytes_per_second=327.339Mi/s items_per_second=124.452M/s

We see inflections in performance as the working set grows beyond each cache tier. The benchmark framework can collect performance counters alongside the timing, and they show that we asymptotically incur about 0.88 branch mispredictions per varint parsed.

Since our branches determine how quickly we move through the stream, we can see an improvement by switching to a branchless implementation using BMI2. Performance is a bit worse for some of the small cases (where we’re already in L1 and the branch predictor can memorize the pattern), but it asymptotes to a faster per-varint speed when working with large buffers. The PMU events show very few (<5e-6 per varint) branch mispredictions.

BM_VarintParsingSkewedCaches<true>/1                1.68 ns         1.68 ns    414588127 BR_MISP_RETIRED=31.3564n CYCLES=5.2722 INSTRUCTIONS=23 bytes_per_second=1.66702Gi/s items_per_second=596.649M/s
BM_VarintParsingSkewedCaches<true>/8                2.32 ns         2.32 ns    301227112 BR_MISP_RETIRED=4.35884u CYCLES=7.43719 INSTRUCTIONS=16 bytes_per_second=1.30667Gi/s items_per_second=431.699M/s
BM_VarintParsingSkewedCaches<true>/64               3.83 ns         3.82 ns    184510144 BR_MISP_RETIRED=11.3761u CYCLES=12.7085 INSTRUCTIONS=15.125 bytes_per_second=667.478Mi/s items_per_second=261.951M/s
BM_VarintParsingSkewedCaches<true>/512              4.24 ns         4.23 ns    174177280 BR_MISP_RETIRED=1.954m CYCLES=13.4285 INSTRUCTIONS=15.0156 bytes_per_second=613.728Mi/s items_per_second=236.535M/s
BM_VarintParsingSkewedCaches<true>/4096             4.04 ns         4.03 ns    173842432 BR_MISP_RETIRED=245.849u CYCLES=13.4066 INSTRUCTIONS=15.002 bytes_per_second=646.705Mi/s items_per_second=248.375M/s
BM_VarintParsingSkewedCaches<true>/32768            4.06 ns         4.05 ns    173244416 BR_MISP_RETIRED=44.8961u CYCLES=13.4526 INSTRUCTIONS=15.0003 bytes_per_second=646.09Mi/s items_per_second=246.625M/s
BM_VarintParsingSkewedCaches<true>/262144           4.40 ns         4.39 ns    173277184 BR_MISP_RETIRED=5.36135u CYCLES=13.4397 INSTRUCTIONS=15 bytes_per_second=598.464Mi/s items_per_second=227.561M/s
BM_VarintParsingSkewedCaches<true>/2097152          4.07 ns         4.06 ns    142606336 BR_MISP_RETIRED=1.66192u CYCLES=13.4542 INSTRUCTIONS=15 bytes_per_second=647.626Mi/s items_per_second=246.183M/s
BM_VarintParsingSkewedCaches<true>/16777216         4.07 ns         4.06 ns    184549376 BR_MISP_RETIRED=671.907n CYCLES=13.4525 INSTRUCTIONS=15 bytes_per_second=647.685Mi/s items_per_second=246.264M/s
BM_VarintParsingSkewedCaches<true>/134217728        4.05 ns         4.04 ns    134217728 BR_MISP_RETIRED=1.09524u CYCLES=13.4582 INSTRUCTIONS=15 bytes_per_second=651.6Mi/s items_per_second=247.741M/s
BM_VarintParsingSkewedCaches<true>/161480704        4.01 ns         4.00 ns    161480704 BR_MISP_RETIRED=829.821n CYCLES=13.4539 INSTRUCTIONS=15 bytes_per_second=658.024Mi/s items_per_second=250.184M/s

This pattern is also used in fleetbench. For example, it has a “hot” SwissMap benchmark that performs its operations (lookups, etc.) against a single instance and a “cold” SwissMap benchmark where we randomly pick a table on each iteration. The latter makes it more likely that we’ll incur a cache miss. Hardware counters and the benchmark framework’s support for collecting them can help diagnose and explain performance differences.
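
As a rough sketch of the “cold” pattern (hypothetical code, not fleetbench’s implementation), we can build many tables and probe a different one on each iteration:

// A sketch of a "cold" lookup benchmark: with many tables and a different one
// probed each iteration, the probed table is unlikely to still be in cache.
void BM_ColdLookups(benchmark::State& state) {
  constexpr size_t kNumTables = 1 << 10;
  constexpr size_t kKeysPerTable = 256;
  absl::BitGen rng;
  std::vector<absl::flat_hash_set<uint64_t>> tables(kNumTables);
  for (auto& table : tables) {
    for (size_t i = 0; i < kKeysPerTable; ++i) {
      table.insert(absl::Uniform<uint64_t>(rng));
    }
  }
  // Pre-generate the probe sequence so RNG costs stay out of the timed loop.
  std::vector<std::pair<size_t, uint64_t>> probes(1 << 16);
  for (auto& probe : probes) {
    probe = {absl::Uniform<size_t>(rng, 0u, kNumTables),
             absl::Uniform<uint64_t>(rng)};
  }

  size_t i = 0;
  for (auto s : state) {
    const auto& [index, key] = probes[i++ % probes.size()];
    bool found = tables[index].contains(key);
    benchmark::DoNotOptimize(found);
  }
}

BENCHMARK(BM_ColdLookups);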

Even though these extremes are not representative, they can help us frame how to tackle the problem. We might find an optimization for one extreme and then work to hold performance constant for the others.

Summary

In conclusion, microbenchmarks are a useful technique for quickly exploring the performance landscape. Unlike macrobenchmarks, which might take hours to produce reliable signals, microbenchmarks can give us information in a minute or two, helping us decide what to try next while working on a problem.

While microbenchmarks are no substitute for progressively more realistic benchmarks and, ultimately, evaluation in production, their imperfect information lets us iterate more quickly toward a better solution before performing the final production evaluation and landing the change.

