Originally posted as Fast TotW #9 on June 24, 2019
Updated 2023-10-10
Quicklink: abseil.io/fast/9
Optimizations don’t always age gracefully. Faster yesterday might mean slower today.
Benchmarks citing performance on Intel Pentium IIIs or AMD Opterons may have been meaningful several years ago, but optimization equilibria, originally chosen for long-unplugged platforms, may have shifted since. Let’s look at a couple of examples where well-intended optimizations ultimately hurt performance in the long run.
In 2008, Intel introduced the popcnt instruction to determine the number of set bits in a 32- or 64-bit integer. This is of interest for computing Hamming distances and a bunch of other things. Without the instruction, we can use a slightly more complex, longer sequence of shifts, bitwise ANDs, and adds to achieve the same result.
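For illustration, the portable fallback might look something like this (a sketch of the classic bit-twiddling approach, not the actual library code):

```c++
#include <cstdint>

// Sketch of a portable population count using only shifts, bitwise ANDs,
// and adds (the classic SWAR technique). Illustrative only.
inline int PopcountFallback(uint64_t x) {
  x = x - ((x >> 1) & 0x5555555555555555ULL);                            // per-pair counts
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);  // per-nibble counts
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // per-byte counts
  return static_cast<int>((x * 0x0101010101010101ULL) >> 56);            // sum of the bytes
}
```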
Google’s predecessor library to C++20’s bit manipulation functions offered two population count routines. CountOnes64 uses popcnt if the compiler knows it’s going to be available and a fallback otherwise. CountOnes64withPopcount unconditionally uses popcnt on x86_64 machines.
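A rough sketch of the shape of these two routines (simplified; not the actual library source):

```c++
#include <cstdint>

int PopcountFallback(uint64_t x);  // portable fallback, as sketched above

// Uses popcnt only when the compiler already knows it may emit it.
inline int CountOnes64(uint64_t x) {
#if defined(__POPCNT__)
  return __builtin_popcountll(x);  // compiler emits popcnt directly
#else
  return PopcountFallback(x);
#endif
}

// Unconditionally emits popcnt via inline assembly on x86_64.
inline int CountOnes64withPopcount(uint64_t x) {
  uint64_t result;
  asm("popcnt %1, %0" : "=r"(result) : "r"(x));
  return static_cast<int>(result);
}
```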
When this instruction first started rolling out, it made sense to test for the availability of the instruction at runtime before choosing which one to call. This would be faster for crunching data on the machines with the instruction, and we’d avoid SIGILL’ing on machines without it.
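The runtime dispatch in question looked roughly like this (a sketch; the GCC/Clang __builtin_cpu_supports intrinsic stands in for whatever CPU-detection mechanism the real code used):

```c++
#include <cstdint>

int CountOnes64(uint64_t x);              // fallback-capable version
int CountOnes64withPopcount(uint64_t x);  // requires popcnt

// Check once at runtime whether the CPU supports popcnt, then branch on
// the cached answer for every call.
inline int Popcount64Dispatch(uint64_t x) {
  static const bool have_popcnt = __builtin_cpu_supports("popcnt");
  return have_popcnt ? CountOnes64withPopcount(x) : CountOnes64(x);
}
```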
Years later, every x86_64 machine in the fleet supported this instruction, so CountOnes64 and CountOnes64withPopcount should be the same, right?
Unfortunately, we’re still paying for all of the runtime dispatch machinery to check availability of the instruction, even though the answer is always “yes.” We could make this a compile time constant, but this is akin to picking up rocks faster when we should have instead just left them on the ground. The branch cost might seem trivial, but there’s actually more to be concerned about here.
Prior to cleanups, the implementations weren’t the same. CountOnes64withPopcount used inline assembly to unconditionally emit the popcnt instruction. The inline assembly prevented the compiler from working around a false dependency bug in some processors. Once the compiler began emitting the popcnt instruction for __builtin_popcount, CountOnes64 was the unconditionally better implementation.
Ironically, code using runtime dispatch to select a “fast” implementation is (unintentionally) preferring an actually slower implementation (CountOnes64withPopcount) and paying for the privilege.
CHECK_EQ
In 2005, Google implemented its CHECK logging macros in terms of CheckOpString, a thin wrapper around a std::string*. This later underwent further optimizations, adding hints to the compiler optimizer that the comparison would likely be true.
As of early 2019, a simplified implementation of CHECK_EQ(a, b) for optimized builds, after preprocessing, looked like:
```c++
template <typename T1, typename T2>
std::string* MakeCheckOpString(const T1& a, const T2& b, const char*);

template <typename T1, typename T2>
std::string* Check_EQ_Impl(const T1& a, const T2& b, const char* error) {
  if (ABSL_PREDICT_TRUE(a == b))
    return nullptr;
  else
    return MakeCheckOpString(a, b, error);
}

struct CheckOpString {
  CheckOpString(std::string* str) : str_(str) {}
  operator bool() const { return ABSL_PREDICT_FALSE(str_ != nullptr); }
  std::string* str_;
};

#define CHECK_EQ(a, b) \
  while (CheckOpString _result = Check_EQ_Impl(a, b, "...error...")) \
    ...log failure...
```
When LLVM generated assembly for this code, it made two checks:

1. a == b: We predict that this is typically true.
2. str_ != nullptr: We predict that this is typically false.

…but the second check is redundant. Once we’ve determined that a == b, str_ is always nullptr. This is a missed compiler optimization, but the optimizer faces the challenge of layers of complexity and hand-tuning added to this code over a decade.
Removing CheckOpString completely removes the extra branch: we compare a and b, but the optimizer does not need to reason about CheckOpString. Working directly with std::string* for the comparison leads to better code.
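A sketch of what the simplified macro might look like after that cleanup (hypothetical; based on the description above, not the actual code):

```c++
// The macro binds the std::string* directly, so the only branch left is
// the (predicted-true) a == b comparison inside Check_EQ_Impl.
#define CHECK_EQ(a, b) \
  while (std::string* _result = Check_EQ_Impl(a, b, "...error...")) \
    ...log failure...
```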
Ironically, this optimization had already been applied to debug builds, added
in 2008.
When replacing a simple implementation of a function like ComputeFoo with a hand-tuned, machine-dependent one, consider keeping the simple implementation in a REFERENCE_ComputeFoo function. This makes it easy to write a unit test for the new implementation that ensures the two functions are equivalent; it makes it easier to write a microbenchmark; and it makes it easier to revert to the reference code when (not if) the machine-dependent implementation outlives its usefulness.
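For example (a hypothetical ComputeFoo; the names and the test are illustrative only):

```c++
#include <cstdint>

// Simple, obviously correct version kept around for testing and
// benchmarking.
inline int REFERENCE_ComputeFoo(uint64_t x) {
  int count = 0;
  for (; x != 0; x >>= 1) count += static_cast<int>(x & 1);
  return count;
}

// Hand-tuned, machine-dependent version.
inline int ComputeFoo(uint64_t x) {
  return __builtin_popcountll(x);
}

// In a unit test, check that the two implementations agree:
//   for (uint64_t x : {uint64_t{0}, uint64_t{1}, uint64_t{0xFF}, ~uint64_t{0}})
//     CHECK_EQ(ComputeFoo(x), REFERENCE_ComputeFoo(x));
```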