Originally posted as Fast TotW #9 on June 24, 2019
Updated 2023-10-10
Quicklink: abseil.io/fast/9
Optimizations don’t always age gracefully. Faster yesterday might mean slower today.
Benchmarks citing performance on Intel Pentium IIIs or AMD Opterons may have been meaningful several years ago, but optimization equilibria, originally chosen for long-unplugged platforms, may have shifted since. Let’s look at a couple of examples where well-intended optimizations ultimately hurt performance in the long run.
In 2008, Intel introduced the popcnt instruction to determine the number of set bits in a 32- or 64-bit integer. This is of interest for computing Hamming distances and a bunch of other things. Without the instruction, we can use a slightly more complex, longer sequence of shifts, bitwise ANDs, and adds to achieve the same result.
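For illustration, the portable fallback might look something like this (a sketch of the classic bit-twiddling approach, not the actual library code):

```c++
#include <cstdint>

// Sketch of a portable population count using only shifts, bitwise ANDs,
// and adds (the classic SWAR technique). Illustrative only.
inline int PopcountFallback(uint64_t x) {
  x = x - ((x >> 1) & 0x5555555555555555ULL);                            // per-pair counts
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);  // per-nibble counts
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // per-byte counts
  return static_cast<int>((x * 0x0101010101010101ULL) >> 56);            // sum of the bytes
}
```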
Google’s predecessor library to C++20’s bit manipulation functions offered two population count routines. CountOnes64 uses popcnt if the compiler knows it’s going to be available and a fallback otherwise. CountOnes64withPopcount unconditionally uses popcnt on x86_64 machines.
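A rough sketch of the shape of these two routines (simplified; not the actual library source):

```c++
#include <cstdint>

int PopcountFallback(uint64_t x);  // portable fallback, as sketched above

// Uses popcnt only when the compiler already knows it may emit it.
inline int CountOnes64(uint64_t x) {
#if defined(__POPCNT__)
  return __builtin_popcountll(x);  // compiler emits popcnt directly
#else
  return PopcountFallback(x);
#endif
}

// Unconditionally emits popcnt via inline assembly on x86_64.
inline int CountOnes64withPopcount(uint64_t x) {
  uint64_t result;
  asm("popcnt %1, %0" : "=r"(result) : "r"(x));
  return static_cast<int>(result);
}
```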
When this instruction first started rolling out, it made sense to test for the availability of the instruction at runtime before choosing which one to call. This would be faster for crunching data on the machines with the instruction, and we’d avoid SIGILL’ing on machines without it.
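The runtime dispatch in question looked roughly like this (a sketch; the GCC/Clang __builtin_cpu_supports intrinsic stands in for whatever CPU-detection mechanism the real code used):

```c++
#include <cstdint>

int CountOnes64(uint64_t x);              // fallback-capable version
int CountOnes64withPopcount(uint64_t x);  // requires popcnt

// Check once at runtime whether the CPU supports popcnt, then branch on
// the cached answer for every call.
inline int Popcount64Dispatch(uint64_t x) {
  static const bool have_popcnt = __builtin_cpu_supports("popcnt");
  return have_popcnt ? CountOnes64withPopcount(x) : CountOnes64(x);
}
```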
Years later, every x86_64 machine in the fleet supported this instruction, so CountOnes64 and CountOnes64withPopcount should be the same, right?
Unfortunately, we’re still paying for all of the runtime dispatch machinery to check availability of the instruction, even though the answer is always “yes.” We could make this a compile time constant, but this is akin to picking up rocks faster when we should have instead just left them on the ground. The branch cost might seem trivial, but there’s actually more to be concerned about here.
Prior to cleanups, the implementations weren’t the same. CountOnes64withPopcount used inline assembly to unconditionally emit the popcnt instruction. The inline assembly prevented the compiler from working around a false dependency bug in some processors. Once the compiler began emitting the popcnt instruction for __builtin_popcount, CountOnes64 was the unconditionally better implementation.
Ironically, code using runtime dispatch to select a “fast” implementation is (unintentionally) preferring an actually slower implementation (CountOnes64withPopcount) and paying for the privilege.
CHECK_EQ
In 2005, Google implemented its CHECK logging macros in terms of CheckOpString, a thin wrapper around a std::string*. This later underwent further optimizations, adding hints to the compiler optimizer that the comparison would likely be true.
As of early 2019, a simplified implementation of CHECK_EQ(a, b) for optimized builds, after preprocessing, looked like:
```c++
template <typename T1, typename T2>
std::string* MakeCheckOpString(const T1& a, const T2& b, const char*);

template <typename T1, typename T2>
std::string* Check_EQ_Impl(const T1& a, const T2& b, const char* error) {
  if (ABSL_PREDICT_TRUE(a == b))
    return nullptr;
  else
    return MakeCheckOpString(a, b, error);
}

struct CheckOpString {
  CheckOpString(std::string* str) : str_(str) {}
  operator bool() const { return ABSL_PREDICT_FALSE(str_ != nullptr); }
  std::string* str_;
};

#define CHECK_EQ(a, b) \
  while (CheckOpString _result = Check_EQ_Impl(a, b, "...error...")) \
    ...log failure...
```
When LLVM generated assembly for this code, it made two checks:

1. a == b: We predict that this is typically true.
2. str_ != nullptr: We predict that this is typically false.

…but the second check is redundant. Once we’ve determined that a == b, str_ is always nullptr. This is a missed compiler optimization, but the optimizer faces the challenge of layers of complexity and hand-tuning added to this code over a decade.
Removing CheckOpString completely removes the extra branch: we compare a and b, but the optimizer does not need to reason about CheckOpString. Working directly with std::string* for the comparison leads to better code.
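A sketch of what the simplified macro might look like after that cleanup (hypothetical; based on the description above, not the actual code):

```c++
// The macro binds the std::string* directly, so the only branch left is
// the (predicted-true) a == b comparison inside Check_EQ_Impl.
#define CHECK_EQ(a, b) \
  while (std::string* _result = Check_EQ_Impl(a, b, "...error...")) \
    ...log failure...
```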
Ironically, this optimization had already been applied to debug builds, added
in 2008.
When replacing a simple implementation of a function like ComputeFoo with a hand-tuned, machine-dependent one, consider keeping the simple implementation in a REFERENCE_ComputeFoo function. This makes it easy to write a unit test for the new implementation that ensures the two functions are equivalent; it makes it easier to write a microbenchmark; and it makes it easier to revert to the reference code when (not if) the machine-dependent implementation outlives its usefulness.
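For example (a hypothetical ComputeFoo; the names and the test are illustrative only):

```c++
#include <cstdint>

// Simple, obviously correct version kept around for testing and
// benchmarking.
inline int REFERENCE_ComputeFoo(uint64_t x) {
  int count = 0;
  for (; x != 0; x >>= 1) count += static_cast<int>(x & 1);
  return count;
}

// Hand-tuned, machine-dependent version.
inline int ComputeFoo(uint64_t x) {
  return __builtin_popcountll(x);
}

// In a unit test, check that the two implementations agree:
//   for (uint64_t x : {uint64_t{0}, uint64_t{1}, uint64_t{0xFF}, ~uint64_t{0}})
//     CHECK_EQ(ComputeFoo(x), REFERENCE_ComputeFoo(x));
```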