Originally posted as Fast TotW #64 on October 21, 2022
Updated 2023-10-10
Quicklink: abseil.io/fast/64
Optimizing library implementations only carries us so far in making software more efficient. In this episode, we discuss the importance of good APIs and the right abstractions for finding optimization opportunities. As we can make the hardware–especially with the end of Moore’s Law–and software run only so fast, the right abstractions give us continued optimization opportunities.
We can simplify an implementation down to return 42;
regardless of the input
to see blazing fast results, but an API that doesn’t work correctly isn’t doing
its job.
“Subtle” and “clever” code has costs for both maintainers and users alike. Today’s tricky edge cases can be tomorrow’s headaches when we try to optimize an implementation. Threading the needle of preserving explicitly (or implicitly) promised quirks makes the optimization process slower and more fragile over time. Being able to iterate faster helps with exploring more of the design space to find the best minima.
At times, we may need to break abstraction boundaries or have complex preconditions to unlock the best possible performance. We need to document and test these sharp edges. Future debugging has an opportunity cost: When we spend time tracking down and fixing bugs, we are not developing new optimizations. We can use assertions for preconditions, especially in debug/sanitizer builds, to double-check contracts and enforce them. Testing robots never sleep, while humans are fallible. Randomized implementation behaviors provide a useful bulwark against Hyrum’s Law from creeping in to implicitly expand the contract of an interface.
Small, composable operations give users flexibility to express their intents more clearly. We can find optimizations by combining high-level but related concepts.
Consider memcpy
and a hypothetical memcpy_but_faster
API that we could
build. They both express the same intent, but presumably with
different tradeoffs around performance.
std::string
) and
another that needs a different one (absl::my_fast_string
). In order for
the two to interoperate, the interfaces will require expensive copies–a
single type would not require such conversions.While this hypothetical might seem far-fetched, this is precisely what happened
with the predecessor implementation to absl::popcount
. We had two
implementations, but the “better” one was ultimately outstripped by the “worse”
one because engineers optimized the one with the wider usage instead.
In terms of API design around intents, we can consider:
void* memcpy(void* dest, const void* src, size_t count);
crc32c_t absl::ComputeCrc32c(absl::string_view buf);
crc32c_t absl::MemcpyCrc32c(void* dest, const void* src, size_t count);
With the first two primitives, we can build a trivial, but non-optimal implementation for the third. Combining the concepts makes sense when it is a common operation where finer-grained operations might leave performance on the table. Knowing we are going to both copy and checksum the bytes allows us to read data once, rather than twice. We can decompose the implementation into its components, as well, if that ever became more efficient.
The extra output of the operation (the crc32c_t
) distinguishes it from just
being a memcpy
with different performance characteristics. We would recommend
using the combined operation when we need to both copy data and checksum it.
MemcpyCrc32c
isn’t a suitable replacement for calls to memcpy
without a need
for a checksum, which removes the cognitive cost of considering it solely for
performance reasons.
The explicit function calls can also help with understanding the purpose of the code when we are looking at profiles later. For example, we can compare protobufs for equality in two ways:
While reading a profile, we might see the individual calls to serialize and
memcmp
, but it is harder to ascertain the intended semantics later. We may be
tempted to optimize the discrete functions–the process of serializing and
subsequent the process of comparing the resulting string. Understanding the
high-level intent and data flow gives us opportunities to optimize further up
the stack to find the “Room at the Middle”, optimizing the direct comparison. At
a minimum, an optimized version could avoid holding the serialized versions in
memory.
There are situations where the benefits of duplicate APIs outweight the costs.
The Abseil hash containers (SwissMap) added new hashtable implementations to the code base, which at first glance, appear redundant with the ones in the C++ standard library. This apparent duplication allowed us to have a more efficient set of containers which match the standard library API, but adhere to a weaker set of constraints.
The Abseil hash containers provided weaker guarantees for iterator and pointer
stability, allowing them to improve performance by reducing data indirections.
It is difficult to implement std::unordered_map
’s guarantees without resorting
to a node-based implementation that requires data indirections and constrains
performance. Given std::unordered_map
’s widespread usage, it was not feasible
to relax these guarantees all at once.
The migration was a replacement path for the legacy containers, not an alternative. The superior performance characteristics meant that users could “just use SwissMap” without tedious benchmarking on a case-by-case basis. There’s little need for a user to revisit their decision to migrate to SwissMap with the passage of time. This meant that usage could be actively driven towards SwissMap: Two types would be a temporary (albeit long) state, rather than one where every individual usage had to be carefully selected.
Years after SwissMap’s development, there are far fewer–but non-zero–uses of
std::unordered_map
. Blocking the improvement on the complete cleanup means no
benefit would have accrued. We were able to migrate instance-by-instance,
realizing incremental benefits over time.
It’s important to avoid ascribing intent–even with expressive APIs–to use of a
previously predominant one. A use of std::map
might require keys to be
ordered, but the more likely explanation might be that it is older code in need
of updating.
Hyrum’s Law reminds us that observable behaviors will be relied upon, but sometimes our API design choices constrain our implementation details. These often arise from returning references to data or giving fine-grained control in APIs. This can help performance in the short-term, but care is required to make sure it allows long-term evolution to continue to improve performance over time.
Consider protocol buffers for a simple message.
message MyMessage {
optional string foo = 1;
repeated string bar = 2;
}
As of October 2023, the accessor .foo()
returns a const std::string&
. This
requires that we have an in-memory representation of a std::string
instance
that can be returned. This approach has two problems:
std::string
encodes a specific allocation strategy (std::allocator
). If
we change the allocation strategy, for example wrapping Arena
, we change
the type.const std::string&
constrains the implementation to that particular size of buffer.In contrast, by returning std::string_view
(or our
internal predecessor,
StringPiece
), we decouple callers from the internal representation. The API is
the same, independent of whether the string is constant data (backed by the
.rodata
section), allocated on the heap by a std::string
instance, or
allocated by an Arena
. We’ve abstracted away the implementation detail from
our user, giving us more optimization freedom.
Similarly, consider the allocation-aware APIs in protobuf, add_allocated_...
,
release_...
, and unsafe_arena_...
. Fine-grained control over when and where
allocations occur can offer significant performance benefits, but they also
constrain future implementations by creating sharp performance edges.
release_...
allows us to remove a submessage and return ownership to the
caller. Subobjects were heap allocated and the operation was fast–it’s hard
to beat swapping two pointers. When Protobuf Arenas became available,
release_...
created a new copy of the underlying message on the heap, so
it could release that. The API couldn’t convey that the returned pointer was
owned by the Arena, not caller, so making a full copy was required to keep
code working. As a result, code that calls release_...
may be O(1) or O(n)
based on non-local information (whether the source object was constructed on
an arena)!unsafe_arena_...
gives us the raw hooks we need to add or
remove fields from a message without making the copy mentioned above , with
“unsafe” in the name conveying the subtlety and gravitas of what we’re
doing. These APIs are tricky to use correctly, though, as today’s tested
combination of arena and heap ownership may change over time and assumptions
break. The APIs are also extremely fine-grained, but do not convey the
higher-level intent–transferring pointer ownership, “lending” a submessage
to another one, etc.Good performance should be available by default, not an optional feature. While feature flags and knobs can be useful for testing and initial rollout, we should strive to make the right choices for users, rather than requiring users adopt the improvement on a case-by-case basis.
Developing an optimization for an existing implementation can provide a larger return-on-investment by targeting widespread, current usage upfront. Adding a new API or optimization knob can be expedient, but without widespread usage and adoption, the benefit is far more limited.
Optimization of existing code can hit stumbling blocks around unnecessarily strong guarantees or APIs that constrain the implementation–and thus the optimization search space–too much to find improvements.