Performance Tip of the Week #64: More Moore with better API design

Originally posted as Fast TotW #64 on October 21, 2022

By Chris Kennelly

Updated 2023-10-10


Optimizing library implementations only carries us so far in making software more efficient. In this episode, we discuss the importance of good APIs and the right abstractions for finding optimization opportunities. Since we can make hardware and software run only so fast, especially with the end of Moore’s Law, the right abstractions give us continued optimization opportunities.

Correctness is paramount

We can simplify an implementation down to return 42;, regardless of the input, and see blazing-fast results, but an API that doesn’t work correctly isn’t doing its job.

“Subtle” and “clever” code has costs for both maintainers and users alike. Today’s tricky edge cases can be tomorrow’s headaches when we try to optimize an implementation. Threading the needle of preserving explicitly (or implicitly) promised quirks makes the optimization process slower and more fragile over time. Being able to iterate faster helps with exploring more of the design space to find the best optimum.

At times, we may need to break abstraction boundaries or have complex preconditions to unlock the best possible performance. We need to document and test these sharp edges. Future debugging has an opportunity cost: when we spend time tracking down and fixing bugs, we are not developing new optimizations. We can use assertions for preconditions, especially in debug/sanitizer builds, to double-check contracts and enforce them. Testing robots never sleep, while humans are fallible. Randomizing implementation behaviors provides a useful bulwark against Hyrum’s Law, preventing observable quirks from implicitly expanding the contract of an interface.
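
As a minimal sketch of this pattern (the function and its contract here are hypothetical), a documented precondition can be enforced with an assertion that fires in debug and sanitizer builds:

#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical copy routine with a documented no-overlap precondition.
// The assertion double-checks the contract, catching violations in
// debug/sanitizer builds before they become subtle corruption.
void CopyNonOverlapping(char* dst, const char* src, std::size_t n) {
  assert(reinterpret_cast<std::uintptr_t>(dst) + n <=
             reinterpret_cast<std::uintptr_t>(src) ||
         reinterpret_cast<std::uintptr_t>(src) + n <=
             reinterpret_cast<std::uintptr_t>(dst));
  for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
}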

Express intents

Small, composable operations give users flexibility to express their intents more clearly. We can find optimizations by combining high-level but related concepts.

Consider memcpy and a hypothetical memcpy_but_faster API that we could build. They both express the same intent, but presumably with different tradeoffs around performance.

  • Users need to think about which one to call. This adds a cognitive cost to every call site: they cannot quickly reach for precisely one to realize their desired functionality. When in doubt, typing fewer characters is faster. Over time, some of the choices made will prove incorrect, either because they were suboptimal from the start or because circumstances changed.
  • Bifurcating the API gives us two implementations, each with less usage. This lowers the leverage from developing optimizations for either one, unless maintainers can reliably cross-pollinate ideas between the two. Actively maintaining two implementations requires a larger investment, reducing the ROI of having two in the first place. Engineers may give the more commonly used implementation more care and attention, leading it to eventually outstrip the “faster” implementation.
  • Data structures and types can be especially costly to duplicate, due to the “impedance mismatch” of having a library that works solely with one type (say std::string) and another that needs a different one (absl::my_fast_string). For the two to interoperate, the interfaces will require expensive copies; a single type would not require such conversions.

While this hypothetical might seem far-fetched, this is precisely what happened with the predecessor implementation to absl::popcount. We had two implementations, but the “better” one was ultimately outstripped by the “worse” one because engineers optimized the one with the wider usage instead.

In terms of API design around intents, we can consider:

void* memcpy(void* dest, const void* src, size_t count);
crc32c_t absl::ComputeCrc32c(absl::string_view buf);
crc32c_t absl::MemcpyCrc32c(void* dest, const void* src, size_t count);

With the first two primitives, we can build a trivial, but non-optimal, implementation of the third. Combining the concepts makes sense for a common operation where the finer-grained operations might leave performance on the table. Knowing we are going to both copy and checksum the bytes allows us to read the data once, rather than twice. We can also decompose the implementation back into its components if that ever becomes more efficient.
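
For example, the trivial composition might look like the following sketch (assuming Abseil’s CRC32C library; the helper name TrivialMemcpyCrc32c is ours). It touches the bytes twice, once to copy and once to checksum, which is exactly what a fused implementation can avoid:

#include <cstddef>
#include <cstring>

#include "absl/crc/crc32c.h"
#include "absl/strings/string_view.h"

// Non-optimal: reads the data twice. A fused implementation can
// compute the checksum during the copy, reading the bytes only once.
absl::crc32c_t TrivialMemcpyCrc32c(void* dest, const void* src, size_t count) {
  std::memcpy(dest, src, count);
  return absl::ComputeCrc32c(
      absl::string_view(static_cast<const char*>(src), count));
}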

The extra output of the operation (the crc32c_t) distinguishes it from being just a memcpy with different performance characteristics. We would recommend the combined operation when we need to both copy data and checksum it. MemcpyCrc32c isn’t a suitable replacement for memcpy calls that have no need for a checksum, which removes the cognitive cost of considering it solely for performance reasons.

The explicit function calls can also help with understanding the purpose of the code when we are looking at profiles later. For example, we can compare protobufs for equality in two ways:

  • By serializing and comparing the bytes, which is unfortunately both common and unsound: serialized output is not canonical, so two equal messages can produce different bytes.
  • Field-by-field directly, which is faster.

While reading a profile, we might see the individual calls to serialize and memcmp, but it is harder to ascertain the intended semantics later. We may be tempted to optimize the discrete functions: the process of serializing and the subsequent process of comparing the resulting strings. Understanding the high-level intent and data flow gives us opportunities to optimize further up the stack to find the “Room at the Middle”, optimizing the direct comparison. At a minimum, an optimized version could avoid holding the serialized versions in memory.
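
As a sketch of the two approaches, using protobuf’s MessageDifferencer for the field-by-field comparison (MyMessage is a stand-in message type):

#include <string>

#include "google/protobuf/util/message_differencer.h"

// Unsound: serialization is not canonical, so equal messages may
// produce different bytes. In a profile, this shows up only as
// serialization plus a string comparison, hiding the intent.
bool EqualsViaBytes(const MyMessage& a, const MyMessage& b) {
  return a.SerializeAsString() == b.SerializeAsString();
}

// Field-by-field: correct, faster, and the call site names the intent.
bool EqualsViaFields(const MyMessage& a, const MyMessage& b) {
  return google::protobuf::util::MessageDifferencer::Equals(a, b);
}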

Avoid unnecessarily strong guarantees

There are situations where the benefits of duplicate APIs outweigh the costs.

The Abseil hash containers (SwissMap) added new hashtable implementations to the code base, which, at first glance, appear redundant with the ones in the C++ standard library. This apparent duplication allowed us to have a more efficient set of containers that match the standard library API but adhere to a weaker set of constraints.

The Abseil hash containers provided weaker guarantees for iterator and pointer stability, allowing them to improve performance by reducing data indirections. It is difficult to implement std::unordered_map’s guarantees without resorting to a node-based implementation that requires data indirections and constrains performance. Given std::unordered_map’s widespread usage, it was not feasible to relax these guarantees all at once.
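
For example, here is a sketch of the relaxed guarantee in action, using absl::flat_hash_map (the point at which rehashing occurs is illustrative):

#include <string>

#include "absl/container/flat_hash_map.h"

void PointerStabilityDiffers() {
  absl::flat_hash_map<std::string, int> m;
  m["key"] = 1;
  int* p = &m["key"];
  // Unlike std::unordered_map, a rehash may move elements, so `p` can
  // dangle afterwards. Giving up this guarantee lets elements be stored
  // inline, avoiding a data indirection per element.
  m.reserve(1024);
  // Dereferencing `p` here would be undefined behavior.
}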

The migration was a replacement path for the legacy containers, not an alternative. The superior performance characteristics meant that users could “just use SwissMap” without tedious benchmarking on a case-by-case basis. There’s little need for a user to revisit their decision to migrate to SwissMap with the passage of time. This meant that usage could be actively driven towards SwissMap: Two types would be a temporary (albeit long) state, rather than one where every individual usage had to be carefully selected.

Years after SwissMap’s development, there are far fewer, but non-zero, uses of std::unordered_map. Blocking the improvement on a complete cleanup would have meant no benefit accrued at all. Instead, we were able to migrate instance-by-instance, realizing incremental benefits over time.

It’s important to avoid ascribing intent, even with expressive APIs, to uses of a previously predominant one. A use of std::map might require keys to be ordered, but the more likely explanation is that it is older code in need of updating.

Avoid leaking implementation details

Hyrum’s Law reminds us that observable behaviors will be relied upon, but sometimes our API design choices constrain our implementation details. These constraints often arise from returning references to data or exposing fine-grained control in APIs. This can help performance in the short term, but care is required to make sure long-term evolution can continue to improve performance over time.

Consider the code generated for a simple protocol buffer message.

message MyMessage {
  optional string foo = 1;
  repeated string bar = 2;
}

As of October 2023, the accessor .foo() returns a const std::string&. This requires that we have an in-memory representation of a std::string instance that can be returned. This approach has two problems:

  • std::string encodes a specific allocation strategy (std::allocator). If we change the allocation strategy, for example to wrap an Arena, we change the type.
  • Individual fields can have a wide range of sizes (or likelihoods of presence) that we can determine from profiling, and so could benefit from small string buffers of varying sizes. Returning const std::string& constrains the implementation to that one particular buffer size.

In contrast, by returning std::string_view (or our internal predecessor, StringPiece), we decouple callers from the internal representation. The API is the same, independent of whether the string is constant data (backed by the .rodata section), allocated on the heap by a std::string instance, or allocated by an Arena. We’ve abstracted away the implementation detail from our user, giving us more optimization freedom.
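
As a sketch, with hand-written accessors standing in for the generated code:

#include <string>

#include "absl/strings/string_view.h"

class MyMessagePinned {
 public:
  // Pins the representation: a std::string instance must exist in
  // memory so that a reference to it can be returned.
  const std::string& foo() const { return foo_; }

 private:
  std::string foo_;
};

class MyMessageFlexible {
 public:
  // Callers see only a view; the backing storage can change freely.
  absl::string_view foo() const { return foo_; }

 private:
  std::string foo_;  // Could later be arena-backed or a sized inline buffer.
};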

Similarly, consider the allocation-aware APIs in protobuf, add_allocated_..., release_..., and unsafe_arena_.... Fine-grained control over when and where allocations occur can offer significant performance benefits, but they also constrain future implementations by creating sharp performance edges.

  • release_... allows us to remove a submessage and return ownership to the caller. Subobjects were heap allocated and the operation was fast: it’s hard to beat swapping two pointers. When Protobuf Arenas became available, release_... had to create a new copy of the underlying message on the heap, so that it had something to release. The API couldn’t convey that the returned pointer was owned by the Arena, not the caller, so making a full copy was required to keep code working. As a result, code that calls release_... may be O(1) or O(n) based on non-local information (whether the source object was constructed on an arena); see the sketch after this list.
  • With Arenas, unsafe_arena_... gives us the raw hooks we need to add or remove fields from a message without making the copy mentioned above, with “unsafe” in the name conveying the subtlety and gravitas of what we’re doing. These APIs are tricky to use correctly, though, as today’s tested combination of arena and heap ownership may change over time and break assumptions. The APIs are also extremely fine-grained, and they do not convey the higher-level intent: transferring pointer ownership, “lending” a submessage to another one, and so on.
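
A sketch of that non-local cost, using the string field from the MyMessage example above (the arena construction call is one of several equivalent protobuf APIs):

#include <memory>
#include <string>

#include "google/protobuf/arena.h"

void ReleaseCostDependsOnOwnership() {
  MyMessage heap_msg;
  heap_msg.set_foo("hello");
  // Heap-allocated message: release is cheap, roughly a pointer swap.
  std::unique_ptr<std::string> fast(heap_msg.release_foo());

  google::protobuf::Arena arena;
  MyMessage* arena_msg =
      google::protobuf::Arena::CreateMessage<MyMessage>(&arena);
  arena_msg->set_foo("hello");
  // Arena-allocated message: the arena owns the storage, so release
  // must copy the string to the heap first: O(n), not O(1).
  std::unique_ptr<std::string> slow(arena_msg->release_foo());
}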

Concluding remarks

Good performance should be available by default, not as an optional feature. While feature flags and knobs can be useful for testing and initial rollout, we should strive to make the right choices for users, rather than requiring them to adopt the improvement on a case-by-case basis.

Developing an optimization for an existing implementation can provide a larger return-on-investment by targeting widespread, current usage upfront. Adding a new API or optimization knob can be expedient, but without widespread usage and adoption, the benefit is far more limited.

Optimization of existing code can hit stumbling blocks around unnecessarily strong guarantees or APIs that constrain the implementation–and thus the optimization search space–too much to find improvements.
