Performance Tip of the Week #7: Optimizing for application productivity

Originally posted as Fast TotW #7 on June 6, 2019

Updated 2025-03-25

Overview

Google manages a vast fleet of servers to handle search queries, process records, and transcode cat videos. We don’t buy those servers to allocate memory in TCMalloc, put protocol buffers into other protocol buffers, or to handle branch mispredictions by our processors.

To make our fleet more efficient, we want to optimize for how productive our servers are, that is, how much useful work they accomplish per CPU-second, byte-second of RAM, disk operation, or by using hardware accelerators. While measuring a job’s resource consumption is easy, it’s harder to tell just how much useful work it’s accomplishing without help.

A task’s CPU usage going up could mean the task has suffered a performance regression or that it’s simply busier. Consider a plot of a service’s CPU usage against time, breaking down the total CPU usage of two versions of the binary. We cannot determine from casual inspection what caused the increase in CPU usage, whether this is from an increase in workload (serving more videos per unit time) or a decrease in efficiency (some added, needless protocol conversion per video).

To determine what is really happening, we need a productivity metric which captures the amount of real work completed. If we know the number of cat videos processed, we can easily determine whether we are getting more, or less, real work done per CPU-second (or byte-second of RAM, disk operation, or hardware accelerator time). These metrics are referred to as application productivity metrics, or APMs.

If we do not have productivity metrics, we are faced with entire classes of optimizations that are not well-represented by existing metrics:

Application speedups through core infrastructure changes:

As seen in our 2021 OSDI paper, “one classical approach is to increase the efficiency of an allocator to minimize the cycles spent in the allocator code. However, memory allocation decisions also impact overall application performance via data placement, offering opportunities to improve fleetwide productivity by completing more units of application work using fewer hardware resources.”

Experiments with TCMalloc’s hugepage-aware allocator, also known as Temeraire, have shown considerable speedups by improving application performance, not time spent in TCMalloc.

We spend more relative time in TCMalloc but greatly improve application performance. Focusing just on relative time in TCMalloc would produce an error in sign: We’d deprioritize (or even rollback) a strongly positive optimization.
Allocating more protocol buffer messages on Arenas speeds up not just the protocol buffer code itself (like message destructors), but also in the business logic that uses them. Enabling Arenas in major frameworks allowed them to process 15-30% more work per CPU, but protobuf destructor costs were a small fraction of this cost. The improvements in data locality could produce outsized benefits for the entire application.
New instruction sets: With successive hardware generations, vendors have added new instructions to their ISAs.

In future hardware generations, we expect to replace calls to memcpy with microcode-optimized rep movsb instructions that are faster than any handwritten assembly sequence we can come up with. We expect rep movsb to have low IPC (instructions per cycle): It’s a single instruction that replaces an entire copy loop of instructions!

Using these new instructions can be triggered by optimizing the source code or through compiler enhancements that improve vectorization.

Focusing on MIPS (millions of instructions per second) or IPC would cause us to prefer any implementation that executes a large number of instructions, even if those instructions take longer to execute to copy n bytes.

In fact, enabling the AVX, FMA, and BMI instruction sets by compiling with --march=haswell shows a MIPS regression while simultaneously improving application productivity. These instructions can do more work per instruction, however, replacing several low latency instructions may mean that average instruction latency increases. If we had 10 million instructions and 10 ms per query, we may now have 8 million instructions taking only 9 ms per query. QPS is up and MIPS would go down.

Since Google’s fleet runs on a wide variety of architectures, we cannot easily compare instructions across platforms and need to instead compare useful work accomplished by an application.
Compiler optimizations: Compiler optimizations can significantly affect the number of dynamically executed instructions. Techniques such as inlining reduce function preambles and enable further simplifying optimizations. Thus, fewer instructions translate to faster, more productive code.
Kernel optimizations: The kernel has many policies around hugepages, thread scheduling, and other system parameters. While changing these policies may make the kernel nominally more costly, for example, if we did more work to compact memory, the application benefits can easily outweigh them.

Availability of these metrics help infrastructure and efficiency teams guide their work more effectively.

Performance Guide

Fast Tips

Fast TotW #7: Optimizing for application productivity

Fast TotW #9: Optimizations past their prime

Fast TotW #21: Improving the efficiency of your regular expressions

Fast TotW #39: Beware microbenchmarks bearing gifts

Fast TotW #52: Configuration knobs considered harmful

Fast TotW #53: Precise C++ benchmark measurements with Hardware Performance Counters

Fast TotW #60: In-process profiling: lessons learned

Fast TotW #62: Identifying and reducing memory bandwidth needs

Fast TotW #64: More Moore with better API design

Fast TotW #70: Defining and measuring optimization success

Fast TotW #72: Optimizing optimization

Fast TotW #74: Avoid sweeping street lights under rugs

Fast TotW #75: How to microbenchmark

Fast TotW #79: Make at most one tradeoff at a time

Fast TotW #83: Reducing memory indirections

Fast TotW #87: Two-way doors

Fast TotW #88: Measurement methodology: Avoid the jelly beans trap

Fast TotW #90: How to estimate

Fast TotW #93: Robots never sleep

Performance Tip of the Week #7: Optimizing for application productivity

Overview