Performance Tip of the Week #70: Defining and measuring optimization success

Originally posted as Fast TotW #70 on June 26, 2023

By Chris Kennelly

Updated 2023-10-20


Engineers optimizing performance are ultimately trying to maximize the things Google does (serve search queries, videos on YouTube, etc.) and minimize the things Google buys (CPUs, RAM, disks, etc.). In this episode, we discuss how to choose metrics that help us decide which optimizations to work on, make effective decisions, and measure the outcomes of projects.

Economic value

Things like search queries and video playbacks on YouTube represent economic value. Useful work is happening to deliver an experience to an end-user which then translates into Google revenue.

This isn’t simply a matter of adding up all of the requests that happen, though: some requests are more important than others:

  • A search query from a human is valuable. A search query from a monitoring prober is important insofar as it helps track the service’s reliability, but search exists for humans, not for monitoring.
  • A YouTube session might continue to play for several minutes after someone has left their computer. The automatic playback pause after inactivity cuts off a flow of unimportant video plays.
Things we make          Things we buy
Searches                CPUs
Driving directions      RAM
Cat videos              TPUs
Email messages          Disks
Cloud compute VM time   Electricity

Maximizing value and minimizing costs are the outcomes we are ultimately after.

In the course of handling requests across servers, there’s work that happens independently of user requests:

  • Above, we touched on probing for monitoring.
  • Request hedging helps give users consistent latency by sending duplicative requests: one of the requests, and the work done for it, might prove unnecessary. Taken to an exaggerated extreme, sending multiple hedged requests unconditionally is wasteful: users don’t get meaningfully better latency and production uses resources profligately. Choosing the right time to hedge lets us strike a balance (see the sketch after this list).
  • Deadline propagation allows child requests to backends to time out early, rather than continue to perform work believing the requestor will still use it. This is an important technique for minimizing unproductive work, especially during overload.
  • Load testing a machine or cluster to the point of overload produces valuable knowledge for capacity planning and validates that load-shedding mechanisms work. The requests do not directly answer user requests, but without this knowledge, the user experience would eventually suffer as regressions go unnoticed.
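
To make the hedging tradeoff concrete, here is a minimal sketch of latency-triggered hedging. Fetch is a hypothetical backend call, stubbed out so the sketch is self-contained; a real system would also cancel the losing request and budget how often hedges are sent.

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>

// Hypothetical backend call, stubbed so the sketch compiles and runs.
std::string Fetch(const std::string& request) {
  std::this_thread::sleep_for(std::chrono::milliseconds(5));
  return "response for " + request;
}

// Send a duplicate request only if the primary is slower than hedge_delay
// (e.g., the service's p95 latency), then take whichever answer arrives
// first. Hedging unconditionally would double the work for little benefit.
std::string HedgedFetch(const std::string& request,
                        std::chrono::milliseconds hedge_delay) {
  auto primary = std::async(std::launch::async, Fetch, request);
  if (primary.wait_for(hedge_delay) == std::future_status::ready) {
    return primary.get();  // Fast path: no hedge sent, no extra cost.
  }
  auto hedge = std::async(std::launch::async, Fetch, request);
  while (true) {  // Poll both requests; return the first to complete.
    if (primary.wait_for(std::chrono::milliseconds(1)) ==
        std::future_status::ready) {
      return primary.get();
    }
    if (hedge.wait_for(std::chrono::milliseconds(0)) ==
        std::future_status::ready) {
      return hedge.get();
    }
  }
}
```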

Many of these activities are effectively a cost of doing business: It’s hard to run a reliable, low-latency service distributed across numerous servers without them. Nonetheless, it’s important not to let the tail wag the dog here: Monitoring and load testing are not ends unto themselves.

Selecting good proxies

Along the way, proxy metrics can help tell us that our optimization idea is on the right track, or help explain the causal connection to top-level metrics. We want to align with the business problem without boiling the ocean every time we make a small change and want to assess it. Measurement has its own return on investment, too, and the benefits of additional precision are quickly outweighed by the cost of obtaining it.

Goodhart’s Law reminds us that “when a measure becomes a target, it ceases to be a good measure.” Escaping this completely is challenging, but analysis is easier the more closely aligned the metric is with what we’re optimizing.

One common leap that we might need to make is connecting an abstract or harder-to-measure end goal, such as business value or user satisfaction, to more easily measured metrics. Totals of RPC requests made, or their latency, are common proxies for this.

When working on optimizations, we also need to optimize the feedback loop for developing them. For example:

  • Microbenchmarks provide a much shorter feedback loop than measuring application performance in macrobenchmarks or production. We can build, run, and observe a microbenchmark in a matter of minutes. They have limitations, but as long as we’re mindful of those pitfalls, they can give us directional information much more quickly.
  • PMU counters can tell us rich details about bottlenecks in code such as cache misses or branch mispredictions. Seeing changes in these metrics can be a proxy that helps us understand the effect. For example, inserting software prefetches can reduce cache miss events, but in a memory bandwidth-bound program, the prefetches can go no faster than the “speed of light” of the memory bus. Similarly, eliminating a stall far off the critical path might have little bearing on the application’s actual performance.
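
As an illustration of both points, here is a minimal sketch of a software prefetch in a gather loop; kDistance is a made-up tuning constant, not a universally right value. PMU counters would show cache-miss events dropping, but if the program is already memory-bandwidth-bound, wall time may barely move.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums values gathered through an index array. The prefetch requests the
// cache line for a future element while we work on earlier ones.
int64_t GatherSum(const std::vector<int64_t>& values,
                  const std::vector<size_t>& indices) {
  constexpr size_t kDistance = 16;  // Assumed prefetch distance; tune it.
  int64_t sum = 0;
  for (size_t i = 0; i < indices.size(); ++i) {
    if (i + kDistance < indices.size()) {
      __builtin_prefetch(&values[indices[i + kDistance]]);
    }
    sum += values[indices[i]];
  }
  return sum;
}
```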

If we expect to improve an application’s performance, we might start by taking a large function in the CPU profile and finding an optimization for it–say by changing to a more cache-friendly data structure. The reduction in cache misses and improvement in microbenchmark times help validate that the optimization is working according to our mental model of the code being optimized. We avoid false positives by doing so: Changing the font color of a webpage to green and running a loadtest might give a positive result purely by chance, not due to a causal effect.
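
For instance, a microbenchmark along the following lines, written against the open-source Google Benchmark library with stand-in data structures, can produce before/after numbers for such a change in minutes:

```cpp
#include <algorithm>
#include <map>
#include <vector>
#include "benchmark/benchmark.h"

// Baseline: lookups in a node-based std::map chase pointers and miss cache.
static void BM_MapLookup(benchmark::State& state) {
  std::map<int, int> m;
  for (int i = 0; i < (1 << 16); ++i) m[i] = i;
  int key = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(m.find(key));
    key = (key + 7919) & ((1 << 16) - 1);  // Pseudo-random probe sequence.
  }
}
BENCHMARK(BM_MapLookup);

// Candidate: binary search in a contiguous, cache-friendlier sorted vector.
static void BM_SortedVectorLookup(benchmark::State& state) {
  std::vector<int> v(1 << 16);
  for (int i = 0; i < (1 << 16); ++i) v[i] = i;
  int key = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(std::lower_bound(v.begin(), v.end(), key));
    key = (key + 7919) & ((1 << 16) - 1);
  }
}
BENCHMARK(BM_SortedVectorLookup);

BENCHMARK_MAIN();
```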

These development-time proxies help us build an understanding of bottlenecks and performance improvements. We still need to measure the impact on application- and service-level performance, but the proxies help us home in more quickly on an optimization worth deploying.

Aligning with success

The metrics we pick need to align with success. If a metric tells us to do the opposite of a seemingly good thing, the metric is potentially flawed.

For example, Flume tracks useful work done by workers in terms of records processed. While no default is ever perfect–this elides that records can vary in workload, shape, and size–records align better with other optimizations than bytes processed do. With static and dynamic field tracking, Flume can read a subset of fields from every record. The total number of records is unchanged, but the total number of bytes goes down, and total pipeline costs fall as well. Comparing the two possible metrics:

  • With records as the denominator, we see a drop in resources required per record processed, i.e., the compute per record decreases with the optimization.
  • With bytes as the denominator, absolute resource usage does go down, but fixed overheads stay the same. As a proportion of the total pipeline’s cost, fixed costs grow and are amortized over fewer bytes, i.e., the compute per byte may actually increase with the optimization. This might make an optimization–skipping unnecessary data–look like an apparent regression.
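
A hypothetical illustration with made-up numbers: suppose a pipeline processes 1 million records by reading 100 GB at a cost of 1,000 CPU-hours, and field tracking cuts the bytes read to 20 GB and the cost to 800 CPU-hours. Per record, the cost falls from 3.6 to 2.88 CPU-seconds, correctly registering the win; per byte, it rises from 10 to 40 CPU-hours per GB, an apparent 4x regression.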

In other cases, we want to normalize for the amount of work done. For example, a service where the size of the response depends on the request will likely want to track the number of returned bytes as its work done. A video transcoding service similarly needs to count the number of pixels: Videos at higher resolution require more processing time than lower resolution ones, and this roughly normalizes the higher difficulty per frame.
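
As a sketch with hypothetical names, normalizing by pixels rather than by requests keeps a 4K video from looking like a regression next to a 480p one:

```cpp
#include <cstdint>

// Hypothetical normalization: CPU cost per pixel processed, so that videos
// of different resolutions are compared on an equal footing.
double CpuSecondsPerPixel(double cpu_seconds, int64_t frames,
                          int64_t width, int64_t height) {
  const double pixels = static_cast<double>(frames) * width * height;
  return cpu_seconds / pixels;
}
```

A 4K transcode that takes 8x the CPU of a 480p one is not a regression under this metric: a 3840x2160 frame has roughly 20x the pixels of an 854x480 frame.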


Instructions per clock (IPC) is a challenging metric to use as a proxy. While executing more instructions in less time is generally good–for example, because we reduced cache misses–there are times when high IPC is misleading. A thread spinning on a contended SpinLock is not making forward progress, despite having high IPC. Conversely, vector instructions, rep movsb, or differences in microarchitecture let us accomplish more useful work in a single instruction, so lower IPC can mean better performance. Optimizing for IPC or instruction counts can lead us to prefer behaviors that are worse for application performance.
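
A minimal sketch of the SpinLock case: the loop below retires a load and a well-predicted branch on every iteration, so its IPC looks healthy while no forward progress is made.

```cpp
#include <atomic>

// Busy-wait until the lock is released. Each iteration executes entirely
// out of L1 cache with near-perfect branch prediction: high IPC, no work.
void SpinUntilUnlocked(const std::atomic<bool>& locked) {
  while (locked.load(std::memory_order_acquire)) {
    // Intentionally empty: the instructions retired here inflate IPC.
  }
}
```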

Similarly, relative time in libraries is a useful yardstick for finding places to optimize and tracking their costs. In the long run, though, optimizations that speed up the whole program might come at the “cost” of this proxy.

Distributed systems complicate metrics as well. We still want to make sure the high-level goal–business value per TCO (total cost of ownership)–is maximized, but we may be able to put more precise calipers on subsystems to detect improvements (or regressions). Task-level metrics such as throughput and latency may not translate to the overall system’s performance: Optimizing the latency of a task off the critical path may have little impact on the overall, user-facing latency of a service. On the other hand, throughput improvements–doing more work with less CPU time and memory per request–let us use fewer resources to handle the same workload.


The goal of optimization projects is to maximize value–serving search queries, videos on YouTube, etc.–and minimize costs–CPUs, RAM, disks, etc. Just as we can carefully use microbenchmarks to predict macrobenchmarks, and macrobenchmarks to predict production, we can select proxy metrics to measure success. This lets us align with business goals, especially harder-to-measure ones, while still having an effective yardstick for day-to-day work.
