Originally posted as Fast TotW #70 on June 26, 2023
Updated 2023-10-20
Quicklink: abseil.io/fast/70
Engineers optimizing performance are ultimately trying to maximize the things Google does (serve search queries, videos on YouTube, etc.) and minimize the things Google buys (CPUs, RAM, disks, etc.). In this episode, we discuss how to choose metrics that help us influence the optimizations we work on, make effective decisions, and measure the outcome of projects.
Things like search queries and video playbacks on YouTube represent economic value. Useful work is happening to deliver an experience to an end-user which then translates into Google revenue.
This isn’t simply a matter of adding up all of the requests that happen, though: some requests are more important than others.
Things we make        | Things we buy
----------------------|--------------
Searches              | CPUs
Driving directions    | RAM
Cat videos            | TPUs
Email messages        | Disks
Cloud compute VM time | Electricity
Starcraft             |
Maximizing value and minimizing costs are the outcomes we are ultimately after.
In the course of handling requests across servers, there’s also work that happens independently from user requests, such as monitoring and load testing.
Many of these activities are effectively a cost of doing business. It’s hard to run a reliable, low-latency service distributed across numerous servers without them. Nonetheless, it’s important not to let the tail wag the dog here: Monitoring and load testing are not ends unto themselves.
Along the way, there are proxy metrics that can tell us whether our optimization idea is on the right track, or help explain the causal connection to top-level metrics. We want to align with the business problem without boiling the ocean every time we make a small change and want to assess it. Measurement has its own return on investment too, and the benefits of additional precision are quickly outweighed by the cost of obtaining them.
Goodhart’s Law reminds us that “when a measure becomes a target, it ceases to be a good measure.” Escaping this completely is challenging, but analysis is easier the more closely aligned the metric is with what we’re optimizing.
One common leap that we might need to make is connecting an abstract or harder-to-measure end goal, such as business value or user satisfaction, to more easily measured metrics. Totals of RPC requests made, or their latency, are common proxies for this.
In working on optimizations, we also need to optimize our own feedback loop for developing them. For example:
If we expect to improve an application’s performance, we might start by taking a large function in the CPU profile and finding an optimization for it–say, by changing to a more cache-friendly data structure. The reduction in cache misses and the improvement in microbenchmark times help validate that the optimization is working according to our mental model of the code being optimized. We avoid false positives by doing so: Changing the font color of a webpage to green and running a load test might give a positive result purely by chance, not due to a causal effect.
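As a sketch of what that development-time check might look like, here is a hypothetical microbenchmark built with the open-source Google Benchmark library, comparing a pointer-chasing std::map against a flatter, cache-friendlier sorted vector. The containers, sizes, and access pattern are illustrative stand-ins, not taken from any real service.

```cpp
// Minimal sketch (hypothetical): comparing point lookups in a pointer-chasing
// std::map against a cache-friendlier sorted vector. Builds against the
// open-source Google Benchmark library (https://github.com/google/benchmark).
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

#include "benchmark/benchmark.h"

namespace {

constexpr int kNumKeys = 1 << 16;

void BM_MapLookup(benchmark::State& state) {
  std::map<int, int> m;
  for (int i = 0; i < kNumKeys; ++i) m[i] = i;
  int key = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(m.find(key));
    key = (key + 7919) % kNumKeys;  // Pseudo-random walk over the keys.
  }
}
BENCHMARK(BM_MapLookup);

void BM_SortedVectorLookup(benchmark::State& state) {
  std::vector<std::pair<int, int>> v;
  v.reserve(kNumKeys);
  for (int i = 0; i < kNumKeys; ++i) v.emplace_back(i, i);
  int key = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(
        std::lower_bound(v.begin(), v.end(), std::make_pair(key, 0)));
    key = (key + 7919) % kNumKeys;
  }
}
BENCHMARK(BM_SortedVectorLookup);

}  // namespace

BENCHMARK_MAIN();
```

Paired with a hardware counter for cache misses (from perf stat, say), a result like this supports the mental model before we invest in a production experiment.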
These development-time proxies help us build an understanding of bottlenecks and performance improvements. We still need to measure the impact on application- and service-level performance, but the proxies help us home in more quickly on an optimization that we want to deploy.
The metrics we pick need to align with success. If a metric tells us to do the opposite of a seemingly good thing, the metric is potentially flawed.
For example, Flume tracks useful work done by workers in terms of records processed. While no default is ever perfect–this elides that records can be of varying workloads, shapes, and sizes–it aligns better with other optimizations than bytes processed does. With static and dynamic field tracking, Flume can read a subset of fields from every record. The total number of records is unchanged, but the total number of bytes goes down and total pipeline costs fall as well. Comparing the two possible metrics: records per unit of cost correctly reports this as an efficiency win, while bytes per unit of cost makes the same change look like a loss of productivity.
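To make that comparison concrete, here is a tiny worked example with made-up numbers: the field-projection optimization leaves the record count unchanged while cutting both the bytes read and the CPU spent.

```cpp
// Hypothetical before/after numbers for a field-projection optimization:
// the pipeline still processes every record, but reads far fewer bytes
// and burns less CPU doing it.
#include <cstdio>

int main() {
  struct Run { double records, bytes, cpu_hours; };
  const Run before = {1e9, 5e12, 1000};  // 1B records, 5 TB read, 1000 CPU-hours.
  const Run after  = {1e9, 1e12, 400};   // Same records, 1 TB read, 400 CPU-hours.

  std::printf("records/CPU-hour: %.2e -> %.2e (efficiency win is visible)\n",
              before.records / before.cpu_hours, after.records / after.cpu_hours);
  std::printf("bytes/CPU-hour:   %.2e -> %.2e (looks like a regression)\n",
              before.bytes / before.cpu_hours, after.bytes / after.cpu_hours);
  return 0;
}
```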
In other cases, we want to normalize for the amount of work done. For example, a service where the size of the response depends on the request will likely want to track the number of bytes returned as its measure of work done. A video transcoding service similarly needs to count pixels: Videos at higher resolutions require more processing time than lower-resolution ones, and counting pixels roughly normalizes for the higher per-frame cost.
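As a minimal sketch of that normalization (with a hypothetical TranscodeJob struct and made-up job sizes), the per-pixel cost stays roughly flat between a 1080p job and a 4K job even though the absolute CPU cost differs by roughly 4x:

```cpp
// Minimal sketch (hypothetical): normalizing a transcoding service's "work
// done" by pixels rather than by request count, so that a shift toward 4K
// uploads doesn't look like a CPU regression.
#include <cstdint>
#include <cstdio>

struct TranscodeJob {
  int width = 0;
  int height = 0;
  int frames = 0;
  double cpu_seconds = 0;
};

int64_t PixelsProcessed(const TranscodeJob& job) {
  return static_cast<int64_t>(job.width) * job.height * job.frames;
}

int main() {
  // A 1080p job and a 4K job: the 4K job uses ~4x the CPU, but the per-pixel
  // cost is comparable, which is the signal we actually want to track.
  const TranscodeJob hd  = {1920, 1080, 1800, 90.0};
  const TranscodeJob uhd = {3840, 2160, 1800, 380.0};
  for (const auto& job : {hd, uhd}) {
    std::printf("pixels/CPU-second: %.3g\n",
                PixelsProcessed(job) / job.cpu_seconds);
  }
  return 0;
}
```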
Instructions per clock (IPC) is a challenging metric to use as a proxy. While executing more instructions in less time is generally good–for example, because we reduced cache misses by optimizing things–there are other times where it is worse. A thread spinning for a locked SpinLock is not making forward progress, despite having high IPC. Similarly, using vector instructions, `rep movsb`, or differences in microarchitecture allows us to accomplish more useful work in a single instruction. Optimizing for IPC or instruction counts can lead us to prefer behaviors that are worse for application performance.
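To make the spinning example concrete, here is a small self-contained sketch using a plain std::atomic busy-wait rather than an actual SpinLock: the loop retires cheap, cache-resident instructions at a high rate, so IPC looks excellent while no useful work gets done.

```cpp
// Minimal sketch (illustrative): a spin-wait loop retires instructions at a
// high rate (high IPC) while making no forward progress until the flag flips.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<bool> ready{false};

void Spinner() {
  uint64_t spins = 0;
  // Tight load loop: each iteration is a handful of cheap, cache-hitting
  // instructions, so a profiler would report great IPC here.
  while (!ready.load(std::memory_order_acquire)) {
    ++spins;
  }
  std::printf("wasted iterations before the flag flipped: %llu\n",
              static_cast<unsigned long long>(spins));
}

int main() {
  std::thread t(Spinner);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  ready.store(true, std::memory_order_release);
  t.join();
  return 0;
}
```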
Similarly, relative time in libraries is a useful yardstick for finding places to optimize and tracking their costs. In the long run, though, optimizations that speed up the whole program might come at the “cost” of this proxy.
Distributed systems complicate metrics as well. We still want to make sure the high-level goal–business value per TCO (total cost of ownership)–is maximized, but we may be able to put more precise calipers on subsystems to detect improvements (or regressions). Task-level metrics such as throughput and latency may not translate to the overall system’s performance: Optimizing the latency of a task off the critical path may have little impact on the overall, user-facing latency of a service. On the other hand, throughput improvements–doing more work with less CPU time and memory per request–let us use fewer resources to handle the same workload.
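A toy critical-path calculation (with hypothetical latencies) shows why task-level latency wins don’t always translate: for a fan-out RPC, the user-facing latency is set by the slowest parallel backend, so halving an already-fast backend changes nothing.

```cpp
// Minimal sketch (hypothetical numbers): end-to-end latency of a fan-out RPC
// is dominated by the slowest parallel backend, so speeding up a backend
// that is off the critical path doesn't move the user-facing number.
#include <algorithm>
#include <cstdio>
#include <vector>

double EndToEndMs(double frontend_ms, const std::vector<double>& backend_ms) {
  return frontend_ms + *std::max_element(backend_ms.begin(), backend_ms.end());
}

int main() {
  const double frontend = 5.0;
  std::printf("before:            %.1f ms\n", EndToEndMs(frontend, {40.0, 120.0, 70.0}));
  // Halving the 70 ms backend changes nothing user-visible...
  std::printf("off critical path: %.1f ms\n", EndToEndMs(frontend, {40.0, 120.0, 35.0}));
  // ...while improving the 120 ms backend does.
  std::printf("on critical path:  %.1f ms\n", EndToEndMs(frontend, {40.0, 80.0, 70.0}));
  return 0;
}
```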
The goal of optimization projects is to maximize value–serving search queries, videos on YouTube, etc.–and minimize costs–CPUs, RAM, disks, etc. Similar to how we can carefully use microbenchmarks to predict macrobenchmarks, which in turn predict production behavior, we can select proxy metrics to measure success. This lets us align with business goals, especially harder-to-measure ones, while still having an effective yardstick for day-to-day work.