Performance Hints

Jeff Dean, Sanjay Ghemawat

Original version: 2023/07/27, last updated: 2025/12/16

Over the years, we (Jeff & Sanjay) have done a fair bit of diving into performance tuning of various pieces of code, and improving the performance of our software has been important from the very earliest days of Google, since it lets us do more for more users. We wrote this document as a way of identifying some general principles and specific techniques that we use when doing this sort of work, and tried to pick illustrative source code changes (change lists, or CLs) that provide examples of the various approaches and techniques. Most of the concrete suggestions below reference C++ types and CLs, but the general principles apply to other languages. The document focuses on general performance tuning in the context of a single binary, and does not cover distributed systems or machine learning (ML) hardware performance tuning (huge areas unto themselves). We hope others will find this useful.

Many of the examples in the document have code fragments that demonstrate the techniques. Note that some of these code fragments mention various internal Google codebase abstractions. We have included these anyway if we felt the examples were self-contained enough to be understandable to those unfamiliar with the details of those abstractions.

The importance of thinking about performance

Knuth is often quoted out of context as saying premature optimization is the root of all evil. The full quote reads: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” This document is about that critical 3%, and a more compelling quote, again from Knuth, reads:

The improvement in speed from Example 2 to Example 2a is only about 12%, and many people would pronounce that insignificant. The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their “optimized” programs. In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering. Of course I wouldn’t bother making such optimizations on a one-shot job, but when it’s a question of preparing quality programs, I don’t want to restrict myself to tools that deny me such efficiencies.

Many people will say “let’s write down the code in as simple a way as possible and deal with performance later when we can profile”. However, this approach is often wrong:

  1. If you disregard all performance concerns when developing a large system, you will end up with a flat profile where there are no obvious hotspots because performance is lost all over the place. It will be difficult to figure out how to get started on performance improvements.
  2. If you are developing a library that will be used by other people, the people who run into performance problems will likely be people who cannot easily make performance improvements (they will have to understand the details of code written by other people/teams, and will have to negotiate with them about the importance of performance optimizations).
  3. It is harder to make significant changes to a system when it is in heavy use.
  4. It is also hard to tell whether there are performance problems that could be solved easily, so we end up with potentially expensive solutions like over-replication or severe overprovisioning of a service to handle load problems.

Instead, we suggest that when writing code, you try to choose the faster alternative if it does not significantly impact the readability or complexity of the code.

Estimation

If you can develop an intuition for how much performance might matter in the code you are writing, you can make a more informed decision (e.g., how much extra complexity is warranted in the name of performance). Some tips on estimating performance while you are writing code:

  • Is it test code? If so, you need to worry mostly about the asymptotic complexity of your algorithms and data structures. (Aside: development cycle time matters, so avoid writing tests that take a long time to run.)
  • Is it code specific to an application? If so, try to figure out how much performance matters for this piece of code. This is typically not very hard: just figuring out whether code is initialization/setup code vs. code that will end up on hot paths (e.g., processing every request in a service) is often sufficient.
  • Is it library code that will be used by many applications? In this case it is hard to tell how performance-sensitive it might become. This is where it becomes especially important to follow some of the simple techniques described in this document. For example, if you need to store a vector that usually has a small number of elements, use an absl::InlinedVector instead of std::vector. Such techniques are not very hard to follow and don’t add any non-local complexity to the system. And if it turns out that the code you are writing does end up using significant resources, it will perform better from the start, and it will be easier to find the next thing to focus on when looking at a profile.

You can do a slightly deeper analysis when picking between options with potentially different performance characteristics by relying on back-of-the-envelope calculations. Such calculations can quickly give a very rough estimate of the performance of different alternatives, and the results can be used to discard some of the alternatives without having to implement them.

Here is how such an estimation might work:

  1. Estimate how many low-level operations of various kinds are required, e.g., number of disk seeks, number of network round-trips, bytes transmitted etc.
  2. Multiply each kind of expensive operation by its rough cost, and add the results together.
  3. The preceding gives the cost of the system in terms of resource usage. If you are interested in latency, and if the system has any concurrency, some of the costs may overlap and you may have to do slightly more complicated analysis to estimate the latency.

The following table, which is an updated version of a table from a 2007 talk at Stanford University (video of the 2007 talk no longer exists, but there is a video of a related 2011 Stanford talk that covers some of the same content), may be useful since it lists the types of operations to consider and their rough costs:

L1 cache reference                             0.5 ns
L2 cache reference                             3 ns
Branch mispredict                              5 ns
Mutex lock/unlock (uncontended)               15 ns
Main memory reference                         50 ns
Compress 1K bytes with Snappy              1,000 ns
Read 4KB from SSD                         20,000 ns
Round trip within same datacenter         50,000 ns
Read 1MB sequentially from memory         64,000 ns
Read 1MB over 100 Gbps network           100,000 ns
Read 1MB from SSD                      1,000,000 ns
Disk seek                              5,000,000 ns
Read 1MB sequentially from disk       10,000,000 ns
Send packet CA->Netherlands->CA      150,000,000 ns

The preceding table contains rough costs for some basic low-level operations. You may find it useful to also track estimated costs for higher-level operations relevant to your system. E.g., you might want to know the rough cost of a point read from your SQL database, the latency of interacting with a Cloud service, or the time to render a simple HTML page. If you don’t know the relevant cost of different operations, you can’t do decent back-of-the-envelope calculations!

Example: Time to quicksort a billion 4 byte numbers

As a rough approximation, a good quicksort algorithm makes log(N) passes over an array of size N. On each pass, the array contents will be streamed from memory into the processor cache, and the partition code will compare each element once to a pivot element. Let’s add up the dominant costs:

  1. Memory bandwidth: the array occupies 4 GB (4 bytes per number times a billion numbers). Let’s assume ~16GB/s of memory bandwidth per core. That means each pass will take ~0.25s. N is ~2^30, so we will make ~30 passes, so the total cost of memory transfer will be ~7.5 seconds.
  2. Branch mispredictions: we will do a total of N*log(N) comparisons, i.e., ~30 billion comparisons. Let’s assume that half of them (i.e., 15 billion) are mispredicted. Multiplying by 5 ns per misprediction, we get a misprediction cost of 75 seconds. We assume for this analysis that correctly predicted branches are free.
  3. Adding up the previous numbers, we get an estimate of ~82.5 seconds.

If necessary, we could refine our analysis to account for processor caches. This refinement is probably not needed since branch mispredictions are the dominant cost according to the analysis above, but we include it here anyway as another example. Let’s assume we have a 32MB L3 cache, and that the cost of transferring data from the L3 cache to the processor is negligible. The L3 cache can hold 2^23 numbers, and therefore the last 22 passes can operate on data resident in the L3 cache (the 23rd-last pass brings data into the L3 cache and the remaining passes operate on that data). That cuts down the memory transfer cost to 2 seconds (8 memory transfers of 4GB at 16GB/s) instead of 7.5 seconds (30 memory transfers).

Example: Time to generate a web page with 30 image thumbnails

Let’s compare two potential designs where the original images are stored on disk, and each image is approximately 1MB in size.

  1. Read the contents of the 30 images serially and generate a thumbnail for each one. Each read takes one seek plus one transfer: 5ms for the seek and 10ms for the transfer, i.e., 15ms per image, so 30 images take 450ms.
  2. Read in parallel, assuming the images are spread evenly across K disks. The previous resource usage estimate still holds, but latency will drop by roughly a factor of K, ignoring variance (e.g, we will sometimes get unlucky and one disk will have more than 1/Kth of the images we are reading). Therefore if we are running on a distributed filesystem with hundreds of disks, the expected latency will drop to ~15ms.
  3. Let’s consider a variant where all images are on a single SSD. Reading the images serially from the SSD costs 20µs + 1ms per image, which adds up to ~30ms overall.

Measurement

The preceding section gives some tips about how to think about performance when writing code without worrying too much about how to measure the performance impact of your choices. However, before you actually start making improvements, or when you run into a tradeoff between performance and other concerns such as simplicity, you will want to measure or estimate the potential performance benefits. Being able to measure things effectively is the number one tool you’ll want to have in your arsenal when doing performance-related work.

As an aside, it’s worth pointing out that profiling code you’re unfamiliar with can also be a good way of getting a general sense of the structure of the codebase and how it operates. Examining the source code of routines that feature heavily in the dynamic call graph of a program can give you a high-level sense of “what happens” when running the code, which builds your confidence for making performance-improving changes in slightly unfamiliar code.

Profiling tools and tips

Many useful profiling tools are available. A useful tool to reach for first is pprof since it gives good high level performance information and is easy to use both locally and for code running in production. Also try perf if you want more detailed insight into performance.

Some tips for profiling:

  • Build production binaries with appropriate debugging information and optimization flags.
  • If you can, write a microbenchmark that covers the code you are improving. Microbenchmarks improve turnaround time when making performance improvements, help verify the impact of performance improvements, and can help prevent future performance regressions. However, microbenchmarks can have pitfalls that make them non-representative of full system performance. Useful libraries for writing microbenchmarks exist for C++, Go, and Java (a minimal C++ sketch appears after this list).
  • Use a benchmark library to emit performance counter readings both for better precision, and to get more insight into program behavior.
  • Lock contention can often artificially lower CPU usage. Some mutex implementations provide support for profiling lock contention.
  • Use ML profilers for machine learning performance work.
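
As an illustration of the microbenchmark suggestion above, here is a minimal sketch using the open-source C++ benchmark library (github.com/google/benchmark); the benchmarked operation and names are purely illustrative.

#include <cstdint>
#include <string>
#include <vector>

#include "benchmark/benchmark.h"

// Measures the cost of building a vector of strings of a given size.
static void BM_VectorPushBack(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<std::string> v;
    v.reserve(state.range(0));
    for (int64_t i = 0; i < state.range(0); ++i) {
      v.push_back("hello");
    }
    benchmark::DoNotOptimize(v.data());  // Keep the compiler from discarding the work.
  }
}
BENCHMARK(BM_VectorPushBack)->Range(8, 8 << 10);

BENCHMARK_MAIN();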

What to do when profiles are flat

You will often run into situations where your CPU profile is flat (there is no obvious big contributor to slowness). This can often happen when all low-hanging fruit has been picked. Here are some tips to consider if you find yourself in this situation:

  • Don’t discount the value of many small optimizations! Making twenty separate 1% improvements in some subsystem is often eminently possible, and collectively they add up to a pretty sizable improvement (work of this flavor often relies on having stable and high-quality microbenchmarks). Some examples of these sorts of changes are in the changes that demonstrate multiple techniques section.
  • Find loops closer to the top of call stacks (the flame graph view of a CPU profile can be helpful here). Potentially, the loop or the code it calls could be restructured to be more efficient. For example, some code that initially built a complicated graph structure incrementally by looping over the nodes and edges of the input was changed to build the graph structure in one shot from the entire input. This removed a bunch of internal checks that the initial code performed per edge.
  • Take a step back and look for structural changes higher up in the call stacks instead of concentrating on micro-optimizations. The techniques listed under algorithmic improvements can be useful when doing this.
  • Look for overly general code and replace it with a customized or lower-level implementation. E.g., if an application repeatedly uses a regular expression match where a simple prefix match would suffice, consider dropping the regular expression (see the sketch after this list).
  • Attempt to reduce the number of allocations: get an allocation profile, and pick away at the highest contributor to the number of allocations. This has two effects: (1) it directly reduces the amount of time spent in the allocator (and in the garbage collector for GC-ed languages); (2) it often reduces cache misses, since in a long-running program using tcmalloc, every allocation tends to go to a different cache line.
  • Gather other types of profiles, especially ones based on hardware performance counters. Such profiles may point out functions that encounter a high cache miss rate. The techniques described in the profiling tools and tips section can be helpful.
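
A minimal sketch of the “overly general code” tip above, assuming the caller only ever needs a prefix test (the path pattern and function names are hypothetical):

#include "absl/strings/match.h"
#include "absl/strings/string_view.h"
#include "re2/re2.h"

// Overly general: compiles and runs a regular expression.
bool IsDebugPathSlow(absl::string_view path) {
  static const RE2* pattern = new RE2("^/debug/.*");
  return RE2::FullMatch(path, *pattern);
}

// Specialized: a simple prefix check is all that is actually needed.
bool IsDebugPathFast(absl::string_view path) {
  return absl::StartsWith(path, "/debug/");
}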

API considerations

Some of the techniques suggested below require changing data structures and function signatures, which may be disruptive to callers. Try to organize code so that the suggested performance improvements can be made inside an encapsulation boundary without affecting public interfaces. This will be easier if your modules are deep (significant functionality accessed via a narrow interface).

Widely used APIs come under heavy pressure to add features. Be careful when adding new features since these will constrain future implementations and increase cost unnecessarily for users who don’t need the new features. E.g., many C++ standard library containers promise iterator stability, which in typical implementations increases the number of allocations significantly, even though many users do not need that stability.

Some specific techniques are listed below. Consider carefully the performance benefits vs. any API usability issues introduced by such changes.

Bulk APIs

Provide bulk ops to reduce expensive API boundary crossings or to take advantage of algorithmic improvements.

Add bulk MemoryManager::LookupMany interface.

In addition to adding a bulk interface, this also simplified the signature for the new bulk variant: it turns out clients only needed to know if all the keys were found, so we can return a bool rather than a Status object.

memory_manager.h

class MemoryManager {
 public:
  ...
  util::StatusOr<LiveTensor> Lookup(const TensorIdProto& id);
class MemoryManager {
 public:
  ...
  util::StatusOr<LiveTensor> Lookup(const TensorIdProto& id);

  // Lookup the identified tensors
  struct LookupKey {
    ClientHandle client;
    uint64 local_id;
  };
  bool LookupMany(absl::Span<const LookupKey> keys,
                  absl::Span<tensorflow::Tensor> tensors);

Add bulk ObjectStore::DeleteRefs API to amortize locking overhead.

object_store.h

template <typename T>
class ObjectStore {
 public:
  ...
  absl::Status DeleteRef(Ref);
template <typename T>
class ObjectStore {
 public:
  ...
  absl::Status DeleteRef(Ref);

  // Delete many references.  For each ref, if no other Refs point to the same
  // object, the object will be deleted.  Returns non-OK on any error.
  absl::Status DeleteRefs(absl::Span<const Ref> refs);
  ...
template <typename T>
absl::Status ObjectStore<T>::DeleteRefs(absl::Span<const Ref> refs) {
  util::Status result;
  absl::MutexLock l(&mu_);
  for (auto ref : refs) {
    result.Update(DeleteRefLocked(ref));
  }
  return result;
}

memory_tracking.cc

void HandleBatch(int, const plaque::Batch& input) override {
  for (const auto& t : input) {
    auto in = In(t);
    PLAQUE_OP_ASSIGN_OR_RETURN(const auto& handles, in.handles());
    for (const auto handle : handles.value->handles()) {
      PLAQUE_OP_RETURN_IF_ERROR(in_buffer_store_
                                    ? bstore_->DeleteRef(handle)
                                    : tstore_->DeleteRef(handle));
    }
  }
}
void HandleBatch(int, const plaque::Batch& input) override {
  for (const auto& t : input) {
    auto in = In(t);
    PLAQUE_OP_ASSIGN_OR_RETURN(const auto& handles, in.handles());
    if (in_buffer_store_) {
      PLAQUE_OP_RETURN_IF_ERROR(
          bstore_->DeleteRefs(handles.value->handles()));
    } else {
      PLAQUE_OP_RETURN_IF_ERROR(
          tstore_->DeleteRefs(handles.value->handles()));
    }
  }
}

Use Floyd's heap construction for efficient initialization.

Bulk initialization of a heap can be done in O(N) time, whereas adding one element at a time and updating the heap property after each addition requires O(N lg(N)) time.
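
A sketch of the difference using the C++ standard library (std::make_heap uses an O(N) bottom-up construction in the spirit of Floyd's algorithm, whereas N push_heap calls cost O(N lg(N)) overall):

#include <algorithm>
#include <vector>

// O(N): build the heap in one shot from all elements.
void BuildHeapBulk(std::vector<int>* values) {
  std::make_heap(values->begin(), values->end());
}

// O(N lg(N)): insert one element at a time, restoring the heap property
// after each insertion.
void BuildHeapIncrementally(const std::vector<int>& values,
                            std::vector<int>* heap) {
  heap->reserve(values.size());
  for (int v : values) {
    heap->push_back(v);
    std::push_heap(heap->begin(), heap->end());
  }
}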

Sometimes it is hard to change callers to use a new bulk API directly. In that case it might be beneficial to use a bulk API internally and cache the results for use in future non-bulk API calls:

Cache block decode results for use in future calls.

Each lookup needs to decode a whole block of K entries. Store the decoded entries in a cache and consult the cache on future lookups.

lexicon.cc

void GetTokenString(int pos, std::string* out) const {
  ...
  absl::FixedArray<LexiconEntry, 32> entries(pos + 1);

  // Decode all lexicon entries up to and including pos.
  for (int i = 0; i <= pos; ++i) {
    p = util::coding::TwoValuesVarint::Decode32(p, &entries[i].remaining,
                                                &entries[i].shared);
    entries[i].remaining_str = p;
    p += entries[i].remaining;  // remaining bytes trail each entry.
  }
mutable std::vector<absl::InlinedVector<std::string, 16>> cache_;
...
void GetTokenString(int pos, std::string* out) const {
  ...
  DCHECK_LT(skentry, cache_.size());
  if (!cache_[skentry].empty()) {
    *out = cache_[skentry][pos];
    return;
  }
  ...
  // Init cache.
  ...
  const char* prev = p;
  for (int i = 0; i < block_sz; ++i) {
    uint32 shared, remaining;
    p = TwoValuesVarint::Decode32(p, &remaining, &shared);
    auto& cur = cache_[skentry].emplace_back();
    gtl::STLStringResizeUninitialized(&cur, remaining + shared);

    std::memcpy(cur.data(), prev, shared);
    std::memcpy(cur.data() + shared, p, remaining);
    prev = cur.data();
    p += remaining;
  }
  *out = cache_[skentry][pos];

View types

Prefer view types (e.g., std::string_view, absl::Span<T>, absl::FunctionRef<R(Args...)>) for function arguments (unless ownership of the data is being transferred). These types reduce copying, and allow callers to pick their own container types (e.g., one caller might use std::vector whereas another uses absl::InlinedVector).
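
A small sketch of the idea (the function and types are illustrative): a view-typed parameter lets different callers pass different container types without copying.

#include <cstdint>
#include <numeric>
#include <vector>

#include "absl/container/inlined_vector.h"
#include "absl/types/span.h"

// Accepts any contiguous sequence of ints without copying it.
int64_t Sum(absl::Span<const int> values) {
  return std::accumulate(values.begin(), values.end(), int64_t{0});
}

void Example() {
  std::vector<int> a = {1, 2, 3};
  absl::InlinedVector<int, 4> b = {4, 5, 6};
  int64_t total = Sum(a) + Sum(b);  // Both containers convert to absl::Span.
  (void)total;
}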

Pre-allocated/pre-computed arguments

For frequently called routines, sometimes it is useful to allow higher-level callers to pass in a data structure that they own or information that the called routine needs that the client already has. This can avoid the low-level routine being forced to allocate its own temporary data structure or recompute already-available information.

Add RPC_Stats::RecordRPC variant allowing client to pass in already available WallTime value.

rpc-stats.h

static void RecordRPC(const Name &name, const RPC_Stats_Measurement& m);
static void RecordRPC(const Name &name, const RPC_Stats_Measurement& m,
                      WallTime now);

clientchannel.cc

const WallTime now = WallTime_Now();
...
RPC_Stats::RecordRPC(stats_name, m);
const WallTime now = WallTime_Now();
...
RPC_Stats::RecordRPC(stats_name, m, now);

Thread-compatible vs. Thread-safe types

A type may be either thread-compatible (synchronized externally) or thread-safe (synchronized internally). Most generally used types should be thread-compatible. This way callers who do not need thread-safety don’t pay for it.

Make a class thread-compatible since callers are already synchronized.

hitless-transfer-phase.cc

TransferPhase HitlessTransferPhase::get() const {
  static CallsiteMetrics cm("HitlessTransferPhase::get");
  MonitoredMutexLock l(&cm, &mutex_);
  return phase_;
}
TransferPhase HitlessTransferPhase::get() const { return phase_; }

hitless-transfer-phase.cc

bool HitlessTransferPhase::AllowAllocate() const {
  static CallsiteMetrics cm("HitlessTransferPhase::AllowAllocate");
  MonitoredMutexLock l(&cm, &mutex_);
  return phase_ == TransferPhase::kNormal || phase_ == TransferPhase::kBrownout;
}
bool HitlessTransferPhase::AllowAllocate() const {
  return phase_ == TransferPhase::kNormal || phase_ == TransferPhase::kBrownout;
}

However if the typical use of a type needs synchronization, prefer to move the synchronization inside the type. This allows the synchronization mechanism to be tweaked as necessary to improve performance (e.g., sharding to reduce contention) without affecting callers.
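
As a sketch of what “tweaking the synchronization mechanism without affecting callers” might look like, here is a hypothetical counter whose internal synchronization is sharded to reduce contention; callers only see Increment() and Get():

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

class ShardedCounter {
 public:
  void Increment() {
    // Hash the calling thread to a shard so that concurrent increments
    // usually touch different cache lines.
    const size_t shard =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
    shards_[shard].value.fetch_add(1, std::memory_order_relaxed);
  }

  int64_t Get() const {
    int64_t total = 0;
    for (const auto& s : shards_) {
      total += s.value.load(std::memory_order_relaxed);
    }
    return total;
  }

 private:
  static constexpr int kShards = 16;
  struct alignas(64) Shard {  // One cache line per shard.
    std::atomic<int64_t> value{0};
  };
  std::array<Shard, kShards> shards_;
};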

Algorithmic improvements

The most critical opportunities for performance improvements come from algorithmic improvements, e.g., turning an O(N²) algorithm into an O(N lg(N)) or O(N) one, avoiding potentially exponential behavior, etc. These opportunities are rare in stable code, but are worth paying attention to when writing new code. A few examples that show such improvements to pre-existing code:

Add nodes to cycle detection structure in reverse post-order.

We were previously adding graph nodes and edges one at a time to a cycle-detection data structure, which required expensive work per edge. We now add the entire graph in reverse post-order, which makes cycle-detection trivial.

graphcycles.h

class GraphCycles : public util_graph::Graph {
 public:
  GraphCycles();
  ~GraphCycles() override;

  using Node = util_graph::Node;
class GraphCycles : public util_graph::Graph {
 public:
  GraphCycles();
  ~GraphCycles() override;

  using Node = util_graph::Node;

  // InitFrom adds all the nodes and edges from src, returning true if
  // successful, false if a cycle is encountered.
  // REQUIRES: no nodes and edges have been added to GraphCycles yet.
  bool InitFrom(const util_graph::Graph& src);

graphcycles.cc

bool GraphCycles::InitFrom(const util_graph::Graph& src) {
  ...
  // Assign ranks in topological order so we don't need any reordering during
  // initialization. For an acyclic graph, DFS leaves nodes in reverse
  // topological order, so we assign decreasing ranks to nodes as we leave them.
  Rank last_rank = n;
  auto leave = [&](util_graph::Node node) {
    DCHECK(r->rank[node] == kMissingNodeRank);
    NodeInfo* nn = &r->nodes[node];
    nn->in = kNil;
    nn->out = kNil;
    r->rank[node] = --last_rank;
  };
  util_graph::DFSAll(src, std::nullopt, leave);

  // Add all the edges (detect cycles as we go).
  bool have_cycle = false;
  util_graph::PerEdge(src, [&](util_graph::Edge e) {
    DCHECK_NE(r->rank[e.src], kMissingNodeRank);
    DCHECK_NE(r->rank[e.dst], kMissingNodeRank);
    if (r->rank[e.src] >= r->rank[e.dst]) {
      have_cycle = true;
    } else if (!HasEdge(e.src, e.dst)) {
      EdgeListAddNode(r, &r->nodes[e.src].out, e.dst);
      EdgeListAddNode(r, &r->nodes[e.dst].in, e.src);
    }
  });
  if (have_cycle) {
    return false;
  } else {
    DCHECK(CheckInvariants());
    return true;
  }
}

graph_partitioner.cc

absl::Status MergeGraph::Init() {
  const Graph& graph = *compiler_->graph();
  clusters_.resize(graph.NodeLimit());
  graph.PerNode([&](Node node) {
    graph_->AddNode(node);
    NodeList* n = new NodeList;
    n->push_back(node);
    clusters_[node] = n;
  });
  absl::Status s;
  PerEdge(graph, [&](Edge e) {
    if (!s.ok()) return;
    if (graph_->HasEdge(e.src, e.dst)) return;  // already added
    if (!graph_->InsertEdge(e.src, e.dst)) {
      s = absl::InvalidArgumentError("cycle in the original graph");
    }
  });
  return s;
}
absl::Status MergeGraph::Init() {
  const Graph& graph = *compiler_->graph();
  if (!graph_->InitFrom(graph)) {
    return absl::InvalidArgumentError("cycle in the original graph");
  }
  clusters_.resize(graph.NodeLimit());
  graph.PerNode([&](Node node) {
    NodeList* n = new NodeList;
    n->push_back(node);
    clusters_[node] = n;
  });
  return absl::OkStatus();
}

Replace the deadlock detection system built into a mutex implementation with a better algorithm.

Replaced the deadlock detection algorithm with one that is ~50x as fast and scales to millions of mutexes without problems (the old algorithm relied on a 2K limit to avoid a performance cliff). The new code is based on the following paper: David J. Pearce and Paul H. J. Kelly, “A Dynamic Topological Sort Algorithm for Directed Acyclic Graphs”, Journal of Experimental Algorithmics (JEA), Volume 11, 2006, Article No. 1.7.

The new algorithm takes O(|V|+|E|) space (instead of the O(|V|^2) bits needed by the older algorithm). Lock-acquisition order graphs are very sparse, so this is much less space. The algorithm is also quite simple: its core is ~100 lines of C++. Since the code now scales to a much larger number of mutexes, we were able to relax an artificial 2K limit, which uncovered a number of latent deadlocks in real programs.

Benchmark results: these were run in DEBUG mode since deadlock detection is mainly enabled in debug mode. The benchmark argument (/2k etc.) is the number of tracked nodes. At the default 2k limit of the old algorithm, the new algorithm takes only 0.5 microseconds per InsertEdge compared to 22 microseconds for the old algorithm. The new algorithm also easily scales to much larger graphs without problems whereas the old algorithm keels over quickly.

DEBUG: Benchmark            Time(ns)    CPU(ns) Iterations
----------------------------------------------------------
Old algorithm:
DEBUG: BM_StressTest/2k        23553      23566      29086
DEBUG: BM_StressTest/4k        45879      45909      15287
DEBUG: BM_StressTest/16k      776938     777472        817
New algorithm:
DEBUG: BM_StressTest/2k          392        393   10485760
DEBUG: BM_StressTest/4k          392        393   10485760
DEBUG: BM_StressTest/32k         407        407   10485760
DEBUG: BM_StressTest/256k        456        456   10485760
DEBUG: BM_StressTest/1M          534        534   10485760

Replace an IntervalMap (with O(lg N) lookups) with a hash table (O(1) lookups).

The initial code was using IntervalMap because it seemed like the right data structure to support coalescing of adjacent blocks, but a hash table suffices since the adjacent block can be found by a hash table lookup. This (plus other changes in the CL) improves the performance of tpu::BestFitAllocator by ~4X.

best_fit_allocator.h

using Block = gtl::IntervalMap<int64, BlockState>::Entry;
...
// Map of pairs (address range, BlockState) with one entry for each allocation
// covering the range [0, allocatable_range_end_).  Adjacent kFree and
// kReserved blocks are coalesced. Adjacent kAllocated blocks are not
// coalesced.
gtl::IntervalMap<int64, BlockState> block_list_;

// Set of all free blocks sorted according to the allocation policy. Adjacent
// free blocks are coalesced.
std::set<Block, BlockSelector> free_list_;
// A faster hash function for offsets in the BlockTable
struct OffsetHash {
  ABSL_ATTRIBUTE_ALWAYS_INLINE size_t operator()(int64 value) const {
    uint64 m = value;
    m *= uint64_t{0x9ddfea08eb382d69};
    return static_cast<uint64_t>(m ^ (m >> 32));
  }
};

// Hash table maps from block start address to block info.
// We include the length of the previous block in this info so we
// can find the preceding block to coalesce with.
struct HashTableEntry {
  BlockState state;
  int64 my_length;
  int64 prev_length;  // Zero if there is no previous block.
};
using BlockTable = absl::flat_hash_map<int64, HashTableEntry, OffsetHash>;

Replace sorted-list intersection (O(N log N)) with hash table lookups (O(N)).

Old code to detect whether or not two nodes share a common source would get the sources for each node in sorted order and then do a sorted intersection. The new code places the sources for one node in a hash-table and then iterates over the other node's sources checking the hash-table.

name             old time/op  new time/op  delta
BM_CompileLarge   28.5s ± 2%   22.4s ± 2%  -21.61%  (p=0.008 n=5+5)

Implement good hash function so that things are O(1) instead of O(N).

location.h

// Hasher for Location objects.
struct LocationHash {
  size_t operator()(const Location* key) const {
    return key != nullptr ? util_hash::Hash(key->address()) : 0;
  }
};
size_t HashLocation(const Location& loc);
...
struct LocationHash {
  size_t operator()(const Location* key) const {
    return key != nullptr ? HashLocation(*key) : 0;
  }
};

location.cc

size_t HashLocation(const Location& loc) {
  util_hash::MurmurCat m;

  // Encode some simpler features into a single value.
  m.AppendAligned((loc.dynamic() ? 1 : 0)                    //
                  | (loc.append_shard_to_address() ? 2 : 0)  //
                  | (loc.is_any() ? 4 : 0)                   //
                  | (!loc.any_of().empty() ? 8 : 0)          //
                  | (loc.has_shardmap() ? 16 : 0)            //
                  | (loc.has_sharding() ? 32 : 0));

  if (loc.has_shardmap()) {
    m.AppendAligned(loc.shardmap().output() |
                    static_cast<uint64_t>(loc.shardmap().stmt()) << 20);
  }
  if (loc.has_sharding()) {
    uint64_t num = 0;
    switch (loc.sharding().type_case()) {
      case Sharding::kModShard:
        num = loc.sharding().mod_shard();
        break;
      case Sharding::kRangeSplit:
        num = loc.sharding().range_split();
        break;
      case Sharding::kNumShards:
        num = loc.sharding().num_shards();
        break;
      default:
        num = 0;
        break;
    }
    m.AppendAligned(static_cast<uint64_t>(loc.sharding().type_case()) |
                    (num << 3));
  }

  auto add_string = [&m](absl::string_view s) {
    if (!s.empty()) {
      m.Append(s.data(), s.size());
    }
  };

  add_string(loc.address());
  add_string(loc.lb_policy());

  // We do not include any_of since it is complicated to compute a hash
  // value that is not sensitive to order and duplication.
  return m.GetHash();
}

Better memory representation

Careful consideration of the memory footprint and cache footprint of important data structures can often yield big savings. The data structures below focus on supporting common operations by touching fewer cache lines. Care taken here can (a) avoid expensive cache misses and (b) reduce memory bus traffic, which speeds up both the program in question and anything else running on the same machine. They rely on some common techniques you may find useful when designing your own data structures.

Compact data structures

Use compact representations for data that will be accessed often or that comprises a large portion of the application’s memory usage. A compact representation can significantly reduce memory usage and improve performance by touching fewer cache lines and reducing memory bus bandwidth usage. However, watch out for cache-line contention.

Memory layout

Carefully consider the memory layout of types that have a large memory or cache footprint (a small sketch follows the list below).

  • Reorder fields to reduce padding between fields with different alignment requirements (see class layout discussion).
  • Use smaller numeric types where the stored data will fit in the smaller type.
  • Enum values sometimes take up a whole word unless you’re careful. Consider using a smaller representation (e.g., use enum class OpType : uint8_t { ... } instead of enum class OpType { ... }).
  • Order fields so that fields that are frequently accessed together are closer to each other – this will reduce the number of cache lines touched on common operations.
  • Place hot read-only fields away from hot mutable fields so that writes to the mutable fields do not cause the read-only fields to be evicted from nearby caches.
  • Move cold data so it does not live next to hot data, either by placing the cold data at the end of the struct, or behind a level of indirection, or in a separate array.
  • Consider packing things into fewer bytes by using bit and byte-level encoding. This can be complicated, so only do this when the data under question is encapsulated inside a well-tested module, and the overall reduction of memory usage is significant. Furthermore, watch out for side effects like under-alignment of frequently used data, or more expensive code for accessing packed representations. Validate such changes using benchmarks.
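
A small illustrative sketch (hypothetical struct) of the field-reordering and smaller-type points above; the exact sizes assume a typical 64-bit ABI:

#include <cstdint>

struct Before {
  bool done;       // 1 byte + 7 bytes padding (next field needs 8-byte alignment)
  int64_t id;      // 8 bytes
  int32_t count;   // 4 bytes
  bool cached;     // 1 byte + 3 bytes tail padding
};                 // sizeof(Before) == 24

struct After {     // Fields ordered from most- to least-aligned.
  int64_t id;      // 8 bytes
  int32_t count;   // 4 bytes
  bool done;       // 1 byte
  bool cached;     // 1 byte + 2 bytes tail padding
};                 // sizeof(After) == 16

static_assert(sizeof(After) < sizeof(Before), "reordering removes padding");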

Indices instead of pointers

On modern 64-bit machines, pointers take up 64 bits. If you have a pointer-rich data structure, you can easily chew up lots of memory with indirections of T*. Instead, consider using integer indices into an array T[] or other data structure. Not only will the references be smaller (if the number of indices is small enough to fit in 32 or fewer bits), but the storage for all the T[] elements will be contiguous, often leading to better cache locality.
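
A sketch of the idea with a hypothetical tree type: 32-bit indices into a contiguous array replace 64-bit child pointers, halving the size of each reference and keeping nodes packed together.

#include <cstdint>
#include <vector>

struct Node {
  static constexpr uint32_t kNone = 0xffffffff;  // Sentinel for "no node".
  uint32_t first_child = kNone;    // Index into Tree::nodes instead of Node*.
  uint32_t next_sibling = kNone;   // Index into Tree::nodes instead of Node*.
  int32_t value = 0;
};

struct Tree {
  std::vector<Node> nodes;  // All nodes live contiguously in one allocation.

  uint32_t AddNode(int32_t value) {
    Node n;
    n.value = value;
    nodes.push_back(n);
    return static_cast<uint32_t>(nodes.size() - 1);
  }
};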

Batched storage

Avoid data structures that allocate a separate object per stored element (e.g., std::map, std::unordered_map in C++). Instead, consider types that use chunked or flat representations to store multiple elements in close proximity in memory (e.g., std::vector, absl::flat_hash_{map,set} in C++). Such types tend to have much better cache behavior. Furthermore, they encounter less allocator overhead.

One useful technique is to partition elements into chunks where each chunk can hold a fixed number of elements. This technique can reduce the cache footprint of a data structure significantly while preserving good asymptotic behavior.

For some data structures, a single chunk suffices to hold all elements (e.g., strings and vectors). Other types (e.g., absl::flat_hash_map) also use this technique.

Inlined storage

Some container types are optimized for storing a small number of elements. These types provide space for a small number of elements at the top level and completely avoid allocations when the number of elements is small. This can be very helpful when instances of such types are constructed often (e.g., as stack variables in frequently executed code), or if many instances are live at the same time. If a container will typically contain a small number of elements consider using one of the inlined storage types, e.g., InlinedVector.

Caveat: if sizeof(T) is large, inlined storage containers may not be the best choice since the inlined backing store will be large.
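
A minimal sketch: with absl::InlinedVector, no heap allocation happens until the element count exceeds the inlined capacity.

#include "absl/container/inlined_vector.h"

void Example() {
  // Space for 8 elements is part of the object itself (e.g., on the stack).
  absl::InlinedVector<int, 8> v;
  for (int i = 0; i < 8; ++i) {
    v.push_back(i);  // No heap allocation so far.
  }
  v.push_back(8);    // The 9th element spills to a heap-allocated backing store.
}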

Unnecessarily nested maps

Sometimes a nested map data structure can be replaced with a single-level map with a compound key. This can reduce the cost of lookups and insertions significantly.

Reduce allocations and improve cache footprint by converting btree<a,btree<b,c>> to btree<pair<a,b>,c>.

graph_splitter.cc

absl::btree_map<std::string, absl::btree_map<std::string, OpDef>> ops;
// The btree maps from {package_name, op_name} to its const Opdef*.
absl::btree_map<std::pair<absl::string_view, absl::string_view>,
                const OpDef*>
    ops;

Caveat: if the first map key is big, it might be better to stick with nested maps:

Switching to a nested map leads to a 76% performance improvement in a microbenchmark.

We previously had a single-level hash table where the key consisted of a (string) path and some other numeric sub-keys. Each path occurred in approximately 1000 keys on average. We split the hash table into two levels where the first level was keyed by the path and each second level hash table kept just the sub-key to data mapping for a particular path. This reduced the memory usage for storing paths by a factor of 1000, and also sped up accesses where many sub-keys for the same path were accessed together.

Arenas

Arenas can help reduce memory allocation cost, but they also have the benefit of packing together independently allocated items next to each other, typically in fewer cache lines, and eliminating most destruction costs. They are likely most effective for complex data structures with many sub-objects. Consider providing an appropriate initial size for the arena since that can help reduce allocations.

Caveat: it is easy to misuse arenas by putting too many short-lived objects in a long-lived arena, which can unnecessarily bloat memory footprint.
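
A sketch using protobuf arenas (MyMessage is a hypothetical arena-enabled message type with a hypothetical id field): the messages are packed into the arena's blocks and freed all at once when the arena is destroyed.

#include "google/protobuf/arena.h"

void BuildMessages() {
  google::protobuf::ArenaOptions options;
  options.start_block_size = 64 * 1024;  // Pre-size to reduce block allocations.
  google::protobuf::Arena arena(options);
  for (int i = 0; i < 1000; ++i) {
    // Allocated inside the arena; no per-message delete is needed.
    MyMessage* m = google::protobuf::Arena::CreateMessage<MyMessage>(&arena);
    m->set_id(i);  // Hypothetical field.
  }
}  // Destroying the arena releases all messages in bulk.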

Arrays instead of maps

If the domain of a map can be represented by a small integer or is an enum, or if the map will have very few elements, the map can sometimes be replaced by an array or a vector of some form.

Use an array instead of flat_map.

rtp_controller.h

const gtl::flat_map<int, int> payload_type_to_clock_frequency_;
// A map (implemented as a simple array) indexed by payload_type to clock freq
// for that payload type (or 0)
struct PayloadTypeToClockRateMap {
  int map[128];
};
...
const PayloadTypeToClockRateMap payload_type_to_clock_frequency_;

Bit vectors instead of sets

If the domain of a set can be represented by a small integer, the set can be replaced with a bit vector (InlinedBitVector is often a good choice). Set operations can also be nicely efficient on these representations using bitwise boolean operations (OR for union, AND for intersection, etc.).

Spanner placement system. Replace dense_hash_set<ZoneId> with a bit-vector with one bit per zone.

zone_set.h

class ZoneSet: public dense_hash_set<ZoneId> {
 public:
  ...
  bool Contains(ZoneId zone) const {
    return count(zone) > 0;
  }
class ZoneSet {
  ...
  // Returns true iff "zone" is contained in the set
  bool ContainsZone(ZoneId zone) const {
    return zone < b_.size() && b_.get_bit(zone);
  }
  ...
 private:
  int size_;          // Number of zones inserted
  util::bitmap::InlinedBitVector<256> b_;

Benchmark results:

CPU: AMD Opteron (4 cores) dL1:64KB dL2:1024KB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_Evaluate/1                            960       676    +29.6%
BM_Evaluate/2                           1661      1138    +31.5%
BM_Evaluate/3                           2305      1640    +28.9%
BM_Evaluate/4                           3053      2135    +30.1%
BM_Evaluate/5                           3780      2665    +29.5%
BM_Evaluate/10                          7819      5739    +26.6%
BM_Evaluate/20                         17922     12338    +31.2%
BM_Evaluate/40                         36836     26430    +28.2%

Use bit matrix to keep track of reachability properties between operands instead of hash table.

hlo_computation.h

using TransitiveOperandMap =
    std::unordered_map<const HloInstruction*,
                       std::unordered_set<const HloInstruction*>>;
class HloComputation::ReachabilityMap {
  ...
  // dense id assignment from HloInstruction* to number
  tensorflow::gtl::FlatMap<const HloInstruction*, int> ids_;
  // matrix_(a,b) is true iff b is reachable from a
  tensorflow::core::Bitmap matrix_;
};

Reduce allocations

Memory allocation adds costs:

  1. It increases the time spent in the allocator.
  2. Newly-allocated objects may require expensive initialization and sometimes corresponding expensive destruction when no longer needed.
  3. Every allocation tends to be on a new cache line and therefore data spread across many independent allocations will have a larger cache footprint than data spread across fewer allocations.

Garbage-collection runtimes sometimes obviate issue #3 by placing consecutive allocations sequentially in memory.

Avoid unnecessary allocations

Reducing allocations increases benchmark throughput by 21%.

memory_manager.cc

LiveTensor::LiveTensor(tf::Tensor t, std::shared_ptr<const DeviceInfo> dinfo,
                       bool is_batched)
    : tensor(std::move(t)),
      device_info(dinfo ? std::move(dinfo) : std::make_shared<DeviceInfo>()),
      is_batched(is_batched) {
static const std::shared_ptr<DeviceInfo>& empty_device_info() {
  static std::shared_ptr<DeviceInfo>* result =
      new std::shared_ptr<DeviceInfo>(new DeviceInfo);
  return *result;
}

LiveTensor::LiveTensor(tf::Tensor t, std::shared_ptr<const DeviceInfo> dinfo,
                       bool is_batched)
    : tensor(std::move(t)), is_batched(is_batched) {
  if (dinfo) {
    device_info = std::move(dinfo);
  } else {
    device_info = empty_device_info();
  }

Use statically-allocated zero vector when possible rather than allocating a vector and filling it with zeroes.

embedding_executor_8bit.cc

// The actual implementation of the EmbeddingLookUpT using template parameters
// instead of object members to improve the performance.
template <bool Mean, bool SymmetricInputRange>
static tensorflow::Status EmbeddingLookUpT(...) {
    ...
  std::unique_ptr<tensorflow::quint8[]> zero_data(
      new tensorflow::quint8[max_embedding_width]);
  memset(zero_data.get(), 0, sizeof(tensorflow::quint8) * max_embedding_width);
// A size large enough to handle most embedding widths
static const int kTypicalMaxEmbedding = 256;
static tensorflow::quint8 static_zero_data[kTypicalMaxEmbedding];  // All zeroes
...
// The actual implementation of the EmbeddingLookUpT using template parameters
// instead of object members to improve the performance.
template <bool Mean, bool SymmetricInputRange>
static tensorflow::Status EmbeddingLookUpT(...) {
    ...
  std::unique_ptr<tensorflow::quint8[]> zero_data_backing(nullptr);

  // Get a pointer to a memory area with at least
  // "max_embedding_width" quint8 zero values.
  tensorflow::quint8* zero_data;
  if (max_embedding_width <= ARRAYSIZE(static_zero_data)) {
    // static_zero_data is big enough so we don't need to allocate zero data
    zero_data = &static_zero_data[0];
  } else {
    // static_zero_data is not big enough: we need to allocate zero data
    zero_data_backing =
        absl::make_unique<tensorflow::quint8[]>(max_embedding_width);
    memset(zero_data_backing.get(), 0,
           sizeof(tensorflow::quint8) * max_embedding_width);
    zero_data = zero_data_backing.get();
  }

Also, prefer stack allocation over heap allocation when object lifetime is bounded by the scope (although be careful with stack frame sizes for large objects).

Resize or reserve containers

When the maximum or expected maximum size of a vector (or some other container types) is known in advance, pre-size the container’s backing store (e.g., using resize or reserve in C++).

Pre-size a vector and fill it in, rather than N push_back operations.

indexblockdecoder.cc

for (int i = 0; i < ndocs-1; i++) {
  uint32 delta;
  ERRORCHECK(b->GetRice(rice_base, &delta));
  docs_.push_back(DocId(my_shard_ + (base + delta) * num_shards_));
  base = base + delta + 1;
}
docs_.push_back(last_docid_);
docs_.resize(ndocs);
DocId* docptr = &docs_[0];
for (int i = 0; i < ndocs-1; i++) {
  uint32 delta;
  ERRORCHECK(b.GetRice(rice_base, &delta));
  *docptr = DocId(my_shard_ + (base + delta) * num_shards_);
  docptr++;
  base = base + delta + 1;
}
*docptr = last_docid_;

Caveat: Do not use resize or reserve to grow one element at a time since that may lead to quadratic behavior. Also, if element construction is expensive, prefer an initial reserve call followed by several push_back or emplace_back calls instead of an initial resize since that will double the number of constructor calls.
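
A sketch contrasting the two approaches for an element type with a non-trivial constructor:

#include <string>
#include <vector>

std::vector<std::string> MakeLabels(int n) {
  std::vector<std::string> labels;
  labels.reserve(n);  // One backing-store allocation up front; no element construction yet.
  for (int i = 0; i < n; ++i) {
    labels.emplace_back("label-" + std::to_string(i));  // Construct each element once, in place.
  }
  return labels;
  // An initial labels.resize(n) instead would default-construct n strings and
  // then assign over them, doubling the per-element construction work.
}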

Avoid copying when possible

  • Prefer moving to copying data structures when possible.
  • If lifetime is not an issue, store pointers or indices instead of copies of objects in transient data structures. E.g., if a local map is used to select a set of protos from an incoming list of protos, we can make the map store just pointers to the incoming protos instead of copying potentially deeply nested data. Another common example is sorting a vector of indices rather than sorting a vector of large objects directly, since the latter would incur significant copying/moving costs (see the sketch below).
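
A sketch of the index-sorting pattern (the record type is illustrative):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct BigRecord {
  int64_t key;
  char payload[256];  // Expensive to copy or move repeatedly while sorting.
};

std::vector<size_t> SortedOrder(const std::vector<BigRecord>& records) {
  std::vector<size_t> order(records.size());
  std::iota(order.begin(), order.end(), 0);  // 0, 1, 2, ...
  std::sort(order.begin(), order.end(), [&records](size_t a, size_t b) {
    return records[a].key < records[b].key;  // Only 8-byte indices are moved.
  });
  return order;
}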

Avoid an extra copy when receiving a tensor via gRPC.

A benchmark that sends around 400KB tensors speeds up by ~10-15%:

Benchmark              Time(ns)    CPU(ns) Iterations
-----------------------------------------------------
BM_RPC/30/98k_mean    148764691 1369998944       1000
Benchmark              Time(ns)    CPU(ns) Iterations
-----------------------------------------------------
BM_RPC/30/98k_mean    131595940 1216998084       1000

Move large options structure rather than copying it.

index.cc

return search_iterators::DocPLIteratorFactory::Create(opts);
return search_iterators::DocPLIteratorFactory::Create(std::move(opts));

Use std::sort instead of std::stable_sort, which avoids an internal copy inside the stable sort implementation.

encoded-vector-hits.h

std::stable_sort(hits_.begin(), hits_.end(),
                 gtl::OrderByField(&HitWithPayloadOffset::docid));
struct HitWithPayloadOffset {
  search_iterators::LocalDocId64 docid;
  int first_payload_offset;  // offset into the payload vector.
  int num_payloads;

  bool operator<(const HitWithPayloadOffset& other) const {
    return (docid < other.docid) ||
           (docid == other.docid &&
            first_payload_offset < other.first_payload_offset);
  }
};
    ...
    std::sort(hits_.begin(), hits_.end());

Reuse temporary objects

A container or an object declared inside a loop will be recreated on every loop iteration. This can lead to expensive construction, destruction, and resizing. Hoisting the declaration outside the loop enables reuse and can provide a significant performance boost. (Compilers are often unable to do such hoisting on their own due to language semantics or their inability to ensure program equivalence.)

Hoist variable definition outside of loop iteration.

autofdo_profile_utils.h

auto iterator = absl::WrapUnique(sstable->GetIterator());
while (!iterator->done()) {
  T profile;
  if (!profile.ParseFromString(iterator->value_view())) {
    return absl::InternalError(
        "Failed to parse mem_block to specified profile type.");
  }
  ...
  iterator->Next();
}
auto iterator = absl::WrapUnique(sstable->GetIterator());
T profile;
while (!iterator->done()) {
  if (!profile.ParseFromString(iterator->value_view())) {
    return absl::InternalError(
        "Failed to parse mem_block to specified profile type.");
  }
  ...
  iterator->Next();
}

Define a protobuf variable outside a loop so that its allocated storage can be reused across loop iterations.

stats-router.cc

for (auto& r : routers_to_update) {
  ...
  ResourceRecord record;
  {
    MutexLock agg_lock(r.agg->mutex());
    r.agg->AddResourceRecordUsages(measure_indices, &record);
  }
  ...
}
ResourceRecord record;
for (auto& r : routers_to_update) {
  ...
  record.Clear();
  {
    MutexLock agg_lock(r.agg->mutex());
    r.agg->AddResourceRecordUsages(measure_indices, &record);
  }
  ...
}

Serialize to same std::string repeatedly.

program_rep.cc

std::string DeterministicSerialization(const proto2::Message& m) {
  std::string result;
  proto2::io::StringOutputStream sink(&result);
  proto2::io::CodedOutputStream out(&sink);
  out.SetSerializationDeterministic(true);
  m.SerializePartialToCodedStream(&out);
  return result;
}
absl::string_view DeterministicSerializationTo(const proto2::Message& m,
                                               std::string* scratch) {
  scratch->clear();
  proto2::io::StringOutputStream sink(scratch);
  proto2::io::CodedOutputStream out(&sink);
  out.SetSerializationDeterministic(true);
  m.SerializePartialToCodedStream(&out);
  return absl::string_view(*scratch);
}

Caveat: protobufs, strings, vectors, and other containers tend to grow to the size of the largest value ever stored in them. Therefore reconstructing them periodically (e.g., after every N uses) can help reduce memory requirements and reinitialization costs.

Avoid unnecessary work

Perhaps the most effective category of performance improvement is avoiding work you don’t have to do. This can take many forms, including creating specialized paths through code for common cases that avoid more general, expensive computation; precomputation; deferring work until it is really needed; and hoisting work into less-frequently executed pieces of code. Below are many examples of this general approach, grouped into a few representative categories.

Fast paths for common cases

Often, code is written to cover all cases, but some subset of the cases are much simpler and more common than others. E.g., vector::push_back usually has enough space for the new element, but contains code to resize the underlying storage when it does not. Some attention paid to the structure of code can help make the common simple case faster without hurting uncommon case performance significantly.

Make fast path cover more common cases.

Add handling of trailing single ASCII bytes, rather than only handling multiples of four bytes with this routine. This avoids calling the slower generic routine for all-ASCII strings that are, for example, 5 bytes.

utf8statetable.cc

// Scan a UTF-8 stringpiece based on state table.
// Always scan complete UTF-8 characters
// Set number of bytes scanned. Return reason for exiting
// OPTIMIZED for case of 7-bit ASCII 0000..007f all valid
int UTF8GenericScanFastAscii(const UTF8ScanObj* st, absl::string_view str,
                             int* bytes_consumed) {
                             ...
  int exit_reason;
  do {
    //  Skip 8 bytes of ASCII at a whack; no endianness issue
    while ((src_limit - src >= 8) &&
           (((UNALIGNED_LOAD32(src + 0) | UNALIGNED_LOAD32(src + 4)) &
             0x80808080) == 0)) {
      src += 8;
    }
    //  Run state table on the rest
    int rest_consumed;
    exit_reason = UTF8GenericScan(
        st, absl::ClippedSubstr(str, src - initial_src), &rest_consumed);
    src += rest_consumed;
  } while (exit_reason == kExitDoAgain);

  *bytes_consumed = src - initial_src;
  return exit_reason;
}
// Scan a UTF-8 stringpiece based on state table.
// Always scan complete UTF-8 characters
// Set number of bytes scanned. Return reason for exiting
// OPTIMIZED for case of 7-bit ASCII 0000..007f all valid
int UTF8GenericScanFastAscii(const UTF8ScanObj* st, absl::string_view str,
                             int* bytes_consumed) {
                             ...
  int exit_reason = kExitOK;
  do {
    //  Skip 8 bytes of ASCII at a whack; no endianness issue
    while ((src_limit - src >= 8) &&
           (((UNALIGNED_LOAD32(src + 0) | UNALIGNED_LOAD32(src + 4)) &
             0x80808080) == 0)) {
      src += 8;
    }
    while (src < src_limit && Is7BitAscii(*src)) { // Skip ASCII bytes
      src++;
    }
    if (src < src_limit) {
      //  Run state table on the rest
      int rest_consumed;
      exit_reason = UTF8GenericScan(
          st, absl::ClippedSubstr(str, src - initial_src), &rest_consumed);
      src += rest_consumed;
    }
  } while (exit_reason == kExitDoAgain);

  *bytes_consumed = src - initial_src;
  return exit_reason;
}

Simpler fast paths for InlinedVector.

inlined_vector.h

auto Storage<T, N, A>::Resize(ValueAdapter values, size_type new_size) -> void {
  StorageView storage_view = MakeStorageView();

  IteratorValueAdapter<MoveIterator> move_values(
      MoveIterator(storage_view.data));

  AllocationTransaction allocation_tx(GetAllocPtr());
  ConstructionTransaction construction_tx(GetAllocPtr());

  absl::Span<value_type> construct_loop;
  absl::Span<value_type> move_construct_loop;
  absl::Span<value_type> destroy_loop;

  if (new_size > storage_view.capacity) {
  ...
  } else if (new_size > storage_view.size) {
    construct_loop = {storage_view.data + storage_view.size,
                      new_size - storage_view.size};
  } else {
    destroy_loop = {storage_view.data + new_size, storage_view.size - new_size};
  }
auto Storage<T, N, A>::Resize(ValueAdapter values, size_type new_size) -> void {
  StorageView storage_view = MakeStorageView();
  auto* const base = storage_view.data;
  const size_type size = storage_view.size;
  auto* alloc = GetAllocPtr();
  if (new_size <= size) {
    // Destroy extra old elements.
    inlined_vector_internal::DestroyElements(alloc, base + new_size,
                                             size - new_size);
  } else if (new_size <= storage_view.capacity) {
    // Construct new elements in place.
    inlined_vector_internal::ConstructElements(alloc, base + size, &values,
                                               new_size - size);
  } else {
  ...
  }

Fast path for common cases of initializing 1-D to 4-D tensors.

tensor_shape.cc

template <class Shape>
TensorShapeBase<Shape>::TensorShapeBase(gtl::ArraySlice<int64> dim_sizes) {
  set_tag(REP16);
  set_data_type(DT_INVALID);
  set_ndims_byte(0);
  set_num_elements(1);
  for (int64 s : dim_sizes) {
    AddDim(internal::SubtleMustCopy(s));
  }
}
template <class Shape>
void TensorShapeBase<Shape>::InitDims(gtl::ArraySlice<int64> dim_sizes) {
  DCHECK_EQ(tag(), REP16);

  // Allow sizes that are under kint64max^0.25 so that 4-way multiplication
  // below cannot overflow.
  static const uint64 kMaxSmall = 0xd744;
  static_assert(kMaxSmall * kMaxSmall * kMaxSmall * kMaxSmall <= kint64max,
                "bad overflow check");
  bool large_size = false;
  for (auto s : dim_sizes) {
    if (s > kMaxSmall) {
      large_size = true;
      break;
    }
  }

  if (!large_size) {
    // Every size fits in 16 bits; use fast-paths for dims in {1,2,3,4}.
    uint16* dst = as16()->dims_;
    switch (dim_sizes.size()) {
      case 1: {
        set_ndims_byte(1);
        const int64 size = dim_sizes[0];
        const bool neg = Set16(kIsPartial, dst, 0, size);
        set_num_elements(neg ? -1 : size);
        return;
      }
      case 2: {
        set_ndims_byte(2);
        const int64 size0 = dim_sizes[0];
        const int64 size1 = dim_sizes[1];
        bool neg = Set16(kIsPartial, dst, 0, size0);
        neg |= Set16(kIsPartial, dst, 1, size1);
        set_num_elements(neg ? -1 : (size0 * size1));
        return;
      }
      case 3: {
      ...
      }
      case 4: {
      ...
      }
    }
  }

  set_ndims_byte(0);
  set_num_elements(1);
  for (int64 s : dim_sizes) {
    AddDim(internal::SubtleMustCopy(s));
  }
}

Make varint parser fast path cover just the 1-byte case, instead of covering 1-byte and 2-byte cases.

Reducing the size of the (inlined) fast path reduces code size and icache pressure, which leads to improved performance.

parse_context.h

template <typename T>
PROTOBUF_NODISCARD const char* VarintParse(const char* p, T* out) {
  auto ptr = reinterpret_cast<const uint8_t*>(p);
  uint32_t res = ptr[0];
  if (!(res & 0x80)) {
    *out = res;
    return p + 1;
  }
  uint32_t byte = ptr[1];
  res += (byte - 1) << 7;
  if (!(byte & 0x80)) {
    *out = res;
    return p + 2;
  }
  return VarintParseSlow(p, res, out);
}
template <typename T>
PROTOBUF_NODISCARD const char* VarintParse(const char* p, T* out) {
  auto ptr = reinterpret_cast<const uint8_t*>(p);
  uint32_t res = ptr[0];
  if (!(res & 0x80)) {
    *out = res;
    return p + 1;
  }
  return VarintParseSlow(p, res, out);
}

parse_context.cc

std::pair<const char*, uint32_t> VarintParseSlow32(const char* p,
                                                   uint32_t res) {
  for (std::uint32_t i = 2; i < 5; i++) {
  ...
}
...
std::pair<const char*, uint64_t> VarintParseSlow64(const char* p,
                                                   uint32_t res32) {
  uint64_t res = res32;
  for (std::uint32_t i = 2; i < 10; i++) {
  ...
}
std::pair<const char*, uint32_t> VarintParseSlow32(const char* p,
                                                   uint32_t res) {
  for (std::uint32_t i = 1; i < 5; i++) {
  ...
}
...
std::pair<const char*, uint64_t> VarintParseSlow64(const char* p,
                                                   uint32_t res32) {
  uint64_t res = res32;
  for (std::uint32_t i = 1; i < 10; i++) {
  ...
}

Skip significant work in RPC_Stats_Measurement addition if no errors have occurred.

rpc-stats.h

struct RPC_Stats_Measurement {
  ...
  double errors[RPC::NUM_ERRORS];
struct RPC_Stats_Measurement {
  ...
  double get_errors(int index) const { return errors[index]; }
  void set_errors(int index, double value) {
    errors[index] = value;
    any_errors_set = true;
  }
 private:
  ...
  // We make this private so that we can keep track of whether any of
  // these values have been set to non-zero values.
  double errors[RPC::NUM_ERRORS];
  bool any_errors_set;  // True iff any of the errors[i] values are non-zero

rpc-stats.cc

void RPC_Stats_Measurement::operator+=(const RPC_Stats_Measurement& x) {
  ...
  for (int i = 0; i < RPC::NUM_ERRORS; ++i) {
    errors[i] += x.errors[i];
  }
}
void RPC_Stats_Measurement::operator+=(const RPC_Stats_Measurement& x) {
  ...
  if (x.any_errors_set) {
    for (int i = 0; i < RPC::NUM_ERRORS; ++i) {
      errors[i] += x.errors[i];
    }
    any_errors_set = true;
  }
}

Do array lookup on first byte of string to often avoid fingerprinting full string.

soft-tokens-helper.cc

bool SoftTokensHelper::IsSoftToken(const StringPiece& token) const {
  return soft_tokens_.find(Fingerprint(token.data(), token.size())) !=
      soft_tokens_.end();
}

soft-tokens-helper.h

class SoftTokensHelper {
 ...
 private:
  ...
  // Since soft tokens are mostly punctuation-related, for performance
  // purposes, we keep an array filter_.  filter_[i] is true iff any
  // of the soft tokens start with the byte value 'i'.  This avoids
  // fingerprinting a term in the common case, since we can just do an array
  // lookup based on the first byte, and if filter_[b] is false, then
  // we can return false immediately.
  bool          filter_[256];
  ...
};

inline bool SoftTokensHelper::IsSoftToken(const StringPiece& token) const {
  if (token.size() >= 1) {
    char first_char = token.data()[0];
    if (!filter_[first_char]) {
      return false;
    }
  }
  return IsSoftTokenFallback(token);
}

soft-tokens-helper.cc

bool SoftTokensHelper::IsSoftTokenFallback(const StringPiece& token) const {
  return soft_tokens_.find(Fingerprint(token.data(), token.size())) !=
      soft_tokens_.end();
}

Precompute expensive information once

Precompute a TensorFlow graph execution node property that allows us to quickly rule out certain unusual cases.

executor.cc

struct NodeItem {
  ...
  bool kernel_is_expensive = false;  // True iff kernel->IsExpensive()
  bool kernel_is_async = false;      // True iff kernel->AsAsync() != nullptr
  bool is_merge = false;             // True iff IsMerge(node)
  ...
  if (IsEnter(node)) {
  ...
  } else if (IsExit(node)) {
  ...
  } else if (IsNextIteration(node)) {
  ...
  } else {
    // Normal path for most nodes
    ...
  }
struct NodeItem {
  ...
  bool kernel_is_expensive : 1;  // True iff kernel->IsExpensive()
  bool kernel_is_async : 1;      // True iff kernel->AsAsync() != nullptr
  bool is_merge : 1;             // True iff IsMerge(node)
  bool is_enter : 1;             // True iff IsEnter(node)
  bool is_exit : 1;              // True iff IsExit(node)
  bool is_control_trigger : 1;   // True iff IsControlTrigger(node)
  bool is_sink : 1;              // True iff IsSink(node)
  // True iff IsEnter(node) || IsExit(node) || IsNextIteration(node)
  bool is_enter_exit_or_next_iter : 1;
  ...
  if (!item->is_enter_exit_or_next_iter) {
    // Fast path for nodes types that don't need special handling
    DCHECK_EQ(input_frame, output_frame);
    ...
  } else if (item->is_enter) {
  ...
  } else if (item->is_exit) {
  ...
  } else {
    DCHECK(IsNextIteration(node));
    ...
  }

Precompute 256-element array and use it during trigram initialization.

byte_trigram_classifier.cc

void ByteTrigramClassifier::VerifyModel(void) const {
  ProbT class_sums[num_classes_];
  for (int cls = 0; cls < num_classes_; cls++) {
    class_sums[cls] = 0;
  }
  for (ByteNgramId id = 0; id < trigrams_.num_trigrams(); id++) {
    for (int cls = 0; cls < num_classes_; ++cls) {
      class_sums[cls] += Prob(trigram_probs_[id].log_probs[cls]);
    }
  }
  ...
}                         
void ByteTrigramClassifier::VerifyModel(void) const {
  CHECK_EQ(sizeof(ByteLogProbT), 1);
  ProbT fast_prob[256];
  for (int b = 0; b < 256; b++) {
    fast_prob[b] = Prob(static_cast<ByteLogProbT>(b));
  }

  ProbT class_sums[num_classes_];
  for (int cls = 0; cls < num_classes_; cls++) {
    class_sums[cls] = 0;
  }
  for (ByteNgramId id = 0; id < trigrams_.num_trigrams(); id++) {
    for (int cls = 0; cls < num_classes_; ++cls) {
      class_sums[cls] += fast_prob[trigram_probs_[id].log_probs[cls]];
    }
  }
  ...
}                         

General advice: check for malformed inputs at module boundaries instead of repeating checks internally.
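
As a hypothetical sketch of this advice (Tile, PixelCount, and TotalPixels are made-up names, not from any CL here): validate input once at the module boundary and let internal helpers assume well-formed input, guarded only by debug-mode assertions.

#include <cassert>
#include <cstdint>
#include <vector>

namespace image {

struct Tile {
  int width = 0;
  int height = 0;
};

namespace internal {

// Internal helper on the hot path: no per-call validation.  Callers are
// required to pass a well-formed tile; we only assert in debug builds.
inline int64_t PixelCount(const Tile& tile) {
  assert(tile.width > 0 && tile.height > 0);
  return static_cast<int64_t>(tile.width) * tile.height;
}

}  // namespace internal

// Module boundary: validate the input once, then call internal helpers
// without repeating the checks.
inline bool TotalPixels(const std::vector<Tile>& tiles, int64_t* total) {
  for (const Tile& t : tiles) {
    if (t.width <= 0 || t.height <= 0) return false;  // malformed input
  }
  int64_t sum = 0;
  for (const Tile& t : tiles) sum += internal::PixelCount(t);
  *total = sum;
  return true;
}

}  // namespace image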

Move expensive computations outside loops

Move bounds computation outside loop.

literal_linearizer.cc

for (int64 i = 0; i < src_shape.dimensions(dimension_numbers.front());
     ++i) {
int64 dim_front = src_shape.dimensions(dimension_numbers.front());
const uint8* src_buffer_data = src_buffer.data();
uint8* dst_buffer_data = dst_buffer.data();
for (int64 i = 0; i < dim_front; ++i) {

Defer expensive computation

Defer GetSubSharding call until needed, which reduces 43 seconds of CPU time to 2 seconds.

sharding_propagation.cc

HloSharding alternative_sub_sharding =
    user.sharding().GetSubSharding(user.shape(), {i});
if (user.operand(i) == &instruction &&
    hlo_sharding_util::IsShardingMoreSpecific(alternative_sub_sharding,
                                              sub_sharding)) {
  sub_sharding = alternative_sub_sharding;
}
if (user.operand(i) == &instruction) {
  // Only evaluate GetSubSharding if this operand is of interest,
  // as it is relatively expensive.
  HloSharding alternative_sub_sharding =
      user.sharding().GetSubSharding(user.shape(), {i});
  if (hlo_sharding_util::IsShardingMoreSpecific(
          alternative_sub_sharding, sub_sharding)) {
    sub_sharding = alternative_sub_sharding;
  }
}

Don't update stats eagerly; compute them on demand.

Do not update stats on the very frequent allocation/deallocation calls. Instead, compute stats on demand when the much less frequently called Stats() method is invoked.
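
A minimal sketch of this pattern with a hypothetical free-list pool (FreeListPool and GetStats are made-up names, not from the CL): the frequently called Allocate()/Release() paths do no stats bookkeeping at all, and the rarely called stats accessor derives the numbers on demand.

#include <cstddef>
#include <new>
#include <vector>

class FreeListPool {
 public:
  struct Stats {
    size_t free_objects = 0;
    size_t free_bytes = 0;
  };

  void* Allocate() {
    if (!free_.empty()) {  // hot path: no counters updated
      void* p = free_.back();
      free_.pop_back();
      return p;
    }
    return ::operator new(kObjectSize);
  }

  // Hot path: no counters updated.
  void Release(void* p) { free_.push_back(p); }

  // Cold path: compute the stats when asked, instead of maintaining them
  // eagerly on every Allocate()/Release() call.
  Stats GetStats() const {
    Stats s;
    s.free_objects = free_.size();
    s.free_bytes = free_.size() * kObjectSize;
    return s;
  }

 private:
  static constexpr size_t kObjectSize = 64;
  std::vector<void*> free_;
};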

Preallocate 10 nodes not 200 for query handling in Google's web server.

A simple change that reduced web server's CPU usage by 7.5%.

querytree.h

static const int kInitParseTreeSize = 200;   // initial size of querynode pool
static const int kInitParseTreeSize = 10;   // initial size of querynode pool

Change search order for 19% throughput improvement.

An old search system (circa 2000) had two tiers: one contained a full-text index, and the other tier contained just the index for the title and anchor terms. We used to search the smaller title/anchor tier first. Counter-intuitively, we found that it is cheaper to search the larger full-text index tier first since if we reach the end of the full-text tier, we can entirely skip searching the title/anchor tier (a subset of the full-text tier). This happened reasonably often and allowed us to reduce the average number of disk seeks to process a query.

See discussion of title and anchor text handling in The Anatomy of a Large-Scale Hypertextual Web Search Engine for background information.

Specialize code

A particular performance-sensitive call-site may not need the full generality provided by a general-purpose library. Consider writing specialized code in such cases instead of calling the general-purpose code if it provides a performance improvement.

Custom printing code for Histogram class is 4x as fast as sprintf.

This code is performance sensitive because it is invoked when monitoring systems gather statistics from various servers.

histogram_export.cc

void Histogram::PopulateBuckets(const string &prefix,
                                expvar::MapProto *const var) const {
                                ...
  for (int i = min_bucket; i <= max_bucket; ++i) {
    const double count = BucketCount(i);
    if (!export_empty_buckets && count == 0.0) continue;
    acc += count;
    // The label format of exported buckets for discrete histograms
    // specifies an inclusive upper bound, which is the same as in
    // the original Histogram implementation.  This format is not
    // applicable to non-discrete histograms, so a half-open interval
    // is used for them, with "_" instead of "-" as a separator to
    // make possible to distinguish the formats.
    string key =
        options_.export_cumulative_counts() ?
            StringPrintf("%.12g", boundaries_->BucketLimit(i)) :
        options_.discrete() ?
            StringPrintf("%.0f-%.0f",
                         ceil(boundaries_->BucketStart(i)),
                         ceil(boundaries_->BucketLimit(i)) - 1.0) :
            StringPrintf("%.12g_%.12g",
                         boundaries_->BucketStart(i),
                         boundaries_->BucketLimit(i));
    EscapeMapKey(&key);
    const double value = options_.export_cumulative_counts() ? acc : count;
    expvar::AddMapFloat(StrCat(prefix,
                               options_.export_bucket_key_prefix(),
                               key),
                        value * count_mult,
                        var);
  }
// Format "val" according to format.  If "need_escape" is true, then the
// format can produce output with a '.' in it, and the result will be escaped.
// If "need_escape" is false, then the caller guarantees that format is
// such that the resulting number will not have any '.' characters and
// therefore we can avoid calling EscapeKey.
// The function is free to use "*scratch" for scratch space if necessary,
// and the resulting StringPiece may point into "*scratch".
static StringPiece FormatNumber(const char* format,
                                bool need_escape,
                                double val, string* scratch) {
  // This routine is specialized to work with only a limited number of formats
  DCHECK(StringPiece(format) == "%.0f" || StringPiece(format) == "%.12g");

  scratch->clear();
  if (val == trunc(val) && val >= kint32min && val <= kint32max) {
    // An integer for which we can just use StrAppend
    StrAppend(scratch, static_cast<int32>(val));
    return StringPiece(*scratch);
  } else if (isinf(val)) {
    // Infinity, represent as just 'inf'.
    return StringPiece("inf", 3);
  } else {
    // Format according to "format", and possibly escape.
    StringAppendF(scratch, format, val);
    if (need_escape) {
      EscapeMapKey(scratch);
    } else {
      DCHECK(!StringPiece(*scratch).contains("."));
    }
    return StringPiece(*scratch);
  }
}
...
void Histogram::PopulateBuckets(const string &prefix,
                                expvar::MapProto *const var) const {
                                ...
  const string full_key_prefix = StrCat(prefix,
                                        options_.export_bucket_key_prefix());
  string key = full_key_prefix;  // Keys will start with "full_key_prefix".
  string start_scratch;
  string limit_scratch;
  const bool cumul_counts = options_.export_cumulative_counts();
  const bool discrete = options_.discrete();
  for (int i = min_bucket; i <= max_bucket; ++i) {
    const double count = BucketCount(i);
    if (!export_empty_buckets && count == 0.0) continue;
    acc += count;
    // The label format of exported buckets for discrete histograms
    // specifies an inclusive upper bound, which is the same as in
    // the original Histogram implementation.  This format is not
    // applicable to non-discrete histograms, so a half-open interval
    // is used for them, with "_" instead of "-" as a separator to
    // make possible to distinguish the formats.
    key.resize(full_key_prefix.size());  // Start with full_key_prefix.
    DCHECK_EQ(key, full_key_prefix);

    const double limit = boundaries_->BucketLimit(i);
    if (cumul_counts) {
      StrAppend(&key, FormatNumber("%.12g", true, limit, &limit_scratch));
    } else {
      const double start = boundaries_->BucketStart(i);
      if (discrete) {
        StrAppend(&key,
                  FormatNumber("%.0f", false, ceil(start), &start_scratch),
                  "-",
                  FormatNumber("%.0f", false, ceil(limit) - 1.0,
                               &limit_scratch));
      } else {
        StrAppend(&key,
                  FormatNumber("%.12g", true, start, &start_scratch),
                  "_",
                  FormatNumber("%.12g", true, limit, &limit_scratch));
      }
    }
    const double value = cumul_counts ? acc : count;

    // Add to map var
    expvar::AddMapFloat(key, value * count_mult, var);
  }
}

Add specializations for VLOG(1), VLOG(2), … for speed and smaller code size.

VLOG is a heavily used macro throughout the code base. This change avoids passing an extra integer constant at nearly every call site (if the log level is constant at the call site, as it almost always is, as in VLOG(1) << ...), which saves code space.

vlog_is_on.h

class VLogSite final {
 public:
  ...
  bool IsEnabled(int level) {
    int stale_v = v_.load(std::memory_order_relaxed);
    if (ABSL_PREDICT_TRUE(level > stale_v)) {
      return false;
    }

    // We put everything other than the fast path, i.e. vlogging is initialized
    // but not on, behind an out-of-line function to reduce code size.
    return SlowIsEnabled(stale_v, level);
  }
  ...
 private:
  ...
  ABSL_ATTRIBUTE_NOINLINE
  bool SlowIsEnabled(int stale_v, int level);
  ...
};
class VLogSite final {
 public:
  ...
  bool IsEnabled(int level) {
    int stale_v = v_.load(std::memory_order_relaxed);
    if (ABSL_PREDICT_TRUE(level > stale_v)) {
      return false;
    }

    // We put everything other than the fast path, i.e. vlogging is initialized
    // but not on, behind an out-of-line function to reduce code size.
    // "level" is almost always a call-site constant, so we can save a bit
    // of code space by special-casing for levels 1, 2, and 3.
#if defined(__has_builtin) && __has_builtin(__builtin_constant_p)
    if (__builtin_constant_p(level)) {
      if (level == 0) return SlowIsEnabled0(stale_v);
      if (level == 1) return SlowIsEnabled1(stale_v);
      if (level == 2) return SlowIsEnabled2(stale_v);
      if (level == 3) return SlowIsEnabled3(stale_v);
      if (level == 4) return SlowIsEnabled4(stale_v);
      if (level == 5) return SlowIsEnabled5(stale_v);
    }
#endif
    return SlowIsEnabled(stale_v, level);
    ...
 private:
  ...
  ABSL_ATTRIBUTE_NOINLINE
  bool SlowIsEnabled(int stale_v, int level);
  ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled0(int stale_v);
  ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled1(int stale_v);
  ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled2(int stale_v);
  ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled3(int stale_v);
  ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled4(int stale_v);
  ABSL_ATTRIBUTE_NOINLINE bool SlowIsEnabled5(int stale_v);
  ...
};

vlog_is_on.cc

bool VLogSite::SlowIsEnabled0(int stale_v) { return SlowIsEnabled(stale_v, 0); }
bool VLogSite::SlowIsEnabled1(int stale_v) { return SlowIsEnabled(stale_v, 1); }
bool VLogSite::SlowIsEnabled2(int stale_v) { return SlowIsEnabled(stale_v, 2); }
bool VLogSite::SlowIsEnabled3(int stale_v) { return SlowIsEnabled(stale_v, 3); }
bool VLogSite::SlowIsEnabled4(int stale_v) { return SlowIsEnabled(stale_v, 4); }
bool VLogSite::SlowIsEnabled5(int stale_v) { return SlowIsEnabled(stale_v, 5); }

Replace RE2 call with a simple prefix match when possible.

read_matcher.cc

enum MatchItemType {
  MATCH_TYPE_INVALID,
  MATCH_TYPE_RANGE,
  MATCH_TYPE_EXACT,
  MATCH_TYPE_REGEXP,
};
enum MatchItemType {
  MATCH_TYPE_INVALID,
  MATCH_TYPE_RANGE,
  MATCH_TYPE_EXACT,
  MATCH_TYPE_REGEXP,
  MATCH_TYPE_PREFIX,   // Special type for regexp ".*"
};

read_matcher.cc

p->type = MATCH_TYPE_REGEXP;
term.NonMetaPrefix().CopyToString(&p->prefix);
if (term.RegexpSuffix() == ".*") {
  // Special case for a regexp that matches anything, so we can
  // bypass RE2::FullMatch
  p->type = MATCH_TYPE_PREFIX;
} else {
  p->type = MATCH_TYPE_REGEXP;

Use StrCat rather than StringPrintf to format IP addresses.

ipaddress.cc

string IPAddress::ToString() const {
  char buf[INET6_ADDRSTRLEN];

  switch (address_family_) {
    case AF_INET:
      CHECK(inet_ntop(AF_INET, &addr_.addr4, buf, INET6_ADDRSTRLEN) != NULL);
      return buf;
    case AF_INET6:
      CHECK(inet_ntop(AF_INET6, &addr_.addr6, buf, INET6_ADDRSTRLEN) != NULL);
      return buf;
    case AF_UNSPEC:
      LOG(DFATAL) << "Calling ToString() on an empty IPAddress";
      return "";
    default:
      LOG(FATAL) << "Unknown address family " << address_family_;
  }
}
...
string IPAddressToURIString(const IPAddress& ip) {
  switch (ip.address_family()) {
    case AF_INET6:
      return StringPrintf("[%s]", ip.ToString().c_str());
    default:
      return ip.ToString();
  }
}
...
string SocketAddress::ToString() const {
  return IPAddressToURIString(host_) + StringPrintf(":%u", port_);
}
string IPAddress::ToString() const {
  char buf[INET6_ADDRSTRLEN];

  switch (address_family_) {
    case AF_INET: {
      uint32 addr = gntohl(addr_.addr4.s_addr);
      int a1 = static_cast<int>((addr >> 24) & 0xff);
      int a2 = static_cast<int>((addr >> 16) & 0xff);
      int a3 = static_cast<int>((addr >> 8) & 0xff);
      int a4 = static_cast<int>(addr & 0xff);
      return StrCat(a1, ".", a2, ".", a3, ".", a4);
    }
    case AF_INET6:
      CHECK(inet_ntop(AF_INET6, &addr_.addr6, buf, INET6_ADDRSTRLEN) != NULL);
      return buf;
    case AF_UNSPEC:
      LOG(DFATAL) << "Calling ToString() on an empty IPAddress";
      return "";
    default:
      LOG(FATAL) << "Unknown address family " << address_family_;
  }
}
...
string IPAddressToURIString(const IPAddress& ip) {
  switch (ip.address_family()) {
    case AF_INET6:
      return StrCat("[", ip.ToString(), "]");
    default:
      return ip.ToString();
  }
}
...
string SocketAddress::ToString() const {
  return StrCat(IPAddressToURIString(host_), ":", port_);
}

Use caching to avoid repeated work

Cache based on precomputed fingerprint of large serialized proto.

dp_ops.cc

InputOutputMappingProto mapping_proto;
PLAQUE_OP_REQUIRES(
    mapping_proto.ParseFromStringPiece(GetAttrMappingProto(state)),
    absl::InternalError("Failed to parse InputOutputMappingProto"));
ParseMapping(mapping_proto);
uint64 mapping_proto_fp = GetAttrMappingProtoFp(state);
{
  absl::MutexLock l(&fp_to_iometa_mu);
  if (fp_to_iometa == nullptr) {
    fp_to_iometa =
        new absl::flat_hash_map<uint64, std::unique_ptr<ProgramIOMetadata>>;
  }
  auto it = fp_to_iometa->find(mapping_proto_fp);
  if (it != fp_to_iometa->end()) {
    io_metadata_ = it->second.get();
  } else {
    auto serial_proto = GetAttrMappingProto(state);
    DCHECK_EQ(mapping_proto_fp, Fingerprint(serial_proto));
    InputOutputMappingProto mapping_proto;
    PLAQUE_OP_REQUIRES(
        mapping_proto.ParseFromStringPiece(GetAttrMappingProto(state)),
        absl::InternalError("Failed to parse InputOutputMappingProto"));
    auto io_meta = ParseMapping(mapping_proto);
    io_metadata_ = io_meta.get();
    (*fp_to_iometa)[mapping_proto_fp] = std::move(io_meta);
  }
}

Make the compiler’s job easier

The compiler may have trouble optimizing through layers of abstractions because it must make conservative assumptions about the overall behavior of the code, or may not make the right speed vs. size tradeoffs. The application programmer will often know more about the behavior of the system and can aid the compiler by rewriting the code to operate at a lower level. However, only do this when profiles show an issue since compilers will often get things right on their own. Looking at the generated assembly code for performance critical routines can help you understand if the compiler is “getting it right”. Pprof provides a very helpful display of source code interleaved with disassembly and annotated with performance data.

Some techniques that may be useful:

  1. Avoid function calls in hot functions (allows the compiler to avoid frame setup costs).
  2. Move slow-path code into a separate tail-called function.
  3. Copy small amounts of data into local variables before heavy use. This can let the compiler assume there is no aliasing with other data, which may improve auto-vectorization and register allocation (see the sketch after this list).
  4. Hand-unroll very hot loops.
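
For example, here is a hypothetical sketch of technique 3 (the Scaler struct is made up, not one of the CLs below): copying member fields into locals before a hot loop means the compiler no longer has to assume that stores through out might modify them, so it can keep the values in registers and vectorize more aggressively.

#include <cstddef>

struct Scaler {
  float scale_ = 1.0f;
  size_t n_ = 0;

  void Apply(const float* in, float* out) const {
    // Copy small amounts of data into locals before heavy use: without this,
    // the compiler must reload scale_ (and possibly n_) on every iteration,
    // since the stores through "out" might alias them.
    const float scale = scale_;
    const size_t n = n_;
    for (size_t i = 0; i < n; ++i) {
      out[i] = in[i] * scale;
    }
  }
};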

Speed up ShapeUtil::ForEachState by replacing absl::Span with raw pointers to the underlying arrays.

shape_util.h

struct ForEachState {
  ForEachState(const Shape& s, absl::Span<const int64_t> b,
               absl::Span<const int64_t> c, absl::Span<const int64_t> i);
  ~ForEachState();

  const Shape& shape;
  const absl::Span<const int64_t> base;
  const absl::Span<const int64_t> count;
  const absl::Span<const int64_t> incr;
struct ForEachState {
  ForEachState(const Shape& s, absl::Span<const int64_t> b,
               absl::Span<const int64_t> c, absl::Span<const int64_t> i);
  inline ~ForEachState() = default;

  const Shape& shape;
  // Pointers to arrays of the passed-in spans
  const int64_t* const base;
  const int64_t* const count;
  const int64_t* const incr;

Hand unroll cyclic redundancy check (CRC) computation loop.

crc.cc

void CRC32::Extend(uint64 *lo, uint64 *hi, const void *bytes, size_t length)
                      const {
                      ...
  // Process bytes 4 at a time
  while ((p + 4) <= e) {
    uint32 c = l ^ WORD(p);
    p += 4;
    l = this->table3_[c & 0xff] ^
        this->table2_[(c >> 8) & 0xff] ^
        this->table1_[(c >> 16) & 0xff] ^
        this->table0_[c >> 24];
  }

  // Process the last few bytes
  while (p != e) {
    int c = (l & 0xff) ^ *p++;
    l = this->table0_[c] ^ (l >> 8);
  }
  *lo = l;
}
void CRC32::Extend(uint64 *lo, uint64 *hi, const void *bytes, size_t length)
                      const {
                      ...
#define STEP {                                  \
    uint32 c = l ^ WORD(p);                     \
    p += 4;                                     \
    l = this->table3_[c & 0xff] ^               \
        this->table2_[(c >> 8) & 0xff] ^        \
        this->table1_[(c >> 16) & 0xff] ^       \
        this->table0_[c >> 24];                 \
}

  // Process bytes 16 at a time
  while ((e-p) >= 16) {
    STEP;
    STEP;
    STEP;
    STEP;
  }

  // Process bytes 4 at a time
  while ((p + 4) <= e) {
    STEP;
  }
#undef STEP

  // Process the last few bytes
  while (p != e) {
    int c = (l & 0xff) ^ *p++;
    l = this->table0_[c] ^ (l >> 8);
  }
  *lo = l;
}

Handle four characters at a time when parsing Spanner keys.

  1. Hand unroll the loop to deal with four characters at a time rather than using memchr.

  2. Manually unroll the loop for finding separated sections of the name.

  3. Go backwards (rather than forwards) to find the '#'-separated portions of the name, since the first part is likely the longest.

key.cc

void Key::InitSeps(const char* start) {
  const char* base = &rep_[0];
  const char* limit = base + rep_.size();
  const char* s = start;

  DCHECK_GE(s, base);
  DCHECK_LT(s, limit);

  for (int i = 0; i < 3; i++) {
    s = (const char*)memchr(s, '#', limit - s);
    DCHECK(s != NULL);
    seps_[i] = s - base;
    s++;
  }
}
inline const char* ScanBackwardsForSep(const char* base, const char* p) {
  while (p >= base + 4) {
    if (p[0] == '#') return p;
    if (p[-1] == '#') return p-1;
    if (p[-2] == '#') return p-2;
    if (p[-3] == '#') return p-3;
    p -= 4;
  }
  while (p >= base && *p != '#') p--;
  return p;
}

void Key::InitSeps(const char* start) {
  const char* base = &rep_[0];
  const char* limit = base + rep_.size();
  const char* s = start;

  DCHECK_GE(s, base);
  DCHECK_LT(s, limit);

  // We go backwards from the end of the string, rather than forwards,
  // since the directory name might be long and definitely doesn't contain
  // any '#' characters.
  const char* p = ScanBackwardsForSep(s, limit - 1);
  DCHECK(*p == '#');
  seps_[2] = p - base;
  p--;

  p = ScanBackwardsForSep(s, p);
  DCHECK(*p == '#');
  seps_[1] = p - base;
  p--;

  p = ScanBackwardsForSep(s, p);
  DCHECK(*p == '#');
  seps_[0] = p - base;
}

Avoid frame setup costs by converting ABSL_LOG(FATAL) to ABSL_DCHECK(false).

arena_cleanup.h

inline ABSL_ATTRIBUTE_ALWAYS_INLINE size_t Size(Tag tag) {
  if (!EnableSpecializedTags()) return sizeof(DynamicNode);

  switch (tag) {
    case Tag::kDynamic:
      return sizeof(DynamicNode);
    case Tag::kString:
      return sizeof(TaggedNode);
    case Tag::kCord:
      return sizeof(TaggedNode);
    default:
      ABSL_LOG(FATAL) << "Corrupted cleanup tag: " << static_cast<int>(tag);
      return sizeof(DynamicNode);
  }
}
inline ABSL_ATTRIBUTE_ALWAYS_INLINE size_t Size(Tag tag) {
  if (!EnableSpecializedTags()) return sizeof(DynamicNode);

  switch (tag) {
    case Tag::kDynamic:
      return sizeof(DynamicNode);
    case Tag::kString:
      return sizeof(TaggedNode);
    case Tag::kCord:
      return sizeof(TaggedNode);
    default:
      ABSL_DCHECK(false) << "Corrupted cleanup tag: " << static_cast<int>(tag);
      return sizeof(DynamicNode);
  }
}

Reduce stats collection costs

Balance the utility of stats and other behavioral information about a system against the cost of maintaining that information. The extra information can often help people to understand and improve high-level behavior, but can also be costly to maintain.

Stats that are not useful can be dropped altogether.

Stop maintaining expensive stats about number of alarms and closures in SelectServer.

Part of changes that reduce time for setting an alarm from 771 ns to 271 ns.

selectserver.h

class SelectServer {
 public:
 ...
 protected:
  ...
  scoped_ptr<MinuteTenMinuteHourStat> num_alarms_stat_;
  ...
  scoped_ptr<MinuteTenMinuteHourStat> num_closures_stat_;
  ...
};
// Selectserver class
class SelectServer {
 ...
 protected:
 ...
};

selectserver.cc

void SelectServer::AddAlarmInternal(Alarmer* alarmer,
                                    int offset_in_ms,
                                    int id,
                                    bool is_periodic) {
                                    ...
  alarms_->insert(alarm);
  num_alarms_stat_->IncBy(1);
  ...
}
void SelectServer::AddAlarmInternal(Alarmer* alarmer,
                                    int offset_in_ms,
                                    int id,
                                    bool is_periodic) {
                                    ...
  alarms_->Add(alarm);
  ...
}

selectserver.cc

void SelectServer::RemoveAlarm(Alarmer* alarmer, int id) {
      ...
      alarms_->erase(alarm);
      num_alarms_stat_->IncBy(-1);
      ...
}
void SelectServer::RemoveAlarm(Alarmer* alarmer, int id) {
      ...
      alarms_->Remove(alarm);
      ...
}

Often, stats or other properties can be maintained for a sample of the elements handled by the system (e.g., RPC requests, input records, users). Many subsystems use this approach (tcmalloc allocation tracking, /requestz status pages, Dapper samples).

When sampling, consider reducing the sampling rate when appropriate.

Maintain stats for just a sample of doc info requests.

Sampling allows us to avoid touching 39 histograms and MinuteTenMinuteHour stats for most requests.

generic-leaf-stats.cc

... code that touches 39 histograms to update various stats on every request ...
// Add to the histograms periodically
if (TryLockToUpdateHistogramsDocInfo(docinfo_stats, bucket)) {
  // Returns true and grabs bucket->lock only if we should sample this
  // request for maintaining stats
  ... code that touches 39 histograms to update various stats ...
  bucket->lock.Unlock();
}

Reduce sampling rate and make faster sampling decisions.

This change reduces the sampling rate from 1 in 10 to 1 in 32. Furthermore, we now keep execution-time stats just for the sampled events and speed up the sampling decision by using a power-of-two modulus. This code is called on every packet in the Google Meet video conferencing system and needed performance work to keep up with capacity demands during the first part of the COVID outbreak, as users rapidly migrated to online meetings.

packet_executor.cc

class ScopedPerformanceMeasurement {
 public:
  explicit ScopedPerformanceMeasurement(PacketExecutor* packet_executor)
      : packet_executor_(packet_executor),
        tracer_(packet_executor->packet_executor_trace_threshold_,
                kClosureTraceName) {
    // ThreadCPUUsage is an expensive call. At the time of writing,
    // it takes over 400ns, or roughly 30 times slower than absl::Now,
    // so we sample only 10% of closures to keep the cost down.
    if (packet_executor->closures_executed_ % 10 == 0) {
      thread_cpu_usage_start_ = base::ThreadCPUUsage();
    }

    // Sample start time after potentially making the above expensive call,
    // so as not to pollute wall time measurements.
    run_start_time_ = absl::Now();
  }

  ~ScopedPerformanceMeasurement() {
ScopedPerformanceMeasurement::ScopedPerformanceMeasurement(
    PacketExecutor* packet_executor)
    : packet_executor_(packet_executor),
      tracer_(packet_executor->packet_executor_trace_threshold_,
              kClosureTraceName) {
  // ThreadCPUUsage is an expensive call. At the time of writing,
  // it takes over 400ns, or roughly 30 times slower than absl::Now,
  // so we sample only 1 in 32 closures to keep the cost down.
  if (packet_executor->closures_executed_ % 32 == 0) {
    thread_cpu_usage_start_ = base::ThreadCPUUsage();
  }

  // Sample start time after potentially making the above expensive call,
  // so as not to pollute wall time measurements.
  run_start_time_ = absl::Now();
}

packet_executor.cc

~ScopedPerformanceMeasurement() {
  auto run_end_time = absl::Now();
  auto run_duration = run_end_time - run_start_time_;

  if (thread_cpu_usage_start_.has_value()) {
  ...
  }

  closure_execution_time->Record(absl::ToInt64Microseconds(run_duration));
ScopedPerformanceMeasurement::~ScopedPerformanceMeasurement() {
  auto run_end_time = absl::Now();
  auto run_duration = run_end_time - run_start_time_;

  if (thread_cpu_usage_start_.has_value()) {
    ...
    closure_execution_time->Record(absl::ToInt64Microseconds(run_duration));
  }

Benchmark results:

Run on (40 X 2793 MHz CPUs); 2020-03-24T20:08:19.991412535-07:00
CPU: Intel Ivybridge with HyperThreading (20 cores) dL1:32KB dL2:256KB dL3:25MB
Benchmark                                      Base (ns)    New (ns) Improvement
----------------------------------------------------------------------------
BM_PacketOverhead_mean                               224          85    +62.0%

Avoid logging on hot code paths

Logging statements can be costly, even if the logging-level for the statement doesn’t actually log anything. E.g., ABSL_VLOG’s implementation requires at least a load and a comparison, which may be a problem in hot code paths. In addition, the presence of the logging code may inhibit compiler optimizations. Consider dropping logging entirely from hot code paths.

Remove logging from guts of memory allocator.

This was a small part of a larger change.

gpu_bfc_allocator.cc

void GPUBFCAllocator::SplitChunk(...) {
  ...
  VLOG(6) << "Adding to chunk map: " << new_chunk->ptr;
  ...
}
...
void GPUBFCAllocator::DeallocateRawInternal(void* ptr) {
  ...
  VLOG(6) << "Chunk at " << c->ptr << " no longer in use";
  ...
}
void GPUBFCAllocator::SplitChunk(...) {
...
}
...
void GPUBFCAllocator::DeallocateRawInternal(void* ptr) {
...
}

Precompute whether or not logging is enabled outside a nested loop.

image_similarity.cc

for (int j = 0; j < output_subimage_size_y; j++) {
  int j1 = j - rad + output_to_integral_subimage_y;
  int j2 = j1 + 2 * rad + 1;
  // Create a pointer for this row's output, taking into account the offset
  // to the full image.
  double *image_diff_ptr = &(*image_diff)(j + min_j, min_i);

  for (int i = 0; i < output_subimage_size_x; i++) {
    ...
    if (VLOG_IS_ON(3)) {
    ...
    }
    ...
  }
}
const bool vlog_3 = DEBUG_MODE ? VLOG_IS_ON(3) : false;

for (int j = 0; j < output_subimage_size_y; j++) {
  int j1 = j - rad + output_to_integral_subimage_y;
  int j2 = j1 + 2 * rad + 1;
  // Create a pointer for this row's output, taking into account the offset
  // to the full image.
  double *image_diff_ptr = &(*image_diff)(j + min_j, min_i);

  for (int i = 0; i < output_subimage_size_x; i++) {
    ...
    if (vlog_3) {
    ...
    }
  }
}

Benchmark results:

Run on (40 X 2801 MHz CPUs); 2016-05-16T15:55:32.250633072-07:00
CPU: Intel Ivybridge with HyperThreading (20 cores) dL1:32KB dL2:256KB dL3:25MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_NCCPerformance/16                   29104     26372     +9.4%
BM_NCCPerformance/64                  473235    425281    +10.1%
BM_NCCPerformance/512               30246238  27622009     +8.7%
BM_NCCPerformance/1k              125651445  113361991     +9.8%
BM_NCCLimitedBoundsPerformance/16       8314      7498     +9.8%
BM_NCCLimitedBoundsPerformance/64     143508    132202     +7.9%
BM_NCCLimitedBoundsPerformance/512   9335684   8477567     +9.2%
BM_NCCLimitedBoundsPerformance/1k   37223897  34201739     +8.1%

Precompute whether logging is enabled and use the result in helper routines.

periodic_call.cc

  VLOG(1) << Logid()
          << "MaybeScheduleAlarmAtNextTick. Time until next real time: "
          << time_until_next_real_time;
          ...
  uint64 next_virtual_time_ms =
      next_virtual_time_ms_ - num_ticks * kResolutionMs;
  CHECK_GE(next_virtual_time_ms, 0);
  ScheduleAlarm(now, delay, next_virtual_time_ms);
}

void ScheduleNextAlarm(uint64 current_virtual_time_ms)
    ABSL_EXCLUSIVE_LOCKS_REQUIRED(mutex_) {
  if (calls_.empty()) {
    VLOG(1) << Logid() << "No calls left, entering idle mode";
    next_real_time_ = absl::InfiniteFuture();
    return;
  }
  uint64 next_virtual_time_ms = FindNextVirtualTime(current_virtual_time_ms);
  auto delay =
      absl::Milliseconds(next_virtual_time_ms - current_virtual_time_ms);
  ScheduleAlarm(GetClock().TimeNow(), delay, next_virtual_time_ms);
}

// An alarm scheduled by this function supersedes all previously scheduled
// alarms. This is ensured through `scheduling_sequence_number_`.
void ScheduleAlarm(absl::Time now, absl::Duration delay,
                   uint64 virtual_time_ms)
    ABSL_EXCLUSIVE_LOCKS_REQUIRED(mutex_) {
  next_real_time_ = now + delay;
  next_virtual_time_ms_ = virtual_time_ms;
  ++ref_count_;  // The Alarm holds a reference.
  ++scheduling_sequence_number_;
  VLOG(1) << Logid() << "ScheduleAlarm. Time : "
          << absl::FormatTime("%M:%S.%E3f", now, absl::UTCTimeZone())
          << ", delay: " << delay << ", virtual time: " << virtual_time_ms
          << ", refs: " << ref_count_
          << ", seq: " << scheduling_sequence_number_
          << ", executor: " << executor_;

  executor_->AddAfter(
      delay, new Alarm(this, virtual_time_ms, scheduling_sequence_number_));
}
  const bool vlog_1 = VLOG_IS_ON(1);

  if (vlog_1) {
    VLOG(1) << Logid()
            << "MaybeScheduleAlarmAtNextTick. Time until next real time: "
            << time_until_next_real_time;
  }
  ...
  uint64 next_virtual_time_ms =
      next_virtual_time_ms_ - num_ticks * kResolutionMs;
  CHECK_GE(next_virtual_time_ms, 0);
  ScheduleAlarm(now, delay, next_virtual_time_ms, vlog_1);
}

void ScheduleNextAlarm(uint64 current_virtual_time_ms, bool vlog_1)
    ABSL_EXCLUSIVE_LOCKS_REQUIRED(mutex_) {
  if (calls_.empty()) {
    if (vlog_1) {
      VLOG(1) << Logid() << "No calls left, entering idle mode";
    }
    next_real_time_ = absl::InfiniteFuture();
    return;
  }
  uint64 next_virtual_time_ms = FindNextVirtualTime(current_virtual_time_ms);
  auto delay =
      absl::Milliseconds(next_virtual_time_ms - current_virtual_time_ms);
  ScheduleAlarm(GetClock().TimeNow(), delay, next_virtual_time_ms, vlog_1);
}

// An alarm scheduled by this function supersedes all previously scheduled
// alarms. This is ensured through `scheduling_sequence_number_`.
void ScheduleAlarm(absl::Time now, absl::Duration delay,
                   uint64 virtual_time_ms,
                   bool vlog_1)
    ABSL_EXCLUSIVE_LOCKS_REQUIRED(mutex_) {
  next_real_time_ = now + delay;
  next_virtual_time_ms_ = virtual_time_ms;
  ++ref_count_;  // The Alarm holds a reference.
  ++scheduling_sequence_number_;
  if (vlog_1) {
    VLOG(1) << Logid() << "ScheduleAlarm. Time : "
            << absl::FormatTime("%M:%S.%E3f", now, absl::UTCTimeZone())
            << ", delay: " << delay << ", virtual time: " << virtual_time_ms
            << ", refs: " << ref_count_
            << ", seq: " << scheduling_sequence_number_
            << ", executor: " << executor_;
  }

  executor_->AddAfter(
      delay, new Alarm(this, virtual_time_ms, scheduling_sequence_number_));
}

Code size considerations

Performance encompasses more than just runtime speed. Sometimes it is worth considering the effects of software choices on the size of generated code. Large code size means longer compile and link times, bloated binaries, more memory usage, more icache pressure, and other sometimes negative effects on microarchitectural structures like branch predictors, etc. Thinking about these issues is especially important when writing low-level library code that will be used in many places, or when writing templated code that you expect will be instantiated for many different types.

The techniques that are useful for reducing code size vary significantly across programming languages. Here are some techniques that we have found useful for C++ code (which can suffer from an over-use of templates and inlining).

Trim commonly inlined code

Widely called functions combined with inlining can have a dramatic effect on code size.

Speed up TF_CHECK_OK.

Avoid creating Ok object, and save code space by doing complex formatting of fatal error message out of line instead of at every call site.

status.h

#define TF_CHECK_OK(val) CHECK_EQ(::tensorflow::Status::OK(), (val))
#define TF_QCHECK_OK(val) QCHECK_EQ(::tensorflow::Status::OK(), (val))
extern tensorflow::string* TfCheckOpHelperOutOfLine(
    const ::tensorflow::Status& v, const char* msg);
inline tensorflow::string* TfCheckOpHelper(::tensorflow::Status v,
                                           const char* msg) {
  if (v.ok()) return nullptr;
  return TfCheckOpHelperOutOfLine(v, msg);
}
#define TF_CHECK_OK(val)                                           \
  while (tensorflow::string* _result = TfCheckOpHelper(val, #val)) \
  LOG(FATAL) << *(_result)
#define TF_QCHECK_OK(val)                                          \
  while (tensorflow::string* _result = TfCheckOpHelper(val, #val)) \
  LOG(QFATAL) << *(_result)

status.cc

string* TfCheckOpHelperOutOfLine(const ::tensorflow::Status& v,
                                 const char* msg) {
  string r("Non-OK-status: ");
  r += msg;
  r += " status: ";
  r += v.ToString();
  // Leaks string but this is only to be used in a fatal error message
  return new string(r);
}

Shrink each RETURN_IF_ERROR call site by 79 bytes of code.

  1. Added special adaptor class for use by just RETURN_IF_ERROR.
  2. Do not construct/destruct StatusBuilder on fast path of RETURN_IF_ERROR.
  3. Do not inline some StatusBuilder methods since they are now no longer needed on the fast path.
  4. Avoid unnecessary ~Status call.

Improve performance of CHECK_GE by 4.5X and shrink code size from 125 bytes to 77 bytes.

logging.h

struct CheckOpString {
  CheckOpString(string* str) : str_(str) { }
  ~CheckOpString() { delete str_; }
  operator bool() const { return str_ == NULL; }
  string* str_;
};
...
#define DEFINE_CHECK_OP_IMPL(name, op) \
  template <class t1, class t2> \
  inline string* Check##name##Impl(const t1& v1, const t2& v2, \
                                   const char* names) { \
    if (v1 op v2) return NULL; \
    else return MakeCheckOpString(v1, v2, names); \
  } \
  string* Check##name##Impl(int v1, int v2, const char* names);
DEFINE_CHECK_OP_IMPL(EQ, ==)
DEFINE_CHECK_OP_IMPL(NE, !=)
DEFINE_CHECK_OP_IMPL(LE, <=)
DEFINE_CHECK_OP_IMPL(LT, < )
DEFINE_CHECK_OP_IMPL(GE, >=)
DEFINE_CHECK_OP_IMPL(GT, > )
#undef DEFINE_CHECK_OP_IMPL
struct CheckOpString {
  CheckOpString(string* str) : str_(str) { }
  // No destructor: if str_ is non-NULL, we're about to LOG(FATAL),
  // so there's no point in cleaning up str_.
  operator bool() const { return str_ == NULL; }
  string* str_;
};
...
extern string* MakeCheckOpStringIntInt(int v1, int v2, const char* names);

template<int, int>
string* MakeCheckOpString(const int& v1, const int& v2, const char* names) {
  return MakeCheckOpStringIntInt(v1, v2, names);
}
...
#define DEFINE_CHECK_OP_IMPL(name, op) \
  template <class t1, class t2> \
  inline string* Check##name##Impl(const t1& v1, const t2& v2, \
                                   const char* names) { \
    if (v1 op v2) return NULL; \
    else return MakeCheckOpString(v1, v2, names); \
  } \
  inline string* Check##name##Impl(int v1, int v2, const char* names) { \
    if (v1 op v2) return NULL; \
    else return MakeCheckOpString(v1, v2, names); \
  }
DEFINE_CHECK_OP_IMPL(EQ, ==)
DEFINE_CHECK_OP_IMPL(NE, !=)
DEFINE_CHECK_OP_IMPL(LE, <=)
DEFINE_CHECK_OP_IMPL(LT, < )
DEFINE_CHECK_OP_IMPL(GE, >=)
DEFINE_CHECK_OP_IMPL(GT, > )
#undef DEFINE_CHECK_OP_IMPL

logging.cc

string* MakeCheckOpStringIntInt(int v1, int v2, const char* names) {
  strstream ss;
  ss << names << " (" << v1 << " vs. " << v2 << ")";
  return new string(ss.str(), ss.pcount());
}

Inline with care

Inlining can often improve performance, but sometimes it can increase code size without a corresponding performance payoff (and in some case even a performance loss due to increased instruction cache pressure).

Reduce inlining in TensorFlow.

The change stops inlining many non-performance-sensitive functions (e.g., error paths and op registration code). Furthermore, slow paths of some performance-sensitive functions are moved into non-inlined functions.

These changes reduce the size of tensorflow symbols in a typical binary by 12.2% (8814545 bytes down to 7740233 bytes).

Protocol buffer library change. Avoid expensive inlined code space for encoding message length for messages ≥ 128 bytes and instead do a procedure call to a shared out-of-line routine.

This not only makes important large binaries smaller, but also faster.

Bytes of generated code per line of a heavily inlined routine in one large binary. The first number is the total bytes generated for a particular source line, including all locations where that code has been inlined.

Before:

.           0   1825 template <typename MessageType>
.           0   1826 inline uint8* WireFormatLite::InternalWriteMessage(
.           0   1827     int field_number, const MessageType& value, uint8* target,
.           0   1828     io::EpsCopyOutputStream* stream) {
>>>    389246   1829   target = WriteTagToArray(field_number, WIRETYPE_LENGTH_DELIMITED, target);
>>>   5454640   1830   target = io::CodedOutputStream::WriteVarint32ToArray(
>>>    337837   1831       static_cast<uint32>(value.GetCachedSize()), target);
>>>   1285539   1832   return value._InternalSerialize(target, stream);
.           0   1833 }

The new codesize output with this change looks like:

.           0   1825 template <typename MessageType>
.           0   1826 inline uint8* WireFormatLite::InternalWriteMessage(
.           0   1827     int field_number, const MessageType& value, uint8* target,
.           0   1828     io::EpsCopyOutputStream* stream) {
>>>    450612   1829   target = WriteTagToArray(field_number, WIRETYPE_LENGTH_DELIMITED, target);
>>       9609   1830   target = io::CodedOutputStream::WriteVarint32ToArrayOutOfLine(
>>>    434668   1831       static_cast<uint32>(value.GetCachedSize()), target);
>>>   1597394   1832   return value._InternalSerialize(target, stream);
.           0   1833 }

coded_stream.h

class PROTOBUF_EXPORT CodedOutputStream {
  ...
  // Like WriteVarint32()  but writing directly to the target array, and with the
  // less common-case paths being out of line rather than inlined.
  static uint8* WriteVarint32ToArrayOutOfLine(uint32 value, uint8* target);
  ...
};
...
inline uint8* CodedOutputStream::WriteVarint32ToArrayOutOfLine(uint32 value,
                                                               uint8* target) {
  target[0] = static_cast<uint8>(value);
  if (value < 0x80) {
    return target + 1;
  } else {
    return WriteVarint32ToArrayOutOfLineHelper(value, target);
  }
}

coded_stream.cc

uint8* CodedOutputStream::WriteVarint32ToArrayOutOfLineHelper(uint32 value,
                                                              uint8* target) {
  DCHECK_GE(value, 0x80);
  target[0] |= static_cast<uint8>(0x80);
  value >>= 7;
  target[1] = static_cast<uint8>(value);
  if (value < 0x80) {
    return target + 2;
  }
  target += 2;
  do {
    // Turn on continuation bit in the byte we just wrote.
    target[-1] |= static_cast<uint8>(0x80);
    value >>= 7;
    *target = static_cast<uint8>(value);
    ++target;
  } while (value >= 0x80);
  return target;
}

Reduce absl::flat_hash_set and absl::flat_hash_map code size.

  1. Extract code that does not depend on the specific hash table type into common (non-inlined) functions.
  2. Place ABSL_ATTRIBUTE_NOINLINE directives judiciously.
  3. Out-of-line some slow paths.

Reduces sizes of some large binaries by ~0.5%.

Do not inline string allocation and deallocation when not using protobuf arenas.

public/arenastring.h

  if (IsDefault(default_value)) {
    std::string* new_string = new std::string();
    tagged_ptr_.Set(new_string);
    return new_string;
  } else {
    return UnsafeMutablePointer();
  }
}
  if (IsDefault(default_value)) {
    return SetAndReturnNewString();
  } else {
    return UnsafeMutablePointer();
  }
}

internal/arenastring.cc

std::string* ArenaStringPtr::SetAndReturnNewString() {
  std::string* new_string = new std::string();
  tagged_ptr_.Set(new_string);
  return new_string;
}

Avoid inlining some routines. Create variants of routines that take 'const char*' rather than 'const std::string&' to avoid std::string construction code at every call site.

op.h

class OpDefBuilderWrapper {
 public:
  explicit OpDefBuilderWrapper(const char name[]) : builder_(name) {}
  OpDefBuilderWrapper& Attr(std::string spec) {
    builder_.Attr(std::move(spec));
    return *this;
  }
  OpDefBuilderWrapper& Input(std::string spec) {
    builder_.Input(std::move(spec));
    return *this;
  }
  OpDefBuilderWrapper& Output(std::string spec) {
    builder_.Output(std::move(spec));
    return *this;
  }
class OpDefBuilderWrapper {
 public:
  explicit OpDefBuilderWrapper(const char name[]) : builder_(name) {}
  OpDefBuilderWrapper& Attr(std::string spec) {
    builder_.Attr(std::move(spec));
    return *this;
  }
  OpDefBuilderWrapper& Attr(const char* spec) TF_ATTRIBUTE_NOINLINE {
    return Attr(std::string(spec));
  }
  OpDefBuilderWrapper& Input(std::string spec) {
    builder_.Input(std::move(spec));
    return *this;
  }
  OpDefBuilderWrapper& Input(const char* spec) TF_ATTRIBUTE_NOINLINE {
    return Input(std::string(spec));
  }
  OpDefBuilderWrapper& Output(std::string spec) {
    builder_.Output(std::move(spec));
    return *this;
  }
  OpDefBuilderWrapper& Output(const char* spec) TF_ATTRIBUTE_NOINLINE {
    return Output(std::string(spec));
  }

Reduce template instantiations

Templated code can be duplicated for every possible combination of template arguments when it is instantiated.

Replace template argument with a regular argument.

Changed a large routine templated on a bool to instead take the bool as an extra argument. (The bool was only being used once to select one of two string constants, so a run-time check was just fine.) This reduced the number of instantiations of the large routine from 287 to 143.

sharding_util_ops.cc

template <bool Split>
Status GetAndValidateAttributes(OpKernelConstruction* ctx,
                                std::vector<int32>& num_partitions,
                                int& num_slices, std::vector<int32>& paddings,
                                bool& has_paddings) {
  absl::string_view num_partitions_attr_name =
      Split ? kNumSplitsAttrName : kNumConcatsAttrName;
      ...
  return OkStatus();
}
Status GetAndValidateAttributes(bool split, OpKernelConstruction* ctx,
                                std::vector<int32>& num_partitions,
                                int& num_slices, std::vector<int32>& paddings,
                                bool& has_paddings) {
  absl::string_view num_partitions_attr_name =
      split ? kNumSplitsAttrName : kNumConcatsAttrName;
      ...
  return OkStatus();
}

Move bulky code from templated constructor to a non-templated shared base class constructor.

Also reduce number of template instantiations from one for every combination of <T, Device, Rank> to one for every <T> and every <Rank>.

sharding_util_ops.cc

template <typename Device, typename T>
class XlaSplitNDBaseOp : public OpKernel {
 public:
  explicit XlaSplitNDBaseOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
    OP_REQUIRES_OK(
        ctx, GetAndValidateAttributes(/*split=*/true, ctx, num_splits_,
                                      num_slices_, paddings_, has_paddings_));
  }
// Shared base class to save code space
class XlaSplitNDShared : public OpKernel {
 public:
  explicit XlaSplitNDShared(OpKernelConstruction* ctx) TF_ATTRIBUTE_NOINLINE
      : OpKernel(ctx),
        num_slices_(1),
        has_paddings_(false) {
    GetAndValidateAttributes(/*split=*/true, ctx, num_splits_, num_slices_,
                             paddings_, has_paddings_);
  }

Reduce generated code size for absl::flat_hash_set and absl::flat_hash_map.

  • Extract code that does not depend on the specific hash table type into common (non-inlined) functions.
  • Place ABSL_ATTRIBUTE_NOINLINE directives judiciously.
  • Move some slow paths out of line.

Reduce container operations

Consider the impact of map and other container operations since each call to such an operation can produce large amounts of generated code.

Turn a long run of individual map insertion calls that initialize a hash table of emoji characters into a single bulk insert operation (188KB of text down to 360 bytes in a library linked into many binaries). 😊

textfallback_init.h

inline void AddEmojiFallbacks(TextFallbackMap *map) {
  (*map)[0xFE000] = &kFE000;
  (*map)[0xFE001] = &kFE001;
  (*map)[0xFE002] = &kFE002;
  (*map)[0xFE003] = &kFE003;
  (*map)[0xFE004] = &kFE004;
  (*map)[0xFE005] = &kFE005;
  ...
  (*map)[0xFEE7D] = &kFEE7D;
  (*map)[0xFEEA0] = &kFEEA0;
  (*map)[0xFE331] = &kFE331;
};
inline void AddEmojiFallbacks(TextFallbackMap *map) {
#define PAIR(x) {0x##x, &k##x}
  // clang-format off
  map->insert({
    PAIR(FE000),
    PAIR(FE001),
    PAIR(FE002),
    PAIR(FE003),
    PAIR(FE004),
    PAIR(FE005),
    ...
    PAIR(FEE7D),
    PAIR(FEEA0),
    PAIR(FE331)});
  // clang-format on
#undef PAIR
};

Stop inlining a heavy user of InlinedVector operations.

Moved very long routine that was being inlined from .h file to .cc (no real performance benefit from inlining this).

reduction_ops_common.h

Status Simplify(const Tensor& data, const Tensor& axis,
                const bool keep_dims) {
  ... Eighty line routine body ...
}
Status Simplify(const Tensor& data, const Tensor& axis, const bool keep_dims);

Parallelization and synchronization

Exploit parallelism

Modern machines have many cores, and they are often underutilized. Expensive work may therefore be completed faster by parallelizing it. The most common approach is to process different items in parallel and combine the results when done. Typically, the items are first partitioned into batches to avoid paying the cost of running something in parallel per item.
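
The examples below rely on internal thread-pool and executor abstractions; as a generic, hypothetical sketch of the same pattern using only the C++ standard library (ParallelSum is a made-up example):

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Partition the items into one batch per thread, process the batches in
// parallel, and combine the per-batch results at the end.
int64_t ParallelSum(const std::vector<int64_t>& items, int num_threads) {
  if (num_threads <= 1 || items.size() < 2) {
    return std::accumulate(items.begin(), items.end(), int64_t{0});
  }
  const size_t batch = (items.size() + num_threads - 1) / num_threads;
  std::vector<int64_t> partial(num_threads, 0);
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      const size_t begin = t * batch;
      const size_t end = std::min(items.size(), begin + batch);
      int64_t sum = 0;
      for (size_t i = begin; i < end; ++i) sum += items[i];
      partial[t] = sum;  // each worker writes only its own slot
    });
  }
  for (std::thread& w : workers) w.join();
  return std::accumulate(partial.begin(), partial.end(), int64_t{0});
}

If a thread pool is already available (as in the examples below), submitting one closure per batch to it is usually preferable to spawning raw threads.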

Four-way parallelization improves the rate of encoding tokens by ~3.6x.

blocked-token-coder.cc

MutexLock l(&encoder_threads_lock);
if (encoder_threads == NULL) {
  encoder_threads = new ThreadPool(NumCPUs());
  encoder_threads->SetStackSize(262144);
  encoder_threads->StartWorkers();
}
encoder_threads->Add
    (NewCallback(this,
                 &BlockedTokenEncoder::EncodeRegionInThread,
                 region_tokens, N, region,
                 stats,
                 controller_->GetClosureWithCost
                 (NewCallback(&DummyCallback), N)));

Parallelization improves decoding performance by 5x.

coding.cc

for (int c = 0; c < clusters->size(); c++) {
  RET_CHECK_OK(DecodeBulkForCluster(...));
}
struct SubTask {
  absl::Status result;
  absl::Notification done;
};

std::vector<SubTask> tasks(clusters->size());
for (int c = 0; c < clusters->size(); c++) {
  options_.executor->Schedule([&, c] {
    tasks[c].result = DecodeBulkForCluster(...);
    tasks[c].done.Notify();
  });
}
for (int c = 0; c < clusters->size(); c++) {
  tasks[c].done.WaitForNotification();
}
for (int c = 0; c < clusters->size(); c++) {
  RETURN_IF_ERROR(tasks[c].result);
}

The effect on system performance should be measured carefully – if spare CPU is not available, or if memory bandwidth is saturated, parallelization may not help, or may even hurt.

Amortize lock acquisition

Avoid fine-grained locking to reduce the cost of Mutex operations in hot paths. Caveat: this should only be done if the change does not increase lock contention.

Acquire lock once to free entire tree of query nodes, rather than reacquiring lock for every node in tree.

mustang-query.cc

// Pool of query nodes
ThreadSafeFreeList<MustangQuery> pool_(256);
...
void MustangQuery::Release(MustangQuery* node) {
  if (node == NULL)
    return;
  for (int i=0; i < node->children_->size(); ++i)
    Release((*node->children_)[i]);
  node->children_->clear();
  pool_.Delete(node);
}
// Pool of query nodes
Mutex pool_lock_;
FreeList<MustangQuery> pool_(256);
...
void MustangQuery::Release(MustangQuery* node) {
  if (node == NULL)
    return;
  MutexLock l(&pool_lock_);
  ReleaseLocked(node);
}

void MustangQuery::ReleaseLocked(MustangQuery* node) {
#ifndef NDEBUG
  pool_lock_.AssertHeld();
#endif
  if (node == NULL)
    return;
  for (int i=0; i < node->children_->size(); ++i)
    ReleaseLocked((*node->children_)[i]);
  node->children_->clear();
  pool_.Delete(node);
}

Keep critical sections short

Avoid expensive work inside critical sections. In particular, watch out for innocuous looking code that might be doing RPCs or accessing files.

Reduce number of cache lines touched in critical section.

Careful data structure adjustments reduce the number of cache lines accessed significantly and improve the performance of an ML training run by 3.3%.

  1. Precompute some per-node type properties as bits within the NodeItem data structure, meaning that we can avoid touching the Node* object for outgoing edges in the critical section.
  2. Change ExecutorState::ActivateNodes to use the NodeItem of the destination node for each outgoing edge, rather than touching fields in the *item->node object. Typically this means that we touch 1 or 2 cache lines total for accessing the needed edge data, rather than ~2 + O(num_outgoing edges) (and for large graphs with many cores executing them there is also less TLB pressure).

Avoid RPC while holding Mutex.

trainer.cc

{
  // Notify the parameter server that we are starting.
  MutexLock l(&lock_);
  model_ = model;
  MaybeRecordProgress(last_global_step_);
}
bool should_start_record_progress = false;
int64 step_for_progress = -1;
{
  // Notify the parameter server that we are starting.
  MutexLock l(&lock_);
  model_ = model;
  should_start_record_progress = ShouldStartRecordProgress();
  step_for_progress = last_global_step_;
}
if (should_start_record_progress) {
  StartRecordProgress(step_for_progress);
}

Also, be wary of expensive destructors that will run before a Mutex is unlocked (this can often happen when the Mutex unlock is triggered by a ~MutexLock at the end of a scope). Declaring objects with expensive destructors before the MutexLock may help (assuming doing so is thread-safe).
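
For example, here is a minimal sketch of the pattern (the names, including BigItem and WorkQueue, are hypothetical): only a cheap swap happens inside the critical section, and the expensive destruction happens after the lock has been released.

#include <vector>

#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

struct BigItem {
  std::vector<char> payload;  // stand-in for something expensive to destroy
};

class WorkQueue {
 public:
  void Flush() {
    // Declared before the MutexLock, so its destructor (which frees all of
    // the drained items) runs after the lock has been released.
    std::vector<BigItem> drained;
    {
      absl::MutexLock l(&mu_);
      drained.swap(queue_);  // O(1) swap inside the critical section
    }
    // The drained items are destroyed here, outside the critical section.
  }

 private:
  absl::Mutex mu_;
  std::vector<BigItem> queue_ ABSL_GUARDED_BY(mu_);
};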

Reduce contention by sharding

Sometimes a data structure protected by a Mutex that is exhibiting high contention can be safely split into multiple shards, each shard with its own Mutex. (Note: this requires that there are no cross-shard invariants between the different shards.)

Shard a cache 16 ways, which improves throughput under a multi-threaded load by ~2x.

cache.cc

class ShardedLRUCache : public Cache {
 private:
  LRUCache shard_[kNumShards];
  port::Mutex id_mutex_;
  uint64_t last_id_;

  static inline uint32_t HashSlice(const Slice& s) {
    return Hash(s.data(), s.size(), 0);
  }

  static uint32_t Shard(uint32_t hash) {
    return hash >> (32 - kNumShardBits);
  }
  ...
  virtual Handle* Lookup(const Slice& key) {
    const uint32_t hash = HashSlice(key);
    return shard_[Shard(hash)].Lookup(key, hash);
  }

Shard Spanner data structure for tracking calls.

transaction_manager.cc

absl::MutexLock l(&active_calls_in_mu_);
ActiveCallMap::const_iterator iter = active_calls_in_.find(m->tid());
if (iter != active_calls_in_.end()) {
  iter->second.ExtractElements(&m->tmp_calls_);
}
ActiveCalls::LockedShard shard(active_calls_in_, m->tid());
const ActiveCallMap& active_calls_map = shard.active_calls_map();
ActiveCallMap::const_iterator iter = active_calls_map.find(m->tid());
if (iter != active_calls_map.end()) {
  iter->second.ExtractElements(&m->tmp_calls_);
}

If the data structure in question is a map, consider using a concurrent hash map implementation instead.

Be careful with the information used for shard selection. If, for example, you use some bits of a hash value for shard selection and then those same bits end up being used again later, the latter use may perform poorly since it sees a skewed distribution of hash values.

Fix information used for shard selection to prevent hash table issues.

netmon_map_impl.h

ConnectionBucket* GetBucket(Index index) {
  // Rehash the hash to make sure we are not partitioning the buckets based on
  // the original hash. If num_buckets_ is a power of 2 that would drop the
  // entropy of the buckets.
  size_t original_hash = absl::Hash<Index>()(index);
  int hash = absl::Hash<size_t>()(original_hash) % num_buckets_;
  return &buckets_[hash];
}
ConnectionBucket* GetBucket(Index index) {
  absl::Hash<std::pair<Index, size_t>> hasher{};
  // Combine the hash with 42 to prevent shard selection using the same bits
  // as the underlying hashtable.
  return &buckets_[hasher({index, 42}) % num_buckets_];
}

Shard Spanner data structure used for tracking calls.

This CL partitions the ActiveCallMap into 64 shards. Each shard is protected by a separate mutex. A given transaction will be mapped to exactly one shard. A new interface LockedShard(tid) is added for accessing the ActiveCallMap for a transaction in a thread-safe manner. Example usage:

transaction_manager.cc

{
  absl::MutexLock l(&active_calls_in_mu_);
  delayed_locks_timer_ring_.Add(delayed_locks_flush_time_ms, tid);
}
{
  ActiveCalls::LockedShard shard(active_calls_in_, tid);
  shard.delayed_locks_timer_ring().Add(delayed_locks_flush_time_ms, tid);
}

The results show a 69% reduction in overall wall-clock time when running the benchmark with 8192 fibers:

Before:

Benchmark                   Time(ns)        CPU(ns)     Iterations
------------------------------------------------------------------
BM_ActiveCalls/8k        11854633492     98766564676            10
BM_ActiveCalls/16k       26356203552    217325836709            10

After:

Benchmark                   Time(ns)        CPU(ns)     Iterations
------------------------------------------------------------------
BM_ActiveCalls/8k         3696794642     39670670110            10
BM_ActiveCalls/16k        7366284437     79435705713            10

SIMD Instructions

Explore whether handling multiple items at once using SIMD instructions available on modern CPUs can give speedups (e.g., see absl::flat_hash_map discussion below in Bulk Operations section).
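
As a small standalone illustration of the idea (separate from the hash table example referenced above), the sketch below counts occurrences of a byte 16 bytes at a time with SSE2 intrinsics; the function name is ours, and a production version would need per-platform dispatch.

#include <emmintrin.h>  // SSE2

#include <cstddef>
#include <cstdint>

// Counts occurrences of `target` in data[0, n), examining 16 bytes per
// iteration instead of one byte at a time.
size_t CountByte(const uint8_t* data, size_t n, uint8_t target) {
  const __m128i needle = _mm_set1_epi8(static_cast<char>(target));
  size_t count = 0;
  size_t i = 0;
  for (; i + 16 <= n; i += 16) {
    const __m128i chunk =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
    // One bit per byte position that matched `target`.
    const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
    count += __builtin_popcount(mask);  // GCC/Clang builtin
  }
  for (; i < n; ++i) {
    count += (data[i] == target);  // scalar tail
  }
  return count;
}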

Reduce false sharing

If different threads access different mutable data, consider placing the different data items on different cache lines, e.g., in C++ using the alignas directive. However, these directives are easy to misuse and may increase object sizes significantly, so make sure performance measurements justify their use.

Segregate commonly mutated fields in a different cache line than other fields.

histogram.h

HistogramOptions options_;
...
internal::HistogramBoundaries *boundaries_;
...
std::vector<double> buckets_;

double min_;             // Minimum.
double max_;             // Maximum.
double count_;           // Total count of occurrences.
double sum_;             // Sum of values.
double sum_of_squares_;  // Sum of squares of values.
...
RegisterVariableExporter *exporter_;
  HistogramOptions options_;
  ...
  internal::HistogramBoundaries *boundaries_;
  ...
  RegisterVariableExporter *exporter_;
  ...
  // Place the following fields in a dedicated cacheline as they are frequently
  // mutated, so we can avoid potential false sharing.
  ...
#ifndef SWIG
  alignas(ABSL_CACHELINE_SIZE)
#endif
  std::vector<double> buckets_;

  double min_;             // Minimum.
  double max_;             // Maximum.
  double count_;           // Total count of occurrences.
  double sum_;             // Sum of values.
  double sum_of_squares_;  // Sum of squares of values.

Reduce frequency of context switches

Process small work items inline instead of on device thread pool.

cast_op.cc

template <typename Device, typename Tout, typename Tin>
void CastMaybeInline(const Device& d, typename TTypes<Tout>::Flat o,
                     typename TTypes<Tin>::ConstFlat i) {
  if (o.size() * (sizeof(Tin) + sizeof(Tout)) < 16384) {
    // Small cast on a CPU: do inline
    o = i.template cast<Tout>();
  } else {
    o.device(d) = i.template cast<Tout>();
  }
}

Use buffered channels for pipelining

Channels can be unbuffered, which means that a writer blocks until a reader is ready to pick up an item. Unbuffered channels can be useful when the channel is being used for synchronization, but not when the channel is being used to increase parallelism.
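
Since this document's examples are C++, here is a minimal sketch of a buffered channel as a bounded blocking queue (the class is ours, not a specific library API): the producer only blocks once capacity items are queued, which lets producer and consumer stages overlap.

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// A bounded, buffered channel: Put() only blocks when `capacity` items are
// already queued, so the producer can run ahead of the consumer.
template <typename T>
class BufferedChannel {
 public:
  explicit BufferedChannel(size_t capacity) : capacity_(capacity) {}

  void Put(T item) {
    std::unique_lock<std::mutex> l(mu_);
    not_full_.wait(l, [&] { return queue_.size() < capacity_; });
    queue_.push_back(std::move(item));
    not_empty_.notify_one();
  }

  T Get() {
    std::unique_lock<std::mutex> l(mu_);
    not_empty_.wait(l, [&] { return !queue_.empty(); });
    T item = std::move(queue_.front());
    queue_.pop_front();
    not_full_.notify_one();
    return item;
  }

 private:
  const size_t capacity_;
  std::mutex mu_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::deque<T> queue_;
};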

Consider lock-free approaches

Lock-free data structures can sometimes outperform more conventional mutex-protected data structures. However, direct atomic variable manipulation can be dangerous, so prefer higher-level abstractions.

Use lock-free map to manage a cache of RPC channels.

Entries in an RPC stub cache are read thousands of times a second and modified rarely. Switching to an appropriate lock-free map reduces search latency by 3%-5%.

Use a fixed lexicon + lock-free hash map to speed up determining IsValidTokenId.

dynamic_token_class_manager.h

mutable Mutex mutex_;

// The density of this hash map is guaranteed by the fact that the
// dynamic lexicon reuses previously allocated TokenIds before trying
// to allocate new ones.
dense_hash_map<TokenId, common::LocalTokenClassId> tid_to_cid_
    GUARDED_BY(mutex_);
// Read accesses to this hash-map should be done using
// 'epoch_gc_'::(EnterFast / LeaveFast). The writers should periodically
// GC the deleted entries, by simply invoking LockFreeHashMap::CreateGC.
typedef util::gtl::LockFreeHashMap<TokenId, common::LocalTokenClassId>
    TokenIdTokenClassIdMap;
TokenIdTokenClassIdMap tid_to_cid_;

Protocol Buffer advice

Protobufs are a convenient representation of data, especially if the data will be sent over the wire or stored persistently. However, they can have significant performance costs. For example, a piece of code that fills in a list of 1000 points and then sums up the Y coordinates, speeds up by a factor of 20 when converted from protobufs to a C++ std::vector of structs!

Benchmark code for both versions.

name                old time/op  new time/op  delta
BenchmarkIteration  17.4µs ± 5%   0.8µs ± 1%  -95.30%  (p=0.000 n=11+12)

Protobuf version:

message PointProto {
  int32 x = 1;
  int32 y = 2;
}
message PointListProto {
  repeated PointProto points = 1;
}
void SumProto(const PointListProto& vec) {
  int sum = 0;
  for (const PointProto& p : vec.points()) {
    sum += p.y();
  }
  ABSL_VLOG(1) << sum;
}

void BenchmarkIteration() {
  PointListProto points;
  points.mutable_points()->Reserve(1000);
  for (int i = 0; i < 1000; i++) {
    PointProto* p = points.add_points();
    p->set_x(i);
    p->set_y(i * 2);
  }
  SumProto(points);
}

Non-protobuf version:

struct PointStruct {
  int x;
  int y;
};

void SumVector(const std::vector<PointStruct>& vec) {
  int sum = 0;
  for (const PointStruct& p : vec) {
    sum += p.y;
  }
  ABSL_VLOG(1) << sum;
}

void BenchmarkIteration() {
  std::vector<PointStruct> points;
  points.reserve(1000);
  for (int i = 0; i < 1000; i++) {
    points.push_back({i, i * 2});
  }
  SumVector(points);
}

In addition, the protobuf version adds a few kilobytes of code and data to the binary, which may not seem like much, but adds up quickly in systems with many protobuf types. The extra code and data also create i-cache and d-cache pressure, which hurts performance.

Here are some tips related to protobuf performance:

Do not use protobufs unnecessarily.

Given the factor of 20 performance difference described above, if some data is never serialized or parsed, you probably should not put it in a protocol buffer. The purpose of protocol buffers is to make it easy to serialize and deserialize data structures, but they can have significant code-size, memory, and CPU overheads. Do not use them if all you want are some of the other niceties like DebugString and copyability.

Avoid unnecessary message hierarchies.

Message hierarchy can be useful to organize information in a more readable fashion. However, the extra level of message hierarchy incurs overheads like memory allocations, function calls, cache misses, larger serialized messages, etc.

E.g., instead of:

message Foo {
  optional Bar bar = 1;
}
message Bar {
  optional Baz baz = 1;
}
message Baz {
  optional int32 count = 1;
}

Prefer:

message Foo {
  optional int32 count = 1;
}

A protocol buffer message corresponds to a message class in C++ generated code and emits a tag and the length of the payload on the wire. To carry an integer, the old form requires more allocations (and deallocations) and emits a larger amount of generated code. As a result, all protocol buffer operations (parsing, serialization, size, etc.) become more expensive, having to traverse the message tree. The new form does not have such overhead and is more efficient.

Use small field numbers for frequently occurring fields.

Protobufs use a variable length integer representation for the combination of field number and wire format (see the protobuf encoding documentation). This representation is 1 byte for field numbers between 1 and 15, and two bytes for field numbers between 16 and 2047. (Field numbers 2048 or greater should typically be avoided.)

Consider pre-reserving some small field numbers for future extension of performance-sensitive protobufs.
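
As a concrete sketch of why small field numbers matter (TagSizeBytes is our name; it follows the standard tag layout of tag = (field_number << 3) | wire_type, varint-encoded):

#include <cstdint>

// Bytes needed to encode the tag of a field with the given field number and
// wire type, where tag = (field_number << 3) | wire_type is varint-encoded.
int TagSizeBytes(uint32_t field_number, uint32_t wire_type) {
  uint32_t tag = (field_number << 3) | wire_type;
  int bytes = 1;
  while (tag >= 0x80) {  // each varint byte carries 7 bits of payload
    tag >>= 7;
    ++bytes;
  }
  return bytes;
}

// TagSizeBytes(1, 0)    == 1   (field numbers 1..15 use one tag byte)
// TagSizeBytes(16, 0)   == 2   (field numbers 16..2047 use two tag bytes)
// TagSizeBytes(2048, 0) == 3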

Choose carefully between int32, sint32, fixed32, and uint32 (and similarly for the 64 bit variants).

Generally, use int32 or int64, but use fixed32 or fixed64 for large values like hash codes, and sint32 or sint64 for values that are often negative.

A varint occupies fewer bytes to encode small integers and can save space at the cost of more expensive decoding. However, it can take up more space for negative or large values. In that case, using fixed32 or fixed64 (instead of uint32 or uint64) reduces size with much cheaper encoding and decoding. For small negative integers, use sint32 or sint64 instead of int32 or int64.
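
For intuition about the sint types, here is a sketch of the ZigZag mapping they use (ZigZagEncode32 is our name): small-magnitude negatives map to small unsigned values and therefore short varints, whereas an int32 or int64 field encodes a negative value as a full ten-byte varint.

#include <cstdint>

// ZigZag mapping used by sint32: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
// The result is then varint-encoded, so -1 costs one byte as a sint32
// instead of ten bytes as an int32.
uint32_t ZigZagEncode32(int32_t v) {
  return (static_cast<uint32_t>(v) << 1) ^ static_cast<uint32_t>(v >> 31);
}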

For proto2, pack repeated numeric fields by annotating them with [packed=true].

In proto2, repeated values are serialized as a sequence of (tag, value) pairs by default. This is inefficient because tags have to be decoded for every element.

Packed repeated primitives are serialized with the length of the payload first, followed by the values without tags. With fixed-width values, the final size is known the moment parsing starts, so the field's storage can be sized up front with no reallocation cost. With varints, the number of values in the payload is not known in advance, so reallocation costs may still be paid.

In proto3, repeated fields are packed by default.

Packed works best with fixed-width values like fixed32, fixed64, float, double, etc. since the entire encoded length can be predetermined by multiplying the number of elements by the fixed value size, instead of having to calculate the length of each individual element.

Use bytes instead of string for binary data and large values.

The string type holds UTF8-encoded text, and can sometimes require validation. The bytes type can hold an arbitrary sequence of bytes (non-text data) and is often more appropriate as well as more efficient than string.

Consider string_type = VIEW to avoid copying.

Copying a big string or bytes field during parsing is expensive. Such cost can often be avoided by marking the field with string_type = VIEW.

message Image {
  ...
  bytes jpeg_encoding = 4 [features.(pb.cpp).string_type=VIEW];
}

Without the VIEW annotation, when the protocol buffer is parsed, the potentially large field contents are copied from the serialized protocol buffer to a string object in memory. Depending on the number of string or bytes fields and the size of those fields, the overhead of copying can be significant.

Instead of copying the big binary blobs, routines like ParseFromStringWithAliasing use absl::string_view to reference the original backing string. Note that the backing string (the serialized protocol buffer) must outlive the protocol buffer instance that contains the alias.

Consider using Cord for large fields to reduce copying costs.

Annotating large bytes and string fields with [ctype=CORD] may reduce copying costs. This annotation changes the representation of the field from std::string to absl::Cord. absl::Cord uses reference counting and tree-based storage to reduce copying and appending costs. If a protocol buffer is serialized to a cord, parsing a string or bytes field with [ctype=CORD] can avoid copying the field contents.

message Document {
  ...
  bytes html = 4 [ctype = CORD];
}

Performance of a Cord field depends on length distribution and access patterns. Use benchmarks to validate such changes.

Use protobuf arenas in C++ code.

Consider using arenas to save allocation and deallocation costs, especially for protobufs containing repeated, string, or message fields.

Message and string fields are heap-allocated (even if the top-level protocol buffer object is stack-allocated). If a protocol buffer message has a lot of sub-message fields and string fields, allocation and deallocation costs can be significant. Arenas amortize allocation costs and make deallocation virtually free. They also improve memory locality by allocating from contiguous chunks of memory.
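
A minimal sketch of arena usage (MyMessage and Handle are hypothetical; recent protobuf releases spell the allocation function google::protobuf::Arena::Create, older ones Arena::CreateMessage):

#include <string>
#include <vector>

#include "google/protobuf/arena.h"

void ProcessBatch(const std::vector<std::string>& serialized) {
  google::protobuf::Arena arena;
  for (const std::string& bytes : serialized) {
    // The message and its sub-messages/strings are allocated from the
    // arena's contiguous chunks rather than individually on the heap.
    MyMessage* msg = google::protobuf::Arena::Create<MyMessage>(&arena);
    if (!msg->ParseFromString(bytes)) continue;
    Handle(*msg);  // hypothetical consumer
  }
  // No per-message deletes: the arena releases everything when it is
  // destroyed at the end of the scope.
}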

Keep .proto files small

Do not put too many messages in a single .proto file. Once you rely on anything at all from a .proto file, the entire file will get pulled in by the linker even if it's mostly unused. This increases build times and binary sizes. You can use extensions and Any to avoid creating hard dependencies on big .proto files with many message types.

Consider storing protocol buffers in serialized form, even in memory.

In-memory protobuf objects have a large memory footprint (often 5x the wire format size), potentially spread across many cache lines. So if your application is going to keep many protobuf objects live for long periods of time, consider storing them in serialized form.

Avoid protobuf map fields.

Protobuf map fields have performance problems that usually outweigh the small syntactic convenience they provide. Prefer using non-protobuf maps initialized from protobuf contents:

msg.proto

map<string, bytes> env_variables = 5;
message Var {
  string key = 1;
  bytes value = 2;
}
repeated Var env_variables = 5;
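
A sketch of initializing a fast native map from the repeated field above (Msg stands in for the message containing env_variables):

#include <string>

#include "absl/container/flat_hash_map.h"

// Built once after parsing; queried on the hot path.
absl::flat_hash_map<std::string, std::string> BuildEnvMap(const Msg& msg) {
  absl::flat_hash_map<std::string, std::string> env;
  env.reserve(msg.env_variables_size());
  for (const auto& var : msg.env_variables()) {
    env.emplace(var.key(), var.value());
  }
  return env;
}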

Use protobuf message definition with a subset of the fields.

If you want to access only a few fields of a large message type, consider defining your own protocol buffer message type that mimics the original type, but only defines the fields that you care about. Here's an example:

message FullMessage {
  optional int32 field1 = 1;
  optional BigMessage field2 = 2;
  optional int32 field3 = 3;
  repeated AnotherBigMessage field4 = 4;
  ...
  optional int32 field100 = 100;
}
message SubsetMessage {
  optional int32 field3 = 3;
  optional int32 field88 = 88;
}

By parsing a serialized FullMessage into a SubsetMessage, only two out of a hundred fields are parsed and others are treated as unknown fields. Consider using APIs that discard unknown fields to improve performance even more when appropriate.
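
A sketch of what this looks like at a call site (ExtractField3 is our name; the serialized bytes come from a FullMessage):

#include <cstdint>
#include <string>

bool ExtractField3(const std::string& serialized, int32_t* out) {
  // `serialized` holds a serialized FullMessage. Parsing it as a
  // SubsetMessage materializes only field3 and field88; the other fields
  // are carried along (or discarded) as unknown fields.
  SubsetMessage subset;
  if (!subset.ParseFromString(serialized)) return false;
  *out = subset.field3();
  return true;
}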

Reuse protobuf objects when possible.

Declare protobuf objects outside loops so that their allocated storage can be reused across loop iterations.
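
A minimal sketch (Record and Process are hypothetical):

#include <string>
#include <vector>

void ProcessAll(const std::vector<std::string>& serialized_records) {
  // Declared once outside the loop: each ParseFromString clears the message
  // but reuses the string and repeated-field storage allocated on earlier
  // iterations instead of reallocating it.
  Record record;
  for (const std::string& bytes : serialized_records) {
    if (!record.ParseFromString(bytes)) continue;
    Process(record);
  }
}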

C++-Specific advice

absl::flat_hash_map (and set)

Abseil hash tables usually outperform C++ standard library containers such as std::map and std::unordered_map.

Speed up LanguageFromCode (use absl::flat_hash_map instead of a __gnu_cxx::hash_map).

languages.cc

class CodeToLanguage
    ...
    : public __gnu_cxx::hash_map<absl::string_view, i18n::languages::Language,
                                 CodeHash, CodeCompare> {
class CodeToLanguage
    ...
    : public absl::flat_hash_map<absl::string_view, i18n::languages::Language,
                                 CodeHash, CodeCompare> {

Benchmark results:

name               old time/op  new time/op  delta
BM_CodeToLanguage  19.4ns ± 1%  10.2ns ± 3%  -47.47%  (p=0.000 n=8+10)

Speed up stats publish/unpublish (an older change, so uses dense_hash_map instead of absl::flat_hash_map, which did not exist at the time).

publish.cc

typedef hash_map<uint64, Publication*> PublicationMap;
static PublicationMap* publications = NULL;
typedef dense_hash_map<uint64, Publication*> PublicationMap;
static PublicationMap* publications GUARDED_BY(mu) = NULL;

Use dense_hash_map instead of hash_map for keeping track of SelectServer alarms (would use absl::flat_hash_map today).

alarmer.h

typedef hash_map<int, Alarm*> AlarmList;
typedef dense_hash_map<int, Alarm*> AlarmList;

absl::btree_map/absl::btree_set

absl::btree_map and absl::btree_set store multiple entries per tree node. This has a number of advantages over ordered C++ standard library containers such as std::map. First, the pointer overhead of pointing to child tree nodes is often significantly reduced. Second, because the entries or key/values are stored consecutively in memory for a given btree tree node, cache efficiency is often significantly better.

Use btree_set instead of std::set to represent a very heavily used work-queue.

register_allocator.h

using container_type = std::set<WorklistItem>;
using container_type = absl::btree_set<WorklistItem>;

util::bitmap::InlinedBitVector

util::bitmap::InlinedBitVector can store short bit-vectors inline, and therefore can often be a better choice than std::vector<bool> or other bitmap types.

Use InlinedBitVector instead of std::vector<bool>, and then use FindNextSetBit to find the next item of interest.

block_encoder.cc

vector<bool> live_reads(nreads);
...
for (int offset = 0; offset < b_.block_width(); offset++) {
  ...
  for (int r = 0; r < nreads; r++) {
    if (live_reads[r]) {
util::bitmap::InlinedBitVector<4096> live_reads(nreads);
...
for (int offset = 0; offset < b_.block_width(); offset++) {
  ...
  for (size_t r = 0; live_reads.FindNextSetBit(&r); r++) {
    DCHECK(live_reads[r]);

absl::InlinedVector

absl::InlinedVector stores a small number of elements inline (configurable via the second template argument). This enables small vectors up to this number of elements to generally have better cache efficiency and also to avoid allocating a backing store array at all when the number of elements is small.

Use InlinedVector instead of std::vector in various places.

bundle.h

class Bundle {
 public:
 ...
 private:
  // Sequence of (slotted instruction, unslotted immediate operands).
  std::vector<InstructionRecord> instructions_;
  ...
};
class Bundle {
 public:
 ...
 private:
  // Sequence of (slotted instruction, unslotted immediate operands).
  absl::InlinedVector<InstructionRecord, 2> instructions_;
  ...
};

gtl::vector32

gtl::vector32 is a customized vector type that saves space by only supporting sizes that fit in 32 bits.

Simple type change saves ~8TiB of memory in Spanner.

table_ply.h

class TablePly {
    ...
    // Returns the set of data columns stored in this file for this table.
    const std::vector<FamilyId>& modified_data_columns() const {
      return modified_data_columns_;
    }
    ...
   private:
    ...
    std::vector<FamilyId> modified_data_columns_;  // Data columns in the table.
#include "util/gtl/vector32.h"
    ...
    // Returns the set of data columns stored in this file for this table.
    absl::Span<const FamilyId> modified_data_columns() const {
      return modified_data_columns_;
    }
    ...

    ...
    // Data columns in the table.
    gtl::vector32<FamilyId> modified_data_columns_;

gtl::small_map

gtl::small_map uses an inline array to store up to a certain number of unique key-value-pair elements, but upgrades itself automatically to be backed by a user-specified map type when it runs out of space.

Use gtl::small_map in tflite_model.

tflite_model.cc

using ChoiceIdToContextMap = gtl::flat_hash_map<int, TFLiteContext*>;
using ChoiceIdToContextMap =
    gtl::small_map<gtl::flat_hash_map<int, TFLiteContext*>>;

gtl::small_ordered_set

gtl::small_ordered_set is an optimization for associative containers (such as std::set or absl::btree_multiset). It uses a fixed array to store a certain number of elements, then reverts to using a set or multiset when it runs out of space. For sets that are typically small, this can be considerably faster than using something like set directly, as set is optimized for large data sets. This change shrinks cache footprint and reduces critical section length.

Use gtl::small_ordered_set to hold set of listeners.

broadcast_stream.h

class BroadcastStream : public ParsedRtpTransport {
 ...
 private:
  ...
  std::set<ParsedRtpTransport*> listeners_ ABSL_GUARDED_BY(listeners_mutex_);
};
class BroadcastStream : public ParsedRtpTransport {
 ...
 private:
  ...
  using ListenersSet =
      gtl::small_ordered_set<std::set<ParsedRtpTransport*>, 10>;
  ListenersSet listeners_ ABSL_GUARDED_BY(listeners_mutex_);

gtl::intrusive_list

gtl::intrusive_list<T> is a doubly-linked list where the link pointers are embedded in the elements of type T. It saves one cache line+indirection per element when compared to std::list<T*>.

Use intrusive_list to keep track of inflight requests for each index row update.

row-update-sender-inflight-set.h

std::set<int64> inflight_requests_ GUARDED_BY(mu_);
class SeqNum : public gtl::intrusive_link<SeqNum> {
  ...
  int64 val_ = -1;
  ...
};
...
gtl::intrusive_list<SeqNum> inflight_requests_ GUARDED_BY(mu_);

Limit absl::Status and absl::StatusOr usage

Even though absl::Status and absl::StatusOr types are fairly efficient, they have a non-zero overhead even in the success path and should therefore be avoided for hot routines that don’t need to return any meaningful error details (or perhaps never even fail!):

Avoid StatusOr<int64> return type for RoundUpToAlignment() function.

best_fit_allocator.cc

absl::StatusOr<int64> BestFitAllocator::RoundUpToAlignment(int64 bytes) const {
  TPU_RET_CHECK_GE(bytes, 0);

  const int64 max_aligned = MathUtil::RoundDownTo<int64>(
      std::numeric_limits<int64>::max(), alignment_in_bytes_);
  if (bytes > max_aligned) {
    return util::ResourceExhaustedErrorBuilder(ABSL_LOC)
           << "Attempted to allocate "
           << strings::HumanReadableNumBytes::ToString(bytes)
           << " which after aligning to "
           << strings::HumanReadableNumBytes::ToString(alignment_in_bytes_)
           << " cannot be expressed as an int64.";
  }

  return MathUtil::RoundUpTo<int64>(bytes, alignment_in_bytes_);
}

best_fit_allocator.h

// Rounds bytes up to nearest multiple of alignment_.
// REQUIRES: bytes >= 0.
// REQUIRES: result does not overflow int64.
// REQUIRES: alignment_in_bytes_ is a power of 2 (checked in constructor).
int64 RoundUpToAlignment(int64 bytes) const {
  DCHECK_GE(bytes, 0);
  DCHECK_LE(bytes, max_aligned_bytes_);
  int64 result =
      ((bytes + (alignment_in_bytes_ - 1)) & ~(alignment_in_bytes_ - 1));
  DCHECK_EQ(result, MathUtil::RoundUpTo<int64>(bytes, alignment_in_bytes_));
  return result;
}

Add ShapeUtil::ForEachIndexNoStatus to avoid creating a Status return object for every element of a tensor.

shape_util.h

using ForEachVisitorFunction =
    absl::FunctionRef<StatusOr<bool>(absl::Span<const int64_t>)>;
    ...
static void ForEachIndex(const Shape& shape, absl::Span<const int64_t> base,
                         absl::Span<const int64_t> count,
                         absl::Span<const int64_t> incr,
                         const ForEachVisitorFunction& visitor_function);

using ForEachVisitorFunctionNoStatus =
    absl::FunctionRef<bool(absl::Span<const int64_t>)>;
    ...
static void ForEachIndexNoStatus(
    const Shape& shape, absl::Span<const int64_t> base,
    absl::Span<const int64_t> count, absl::Span<const int64_t> incr,
    const ForEachVisitorFunctionNoStatus& visitor_function);

literal.cc

ShapeUtil::ForEachIndex(
    result_shape, [&](absl::Span<const int64_t> output_index) {
      for (int64_t i = 0, end = dimensions.size(); i < end; ++i) {
        scratch_source_index[i] = output_index[dimensions[i]];
      }
      int64_t dest_index = IndexUtil::MultidimensionalIndexToLinearIndex(
          result_shape, output_index);
      int64_t source_index = IndexUtil::MultidimensionalIndexToLinearIndex(
          shape(), scratch_source_index);
      memcpy(dest_data + primitive_size * dest_index,
             source_data + primitive_size * source_index, primitive_size);
      return true;
    });
ShapeUtil::ForEachIndexNoStatus(
    result_shape, [&](absl::Span<const int64_t> output_index) {
      // Compute dest_index
      int64_t dest_index = IndexUtil::MultidimensionalIndexToLinearIndex(
          result_shape, result_minor_to_major, output_index);

      // Compute source_index
      int64_t source_index;
      for (int64_t i = 0, end = dimensions.size(); i < end; ++i) {
        scratch_source_array[i] = output_index[dimensions[i]];
      }
      if (src_shape_dims == 1) {
        // Fast path for this case
        source_index = scratch_source_array[0];
        DCHECK_EQ(source_index,
                  IndexUtil::MultidimensionalIndexToLinearIndex(
                      src_shape, src_minor_to_major, scratch_source_span));
      } else {
        source_index = IndexUtil::MultidimensionalIndexToLinearIndex(
            src_shape, src_minor_to_major, scratch_source_span);
      }
      // Move one element from source_index in source to dest_index in dest
      memcpy(dest_data + PRIMITIVE_SIZE * dest_index,
             source_data + PRIMITIVE_SIZE * source_index, PRIMITIVE_SIZE);
      return true;
    });

In TF_CHECK_OK, avoid creating Ok object in order to test for ok().

status.h

#define TF_CHECK_OK(val) CHECK_EQ(::tensorflow::Status::OK(), (val))
#define TF_QCHECK_OK(val) QCHECK_EQ(::tensorflow::Status::OK(), (val))
extern tensorflow::string* TfCheckOpHelperOutOfLine(
    const ::tensorflow::Status& v, const char* msg);
inline tensorflow::string* TfCheckOpHelper(::tensorflow::Status v,
                                           const char* msg) {
  if (v.ok()) return nullptr;
  return TfCheckOpHelperOutOfLine(v, msg);
}
#define TF_CHECK_OK(val)                                           \
  while (tensorflow::string* _result = TfCheckOpHelper(val, #val)) \
  LOG(FATAL) << *(_result)
#define TF_QCHECK_OK(val)                                          \
  while (tensorflow::string* _result = TfCheckOpHelper(val, #val)) \
  LOG(QFATAL) << *(_result)

Remove StatusOr from the hot path of remote procedure calls (RPCs).

Removal of StatusOr from a hot path eliminated a 14% CPU regression in RPC benchmarks caused by an earlier change.

privacy_context.h

absl::StatusOr<privacy::context::PrivacyContext> GetRawPrivacyContext(
    const CensusHandle& h);

privacy_context_statusfree.h

enum class Result {
  kSuccess,
  kNoRootScopedData,
  kNoPrivacyContext,
  kNoDDTContext,
  kDeclassified,
  kNoPrequestContext
};
...
Result GetRawPrivacyContext(const CensusHandle& h,
                            PrivacyContext* privacy_context);

Bulk operations

If possible, handle many items at once rather than just one at a time.

absl::flat_hash_map compares one hash byte per key from a group of keys using a single SIMD instruction.

See Swiss Table Design Notes and related CppCon 2017 and CppCon 2019 talks by Matt Kulukundis.

raw_hash_set.h

// Returns a bitmask representing the positions of slots that match hash.
BitMask<uint32_t> Match(h2_t hash) const {
  auto ctrl = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pos));
  auto match = _mm_set1_epi8(hash);
  return BitMask<uint32_t>(_mm_movemask_epi8(_mm_cmpeq_epi8(match, ctrl)));
}

Use single operations that deal with many bytes at once and then fix things up, rather than deciding what to do for each byte.

ordered-code.cc

int len = 0;
while (val > 0) {
  len++;
  buf[9 - len] = (val & 0xff);
  val >>= 8;
}
buf[9 - len - 1] = (unsigned char)len;
len++;
FastStringAppend(dest, reinterpret_cast<const char*>(buf + 9 - len), len);
BigEndian::Store(val, buf + 1);  // buf[0] may be needed for length
const unsigned int length = OrderedNumLength(val);
char* start = buf + 9 - length - 1;
*start = length;
AppendUpto9(dest, start, length + 1);

Improve Reed-Solomon processing speed by handling multiple interleaved input buffers more efficiently in chunks.

Run on (12 X 3501 MHz CPUs); 2016-09-27T16:04:55.065995192-04:00
CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_OneOutput/3/2                      466867    351818    +24.6%
BM_OneOutput/4/2                      563130    474756    +15.7%
BM_OneOutput/5/3                      815393    688820    +15.5%
BM_OneOutput/6/3                      897246    780539    +13.0%
BM_OneOutput/8/4                     1270489   1137149    +10.5%
BM_AllOutputs/3/2                     848772    642942    +24.3%
BM_AllOutputs/4/2                    1067647    638139    +40.2%
BM_AllOutputs/5/3                    1739135   1151369    +33.8%
BM_AllOutputs/6/3                    2045817   1456744    +28.8%
BM_AllOutputs/8/4                    3012958   2484937    +17.5%
BM_AllOutputsSetUpOnce/3/2            717310    493371    +31.2%
BM_AllOutputsSetUpOnce/4/2            833866    600060    +28.0%
BM_AllOutputsSetUpOnce/5/3           1537870   1137357    +26.0%
BM_AllOutputsSetUpOnce/6/3           1802353   1398600    +22.4%
BM_AllOutputsSetUpOnce/8/4           3166930   2455973    +22.4%

Decode four integers at a time (circa 2004).

Introduced a GroupVarInt format that encodes/decodes groups of 4 variable-length integers at a time in 5-17 bytes, rather than one integer at a time. Decoding one group of 4 integers in the new format takes ~1/3rd the time of decoding 4 individually varint-encoded integers.

groupvarint.cc

const char* DecodeGroupVar(const char* p, int N, uint32* dest) {
  assert(groupvar_initialized);
  assert(N % 4 == 0);
  while (N) {
    uint8 tag = *p;
    p++;

    uint8* lenptr = &groupvar_table[tag].length[0];

#define GET_NEXT                                        \
    do {                                                \
      uint8 len = *lenptr;                              \
      *dest = UNALIGNED_LOAD32(p) & groupvar_mask[len]; \
      dest++;                                           \
      p += len;                                         \
      lenptr++;                                         \
    } while (0)
    GET_NEXT;
    GET_NEXT;
    GET_NEXT;
    GET_NEXT;
#undef GET_NEXT

    N -= 4;
  }
  return p;
}

Encode groups of 4 k-bit numbers at a time.

Added KBitStreamEncoder and KBitStreamDecoder classes to encode/decode 4 k-bit numbers at a time into a bit stream. Since K is known at compile time, the encoding and decoding can be quite efficient. E.g., since four numbers are encoded at a time, the code can assume that the stream is always byte-aligned (for even k), or nibble-aligned (for odd k).
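
A sketch of the even-K, byte-aligned case (this is our illustration, not the actual KBitStreamEncoder interface): because K is a compile-time constant and four values are written per group, each group is exactly 4*K/8 bytes and the compiler can unroll and constant-fold the loop.

#include <cstdint>
#include <string>

// Appends one group of four K-bit values. For even K <= 16 the group is
// exactly 4*K/8 bytes, so the output stays byte-aligned and the group size
// is known at compile time. Assumes each value fits in K bits.
template <int K>
void AppendGroupOf4(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                    std::string* out) {
  static_assert(K > 0 && K <= 16 && K % 2 == 0, "sketch handles even K <= 16");
  uint64_t packed = uint64_t{a} | (uint64_t{b} << K) |
                    (uint64_t{c} << (2 * K)) | (uint64_t{d} << (3 * K));
  for (int i = 0; i < 4 * K / 8; ++i) {
    out->push_back(static_cast<char>(packed & 0xff));
    packed >>= 8;
  }
}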

CLs that demonstrate multiple techniques

Sometimes a single CL contains a number of performance-improving changes that use many of the preceding techniques. Looking at the kinds of changes in these CLs is sometimes a good way to get into the mindset of making general changes to speed up the performance of some part of a system after it has been identified as a bottleneck.

Speed up GPU memory allocator by ~40%.

36-48% speedup in allocation/deallocation speed for GPUBFCAllocator:

  1. Identify chunks by a handle number, rather than by a pointer to a Chunk. Chunk data structures are now allocated in a vector<Chunk>, and a handle is an index into this vector to refer to a particular chunk. This allows the next and prev pointers in Chunk to be ChunkHandle (4 bytes), rather than Chunk* (8 bytes).

  2. When a Chunk object is no longer in use, we maintain a free list of Chunk objects, whose head is designated by ChunkHandle free_chunks_list_, and with the Chunk->next pointing to the next free list entry. Together with (1), this allows us to avoid heap allocation/deallocation of Chunk objects in the allocator, except (rarely) when the vector<Chunk> grows. It also makes all the memory for Chunk objects contiguous.

  3. Rather than having the bins_ data structure be a std::set and using lower_bound to locate the appropriate bin given a byte_size, we instead have an array of bins, indexed by a function that is log₂(byte_size/256). This allows the bin to be located with a few bit operations, rather than a binary search tree lookup. It also allows us to allocate the storage for all the Bin data structures in a contiguous array, rather than in many different cache lines. This reduces the number of cache lines that must be moved around between cores when multiple threads are doing allocations.

  4. Added fast path to GPUBFCAllocator::AllocateRaw that first tries to allocate memory without involving the retry_helper_. If an initial attempt fails (returns nullptr), then we go through the retry_helper_, but normally we can avoid several levels of procedure calls as well as the allocation/deallocation of a std::function with several arguments.

  5. Commented out most of the VLOG calls. These can be reenabled selectively when needed for debugging purposes by uncommenting and recompiling.

Added multi-threaded benchmark to test allocation under contention.

Speeds up ptb_word_lm on my desktop machine with a Titan X card from 8036 words per second to 8272 words per second (+2.9%).

Run on (40 X 2801 MHz CPUs); 2016/02/16-15:12:49
CPU: Intel Ivybridge with HyperThreading (20 cores) dL1:32KB dL2:256KB dL3:25MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_Allocation                            347       184    +47.0%
BM_AllocationThreaded/1                  351       181    +48.4%
BM_AllocationThreaded/4                 2470      1975    +20.0%
BM_AllocationThreaded/16               11846      9507    +19.7%
BM_AllocationDelayed/1                   392       199    +49.2%
BM_AllocationDelayed/10                  285       169    +40.7%
BM_AllocationDelayed/100                 245       149    +39.2%
BM_AllocationDelayed/1000                238       151    +36.6%

Speed up Pathways throughput by ~20% via a set of miscellaneous changes.

  • Unified a bunch of special fast descriptor parsing functions into a single ParsedDescriptor class and use this class in more places to avoid expensive full parse calls.

  • Change several protocol buffer fields from string to bytes (avoids unnecessary utf-8 checks and associated error handling code).

  • DescriptorProto.inlined_contents is now a string, not a Cord (it is expected to be used only for small-ish tensors). This necessitated the addition of a bunch of copying helpers in tensor_util.cc (need to now support both strings and Cords).

  • Use flat_hash_map instead of std::unordered_map in a few places.

  • Added MemoryManager::LookupMany for use by Stack op instead of calling Lookup per batch element. This change reduces setup overhead like locking.

  • Removed some unnecessary string creation in TransferDispatchOp.

  • Performance results for transferring a batch of 1000 1KB tensors from one component to another in the same process:

Before: 227.01 steps/sec
After:  272.52 steps/sec (+20% throughput)

~15% XLA compiler performance improvement through a series of changes.

Some changes to speed up XLA compilation:

  1. In SortComputationsByContent, return false if a == b in comparison function, to avoid serializing and fingerprinting long computation strings.

  2. Turn CHECK into DCHECK to avoid touching an extra cache line in HloComputation::ComputeInstructionPostOrder

  3. Avoid making an expensive copy of the front instruction in CoreSequencer::IsVectorSyncHoldSatisfied().

  4. Rework 2-argument HloComputation::ToString and HloComputation::ToCord routines to do the bulk of the work in terms of appending to std::string, rather than appending to a Cord.

  5. Change PerformanceCounterSet::Increment to just do a single hash table lookup rather than two.

  6. Streamline Scoreboard::Update code

Overall speedup of 14% in XLA compilation time for one important model.

Speed up low level logging in Google Meet application code.

Speed up ScopedLogId, which is on the critical path for each packet.

  • Removed the LOG_EVERY_N(ERROR, ...) messages that seemed to be there only to see if invariants were violated.
  • Inlined the PushLogId() and PopLogid() routines (since without the LOG_EVERY_N_SECONDS(ERROR, ...) statements, they are now small enough to inline).
  • Switched to using a fixed array of size 4 and an 'int size' variable instead of an InlinedVector<...> for maintaining the thread local state. Since we never were growing beyond size 4 anyway, the InlinedVector's functionality was more general than needed.

Base: Baseline plus the code in scoped_logid_test.cc to add the benchmark
New: This changelist

CPU: Intel Ivybridge with HyperThreading (20 cores) dL1:32KB dL2:256KB dL3:25MB
Benchmark                                      Base (ns)    New (ns) Improvement
----------------------------------------------------------------------------
BM_ScopedLogId/threads:1                               8           4    +52.6%
BM_ScopedLogId/threads:2                               8           4    +51.9%
BM_ScopedLogId/threads:4                               8           4    +52.9%
BM_ScopedLogId/threads:8                               8           4    +52.1%
BM_ScopedLogId/threads:16                             11           6    +44.0%

Reduce XLA compilation time by ~31% by improving Shape handling.

Several changes to improve XLA compiler performance:

  1. Improved performance of ShapeUtil::ForEachIndex... iteration in a few ways:

    • In ShapeUtil::ForEachState, save just pointers to the arrays represented by the spans, rather than the full span objects.

    • Pre-form a ShapeUtil::ForEachState::indexes_span pointing at the ShapeUtil::ForEachState::indexes vector, rather than constructing this span from the vector on every loop iteration.

    • Save a ShapeUtil::ForEachState::indexes_ptr pointer to the backing store of the ShapeUtil::ForEachState::indexes vector, allowing simple array operations in ShapeUtil::ForEachState::IncrementDim(), rather than more expensive vector::operator[] operations.

    • Save a ShapeUtil::ForEachState::minor_to_major array pointer initialized in the constructor by calling shape.layout().minor_to_major().data() rather than calling LayoutUtil::Minor(...) for each dimension for each iteration.

    • Inlined the ShapeUtil::ForEachState constructor and the ShapeUtil::ForEachState::IncrementDim() routines.

  2. Improved the performance of ShapeUtil::ForEachIndex iteration for call sites that don't need the functionality of returning a Status in the passed in function. Did this by introducing ShapeUtil::ForEachIndexNoStatus variants, which accept a ForEachVisitorFunctionNoStatus (which returns a plain bool). This is faster than the ShapeUtil::ForEachIndex routines, which accept a ForEachVisitorFunction (which returns a StatusOr<bool>, which requires an expensive StatusOr<bool> destructor call per element that we iterate over).

    • Used this variant of ShapeUtil::ForEachIndexNoStatus in LiteralBase::Broadcast and GenerateReduceOutputElement.
  3. Improved performance of LiteralBase::Broadcast in several ways:

    • Introduced templated BroadcastHelper routine in literal.cc that is specialized for different primitive byte sizes (without this, primitive_size was a runtime variable and so the compiler couldn't do a very good job of optimizing the memcpy that occurred per element, and would invoke the general memcpy path that assumes the byte count is fairly large, even though in our case it is a tiny power of 2 (typically 1, 2, 4, or 8)).

    • Avoided all but one of ~(5 + num_dimensions + num_result_elements) virtual calls per Broadcast call by making a single call to 'shape()' at the beginning of the LiteralBase::Broadcast routine. The innocuous looking 'shape()' calls that were sprinkled throughout end up boiling down to "root_piece().subshape()", where subshape() is a virtual function.

    • In the BroadcastHelper routine, special-cased the source having a single dimension and avoided a call to IndexUtil::MultiDimensionalIndexToLinearIndex in that case.

    • In BroadcastHelper, used a scratch_source_array pointer variable that points into the backing store of the scratch_source_index vector, and used that directly to avoid vector::operator[] operations inside the per-element code. Also pre-computed a scratch_source_span that points to the scratch_source_index vector outside the per-element loop in BroadcastHelper, to avoid constructing a span from the vector on each element.

    • Introduced new three-argument variant of IndexUtil::MultiDimensionalIndexToLinearIndex where the caller passes in the minor_to_major span associated with the shape argument. Used this in BroadcastHelper to compute this for the src and dst shapes once per Broadcast, rather than once per element copied.

  4. In ShardingPropagation::GetShardingFromUser, for the HloOpcode::kTuple case, only call user.sharding().GetSubSharding(...) if we have found the operand to be of interest. Avoiding calling it eagerly reduces CPU time in this routine for one lengthy compilation from 43.7s to 2.0s.

  5. Added benchmarks for ShapeUtil::ForEachIndex and Literal::Broadcast and for the new ShapeUtil::ForEachIndexNoStatus.

Base is with the benchmark additions of
BM_ForEachIndex and BM_BroadcastVectorToMatrix (and BUILD file change to add
benchmark dependency), but no other changes.

New is this CL.

Run on (72 X 1357.56 MHz CPU s) CPU Caches: L1 Data 32 KiB (x36)
L1 Instruction 32 KiB (x36) L2 Unified 1024 KiB (x36) L3 Unified 25344 KiB (x2)

Benchmark                                      Base (ns)    New (ns) Improvement
----------------------------------------------------------------------------
BM_MakeShape                                       18.40       18.90     -2.7%
BM_MakeValidatedShape                              35.80       35.60     +0.6%
BM_ForEachIndex/0                                  57.80       55.80     +3.5%
BM_ForEachIndex/1                                  90.90       85.50     +5.9%
BM_ForEachIndex/2                               1973606     1642197     +16.8%

The newly added ForEachIndexNoStatus is considerably faster than the ForEachIndex variant (it only exists in this new CL, but the benchmark work that is done by BM_ForEachIndexNoStatus/NUM is comparable to the BM_ForEachIndex/NUM results above).

Benchmark                                      Base (ns)    New (ns) Improvement
----------------------------------------------------------------------------
BM_ForEachIndexNoStatus/0                             0        46.90    ----
BM_ForEachIndexNoStatus/1                             0        65.60    ----
BM_ForEachIndexNoStatus/2                             0     1001277     ----

Broadcast performance improves by ~58%.

Benchmark                                      Base (ns)    New (ns) Improvement
----------------------------------------------------------------------------
BM_BroadcastVectorToMatrix/16/16                   5556        2374     +57.3%
BM_BroadcastVectorToMatrix/16/1024               319510      131075     +59.0%
BM_BroadcastVectorToMatrix/1024/1024           20216949     8408188     +58.4%

Macro results from doing ahead-of-time compilation of a large language model (program does more than just the XLA compilation, but spends a bit less than half its time in XLA-related code):

Baseline program overall: 573 seconds
With this CL, program overall: 465 seconds (+19% improvement)

Time spent in compiling the two largest XLA programs in running this program:

Baseline: 141s + 143s = 284s
With this CL: 99s + 95s = 194s (+31% improvement)

Reduce compilation time for large programs by ~22% in Plaque (a distributed execution framework).

Small tweaks to speed up compilation by ~22%.

  1. Speed up detection of whether or not two nodes share a common source. Previously, we would get the sources for each node in sorted order and then do a sorted intersection. We now place the sources for one node in a hash-table and then iterate over the other node's sources checking the hash-table.
  2. Reuse the same scratch hash-table in step 1.
  3. When generating compiled proto, keep a single btree keyed by pair<package, opname> instead of a btree of btrees.
  4. Store pointer to opdef in the preceding btree instead of copying the opdef into the btree.

Measurement of speed on large programs (~45K ops):

name             old time/op  new time/op  delta
BM_CompileLarge   28.5s ± 2%   22.4s ± 2%  -21.61%  (p=0.008 n=5+5)

MapReduce improvements (~2X speedup for wordcount benchmark).

Mapreduce speedups:

  1. The combiner data structures for the SafeCombinerMapOutput class have been changed. Rather than using a hash_multimap<SafeCombinerKey, StringPiece>, which had a hash table entry for each unique key/value inserted in the table, we instead use a hash_map<SafeCombinerKey, ValuePtr*> (where ValuePtr is a linked list of values and repetition counts). This helps in three ways:

    • It significantly reduces memory usage, since we only use "sizeof(ValuePtr) + value_len" bytes for each value, rather than "sizeof(SafeCombinerKey) + sizeof(StringPiece) + value_len + new hash table entry overhead" for each value. This means that we flush the reducer buffer less often.

    • It's significantly faster, since we avoid extra hash table entries when we're inserting a new value for a key that already exists in the table (and instead we just hook the value into the linked list of values for that key).

    • Since we associate a repetition count with each value in the linked list, we can represent this sequence:

      Output(key, "1");
      Output(key, "1");
      Output(key, "1");
      Output(key, "1");
      Output(key, "1");
      

    as a single entry in the linked list for "key" with a repetition count of 5. Internally we yield "1" five times to the user-level combining function. (A similar trick could be applied on the reduce side, perhaps).

  2. (Minor) Added a test for "nshards == 1" to the default MapReductionBase::KeyFingerprintSharding function that avoids fingerprinting the key entirely if we are just using 1 reduce shard (since we can just return 0 directly in that case without examining the key).

  3. Turned some VLOG(3) statements into DVLOG(3) in the code path that is called for each key/value added to the combiner.

Reduces time for one wordcount benchmark from 12.56s to 6.55s.

Rework the alarm handling code in the SelectServer to significantly improve its performance (adding and removing an alarm goes from 771 ns to 281 ns).

Reworked the alarm handling code in the SelectServer to significantly improve its performance.

Changes:

  1. Switched to using AdjustablePriorityQueue<Alarm> instead of a set<Alarm*> for the AlarmQueue. This significantly speeds up alarm handling, reducing the time taken to add and remove an alarm from 771 nanoseconds to 281 nanoseconds. This change avoids an allocation/deallocation per alarm setup (for the red-black tree node in the STL set object), and also gives much better cache locality: since the AdjustablePriorityQueue is a heap implemented in a vector rather than a red-black tree, fewer cache lines are touched when manipulating the AlarmQueue on every trip through the selectserver loop.

  2. Converted AlarmList in Alarmer from a hash_map to a dense_hash_map to avoid another allocation/deallocation per alarm addition/deletion (this also improves cache locality when adding/removing alarms).

  3. Removed the num_alarms_stat_ and num_closures_stat_ MinuteTenMinuteHourStat objects, and the corresponding exported variables. Although monitoring these seems nice, in practice they add significant overhead to critical networking code. If I had left these variables in as Atomic32 variables instead of MinuteTenMinuteHourStat, they would have still increased the cost of adding and removing alarms from 281 nanoseconds to 340 nanoseconds.

Benchmark results

Benchmark                      Time(ns)  CPU(ns) Iterations
-----------------------------------------------------------
BM_AddAlarm/1                       902      771     777777

With this change

Benchmark                      Time(ns)  CPU(ns) Iterations
-----------------------------------------------------------
BM_AddAlarm/1                       324      281    2239999

3.3X performance in index serving speed!

We found a number of performance issues when planning a switch from on-disk to in-memory index serving in 2001. This change fixed many of these problems and took us from 150 to over 500 in-memory queries per second (for a 2 GB in-memory index on a dual-processor Pentium III machine).

  • Lots of performance improvements to index block decoding speed (8.9 MB/s to 13.1 MB/s for a microbenchmark).
  • We now checksum the block during decoding, which allows all of our getsymbol operations to be done without any bounds checking.
  • We have grungy macros that hold the various fields of a BitDecoder in local variables over entire loops, and then store them back at the end of the loops.
  • We use inline assembly to get at the 'bsf' instruction on Intel chips for getUnary (which finds the index of the first 1 bit in a word).
  • When decoding values into a vector, we resize the vector outside of the loop and just walk a pointer along the vector, rather than doing a bounds-checked access to store every value.
  • During docid decoding, we keep the docids in local docid space, to avoid multiplying by num_shards_. Only when we need the actual docid value do we multiply by num_shards_ and add my_shard_.
  • The IndexBlockDecoder now exports an interface 'AdvanceToDocid' that returns the index of the first docid ≥ "d". This permits the scanning to be done in terms of local docids, rather than forcing the conversion of each local docid to a global docid when the client calls GetDocid(index) for every index in the block.
  • Decoding of position data for documents is now done on demand, rather than being done eagerly for the entire block when the client asked for position data for any document within the block.
  • If the index block being decoded ends within 4 bytes of a page boundary, we copy it to a local buffer. This allows us to always load our bit decoding buffer via a 4-byte load, without having to worry about seg faults if we run off the end of a mmapped page.
  • We only initialize the first nterms_ elements of various scoring data structures, rather than initializing all MAX_TERMS of them (in some cases, we were unnecessarily memsetting 20K to 100K of data per document scored).
  • Avoid round_to_int and subsequent computation on intermediate scoring values when the value being computed is 0 (the subsequent computation was just writing '0' over the 0 that we had memset in these cases, and this was the most common case).
  • Made a bounds check on scoring data structures into a debug-mode assertion.

Suggested citation

If you want to cite this document, we suggest:

Jeffrey Dean & Sanjay Ghemawat, Performance Hints, 2025, https://abseil.io/fast/hints.html

Or in BibTeX:

@misc{DeanGhemawatPerformance2025,
  author = {Dean, Jeffrey and Ghemawat, Sanjay},
  title = {Performance Hints},
  year = {2025},
  howpublished = {\url{https://abseil.io/fast/hints.html}},
}

Acknowledgments

Many colleagues have provided helpful feedback on this document, including:

  • Adrian Ulrich
  • Alexander Kuzmin
  • Alexei Bendebury
  • Alexey Alexandrov
  • Amer Diwan
  • Austin Sims
  • Benoit Boissinot
  • Brooks Moses
  • Chris Kennelly
  • Chris Ruemmler
  • Danila Kutenin
  • Darryl Gove
  • David Majnemer
  • Dmitry Vyukov
  • Emanuel Taropa
  • Felix Broberg
  • Francis Birck Moreira
  • Gideon Glass
  • Henrik Stewenius
  • Jeremy Dorfman
  • John Dethridge
  • Kurt Kluever
  • Kyle Konrad
  • Lucas Pereira
  • Marc Eaddy
  • Michael Marty
  • Michael Whittaker
  • Mircea Trofin
  • Misha Brukman
  • Nicolas Hillegeer
  • Ranjit Mathew
  • Rasmus Larsen
  • Soheil Hassas Yeganeh
  • Srdjan Petrovic
  • Steinar H. Gunderson
  • Stergios Stergiou
  • Steven Timotius
  • Sylvain Vignaud
  • Thomas Etter
  • Thomas Köppe
  • Tim Chestnutt
  • Todd Lipcon
  • Vance Lankhaar
  • Victor Costan
  • Yao Zuo
  • Zhou Fang
  • Zuguang Yang