TCMalloc
By Chris Kennelly, Google Software Engineer
We are happy to announce the arrival of TCMalloc, a fast memory allocator with useful profiling and introspection features.
The source code can be found on GitHub. This is a distinct repository, allowing Abseil to be used independently of TCMalloc. The library includes implementations of malloc(), free(), and ::operator new.
This blog post covers getting started and a peek at one of our key optimization techniques: per-CPU caches. For a more detailed explanation of the design of TCMalloc and its features, take a look at our design note.
Adding TCMalloc to a binary involves specifying it as the malloc attribute of a Bazel cc_binary or cc_test rule. For example, with Abseil hello:
cc_binary(
    name = "hello_main",
    srcs = ["hello_main.cc"],
    deps = [
        ":hello",
    ],
    malloc = "@com_google_tcmalloc//tcmalloc",
)
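The same malloc attribute applies to a cc_test rule. A hypothetical test target (the target and file names here are illustrative, not from the TCMalloc repository) might look like:

```python
cc_test(
    name = "hello_test",
    srcs = ["hello_test.cc"],
    deps = [
        ":hello",
    ],
    # Link the test against TCMalloc instead of the default allocator.
    malloc = "@com_google_tcmalloc//tcmalloc",
)
```

Because malloc is set per target, you can benchmark or test the same code against different allocators without changing any source files.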
We can also leverage the telemetry TCMalloc provides about our heap with MallocExtension.
#include <iostream>

#include "absl/types/optional.h"
#include "tcmalloc/malloc_extension.h"

int main(int argc, char** argv) {
  absl::optional<size_t> heap_size =
      tcmalloc::MallocExtension::GetNumericProperty(
          "generic.current_allocated_bytes");
  if (heap_size.has_value()) {
    std::cout << "heap size = " << *heap_size << " bytes" << std::endl;
  }
}
MallocExtension is a separate library from TCMalloc, allowing it to be used when another malloc implementation is linked in, for example, when using C++ sanitizers. The library is crafted so that although the telemetry and controls it provides will be inoperative, code using it will still compile and link.
TCMalloc’s name comes from “thread caching” malloc. These caches allow us to allocate memory most of the time without taking locks, falling back to a more centralized cache when the thread cache is exhausted or full.
The cost of per-thread caches has increased as thread counts in applications have grown. Because we do not allow one thread to access another's thread cache (this is how we avoid locks), we need to balance carefully between making each cache large enough to hold the objects a thread needs and avoiding stranding memory in idle caches: objects held by an idle thread are effectively wasted.
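The thread-caching idea can be sketched in a few lines (a toy illustration, not TCMalloc's real code; all names are hypothetical): each thread keeps a small cache of free blocks it can pop without locking, and only misses or overflows touch a locked central cache.

```cpp
#include <mutex>
#include <vector>

// Central cache shared by all threads; access requires a lock.
struct CentralCache {
  std::mutex mu;
  std::vector<void*> blocks;

  void* Pop() {
    std::lock_guard<std::mutex> lock(mu);
    if (blocks.empty()) return ::operator new(64);  // refill from the system
    void* b = blocks.back();
    blocks.pop_back();
    return b;
  }

  void Push(void* b) {
    std::lock_guard<std::mutex> lock(mu);
    blocks.push_back(b);
  }
};

CentralCache central;

// Per-thread cache: no lock needed because only this thread ever touches it.
thread_local std::vector<void*> thread_cache;

void* Allocate() {
  if (!thread_cache.empty()) {  // fast path: lock-free pop
    void* b = thread_cache.back();
    thread_cache.pop_back();
    return b;
  }
  return central.Pop();         // slow path: take the central lock
}

void Deallocate(void* b) {
  if (thread_cache.size() < 8) {  // bound the cache to limit stranded memory
    thread_cache.push_back(b);
  } else {
    central.Push(b);
  }
}
```

The bound on the per-thread cache size is exactly the trade-off described above: a larger bound makes the lock-free fast path hit more often, but strands more memory when the thread goes idle.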
Our per-CPU caches take advantage of the fact that only one thread can run on a single core at any given time and that pre-emptions during the critical sequence of an allocation are relatively rare. These caches are based on the Linux kernel’s restartable sequences (RSEQ) feature, developed by Paul Turner and Andrew Hunter at Google and Mathieu Desnoyers at EfficiOS. Restartable sequences let us execute a region of code atomically and restart if we are interrupted by the kernel.
RSEQ allows us to share a cache across many interleaving threads of execution, improving the usefulness of the cache. When one thread is idle, another thread can still allocate from the core's cache. Additionally, because the number of caches is bounded by the total number of CPUs rather than the number of threads, we can allocate an array-based metadata structure for each cache. This lets us reduce the memory indirections common in the linked-list data structures of our thread-cache implementation.
For systems without this feature, we fall back to a per-thread cache implementation.
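The array-based layout described above can be sketched as follows (illustrative only, with hypothetical names; real TCMalloc makes the push and pop atomic with restartable sequences, while here the current CPU id is simply passed as a parameter):

```cpp
#include <array>
#include <cstddef>

constexpr int kMaxCpus = 256;      // caches bounded by CPU count, not threads
constexpr int kSlotsPerCpu = 64;

// Each per-CPU cache is a fixed-size array of object pointers: contiguous
// storage, so a pop is an index decrement rather than pointer chasing.
struct PerCpuCache {
  std::array<void*, kSlotsPerCpu> slots;
  size_t size = 0;  // number of cached objects
};

// All caches live in one flat array indexed by CPU number, allocated up
// front -- the array-based metadata structure mentioned above.
std::array<PerCpuCache, kMaxCpus> caches;

// Pop from the cache of the CPU we are running on; null signals a miss,
// and the caller falls back to a central cache.
void* PerCpuPop(int cpu) {
  PerCpuCache& c = caches[cpu];
  if (c.size == 0) return nullptr;
  return c.slots[--c.size];
}

// Push onto the current CPU's cache; false means the cache is full.
bool PerCpuPush(int cpu, void* obj) {
  PerCpuCache& c = caches[cpu];
  if (c.size == kSlotsPerCpu) return false;
  c.slots[c.size++] = obj;
  return true;
}
```

What RSEQ adds, and what this sketch omits, is the guarantee that the read-index/write-slot/update-index sequence either completes on one CPU or is restarted if the kernel preempts it partway through.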
For more information, consult the reference documentation for malloc(), free(), and ::operator new.