Performance Tip of the Week #21: Improving the efficiency of your regular expressions

Originally posted as Fast TotW #21 on January 16, 2020

By Paul Wankadia and Darryl Gove

Updated 2023-03-02


Regular expressions are used, misused and abused nearly everywhere. Google is no exception, alas, and at our scale, even a simple change can save hundreds or thousands of cores. In this tip, we describe ways to use RE2 more efficiently.

NOTE: This tip is specifically about RE2 and C++. A number of the ideas below are universally applicable, but discussion of other libraries and other languages is out of scope.

Using regular expressions: a representative sample

As a prelude, let’s consider an example of how regular expressions are often used. This snippet looks for a zone ID at the end of the zone_name string and extracts its value into the zone_id integer:

int zone_id;
if (RE2::FullMatch(zone_name, R"(.*\.zone(\d+))", &zone_id)) {

This tip describes several techniques for improving efficiency in situations such as this. These fall into two broad categories: improving the code that uses regular expressions; and improving the regular expressions themselves.

Writing more efficient code

A few words about RE2 objects

In order to understand why the following techniques matter, we need to talk briefly about RE2 objects. In the initial example, we passed a pattern string to RE2::FullMatch(). Passing a pattern string instead of an RE2 object implicitly constructs a temporary RE2 object. During construction, RE2 parses the pattern string to a syntax tree and compiles the syntax tree to an automaton. Depending on the complexity of the regular expression, construction can require a lot of CPU time and can build an automaton that will have a large memory footprint.

Use Abseil functions for literal strings

In many situations, regular expressions are unnecessary because simple string operations will suffice. For exact matching, absl::string_view defines operator==(). For substring, prefix and suffix matching, Abseil provides absl::StrContains(), absl::StartsWith() and absl::EndsWith(), respectively, in absl/strings/match.h. These are much faster than regular expressions and more readable, so using them where possible is recommended.

For example:

const RE2 re(absl::StrCat(row_key, ":.*"));
for (const auto& row : rows) {
  if (RE2::FullMatch(row, re)) {

could be rewritten as:

const std::string prefix = absl::StrCat(row_key, ":");
for (const auto& row : rows) {
  if (absl::StartsWith(row, prefix)) {

Minimise RE2 object churn

As discussed above, constructing RE2 objects can be expensive, so as a rule of thumb, they should be long-lived. Precompiling or caching them where possible is recommended.

Use LazyRE2 for static or global regular expressions

The RE2 class is unsafe for direct use with static or global regular expressions. Use LazyRE2 instead because it lazily constructs the underlying RE2 object and never destructs it.

The initial example could be rewritten as:

static constexpr LazyRE2 kZoneRe = {R"(.*\.zone(\d+))"};
int zone_id;
if (RE2::FullMatch(zone_name, *kZoneRe, &zone_id)) {

Writing more efficient regular expressions

Use RE2::PartialMatch() to avoid leading or trailing .*

Using RE2::FullMatch() with leading or trailing .* is an antipattern. Instead, change it to RE2::PartialMatch() and remove the .*. RE2::PartialMatch() performs an unanchored search, so it is also necessary to anchor the regular expression (i.e. with ^ or $) to indicate that it must match at the start or end of the string. Let’s look at some examples.

Replacing leading .*

For a regular expression with a leading .*, the leading .* should be removed and the regular expression should be anchored at the end with $. For example, .*-(?:bar|qux)-foo should become -(?:bar|qux)-foo$.

The leading .* prevents RE2 from terminating reverse execution (i.e. backwards from the end of the input string) after matching the last byte of interest. When the remainder of the input string is relatively large, RE2 has to do a lot more work for no benefit. More about that shortly…

The initial example could be rewritten further as:

static constexpr LazyRE2 kZoneRe = {R"(\.zone(\d+)$)"};
int zone_id;
if (RE2::PartialMatch(zone_name, *kZoneRe, &zone_id)) {

Replacing trailing .*

For a regular expression with a trailing .*, the trailing .* should be removed and the regular expression should be anchored at the start with ^. For example, foo-(?:bar|qux)-.* should become ^foo-(?:bar|qux)-. RE2 will detach the ^foo- prefix and match it with memcmp(3). (Note that this optimisation applies when the regular expression has ^ plus some literal string as its prefix–even when using RE2::FullMatch()!)

The trailing .* prevents RE2 from terminating execution after matching the last byte of interest. When the remainder of the input string is relatively large, RE2 has to do a lot more work for no benefit. More about that very shortly…

Replacing leading and trailing .*

For a regular expression with both a leading and trailing .*, both the leading and trailing .* should be removed. For example, .*-(?:bar|qux)-.* should become -(?:bar|qux)-.

The leading .* prevents RE2 from using memchr(3) to find the first byte of interest. (Note that this optimisation applies when the regular expression has one distinct first byte such as the f in foo\d+, but not when there are two or more possible first bytes such as the \d[0-9] in \d+bar.)

What .* really means

The problem with .* is that it doesn’t mean “match anything” by default. In fact, .[^\n] by default, so it matches any character that isn’t newline. RE2 defaults to UTF-8 encoding, so it builds an automaton that handles multibyte characters. Consequently, the default meaning of .* is “match zero or more multibyte characters that aren’t newline”. An automaton steps over the input string one byte at a time, so executing .* is much slower than using memchr(3) (which is typically implemented using SIMD) and it is infinitely slower than terminating execution as soon as the regular expression has been matched.

Minimise capturing groups and submatches

In situations where parentheses are necessary for grouping, use a non-capturing group (i.e. (?:)) unless a submatch is being extracted. Moreover, extract as few submatches as possible because execution engine selection depends in part on how many submatches the caller wants.

Passing nullptr for a no-op submatch is an antipattern. Instead, change the capturing group to a non-capturing group.

Use absl::string_view for submatches

When extracting a submatch, use absl::string_view. Extracting to std::string necessarily involves a string copy.

The same advice applies even when extracting a submatch with numeric conversion. Extracting to a numeric type also involves a string copy because strtol(3) et al. require NUL-terminated strings. Extracting to absl::string_view and calling absl::SimpleAtoi() et al. avoids the string copy.

The initial example could be rewritten even further as:

static constexpr LazyRE2 kZoneRe = {R"(\.zone(\d+)$)"};
absl::string_view zone_id_str;
int zone_id;
if (RE2::PartialMatch(zone_name, *kZoneRe, &zone_id_str) &&
    absl::SimpleAtoi(zone_id_str, &zone_id)) {

Minimise ambiguity

The “cost” of executing a regular expression depends greatly on ambiguity. See the comments in re2/ for details, but the rules can be summarised as:

  • it must be immediately obvious which branch of an alternation to take; and
  • it must be immediately obvious when a repetition ends.

For example, (.+)/ and (.+?)/ are ambiguous because . and / aren’t disjoint–a '/' byte could be consumed by either. In contrast, ([^/]+)/ is unambiguous because [^/] and / are disjoint. It isn’t always possible to eliminate ambiguity, but it is often possible to reduce it.

Exception: common prefixes of alternations

RE2 will factor common prefixes of alternations. For example, abacus|abalone parses as aba(?:cus|lone). This optimisation permits the regular expression to be written in a more readable manner. Note that it applies to contiguous runs, so lexicographically sorting the subexpressions is recommended.

Avoid counted repetition

A subexpression of the form x{n}, x{n,} or x{n,m} results in x being duplicated at least n or m times in the automaton. It’s bad enough (particularly when using Unicode character classes) for anchored matching, but it’s dangerous for unanchored matching because it’s easy to blow up the number of DFA states and their sizes. Each RE2 object has a memory budget (8 MiB by default) mostly for caching DFA states; exhausting the memory budget incurs a considerable performance hit.

For example, constructing the entire DFA needs 22 KiB for the anchored ^[a-q][^u-z]{13}x versus 6 MiB for the unanchored [a-q][^u-z]{13}x. The former isn’t useful for searching due to being anchored, but the latter isn’t efficient due to the exponential blowup in DFA states from being unanchored. Matching [a-q][^u-z]+x instead (i.e. using uncounted repetition) may be possible, but checking the length of the match in a subsequent step might not be acceptable.

Bonus: Matching multiple regular expressions

To finish, let’s touch on RE2::Set and FilteredRE2. These are two very different approaches to matching multiple regular expressions, which is useful in situations where performing multiple passes over the input string would be prohibitively expensive. Such situations include keyword matching and logs processing.

RE2::Set combines the regular expressions into one “many match” automaton. It has strengths and weaknesses similar to those of normal DFA execution and is suited to shorter, less complex regular expressions.

FilteredRE2 identifies the literal string “atoms” within the regular expressions; given which of those “atoms” occur within the input string, it determines which regular expressions potentially match. It requires an initial pass over the input string with something like Aho-Corasick or RE2::Set and is suited to longer, more complex regular expressions.

You might also like to read about lightgrep, Hyperscan and RE-tree, which were designed specifically for matching multiple regular expressions.

Subscribe to the Abseil Blog