Originally posted as Fast TotW #21 on January 16, 2020
By Paul Wankadia and Darryl Gove
Updated 2023-03-02
Quicklink: abseil.io/fast/21
Regular expressions are used, misused and abused nearly everywhere. Google is no exception, alas, and at our scale, even a simple change can save hundreds or thousands of cores. In this tip, we describe ways to use RE2 more efficiently.
NOTE: This tip is specifically about RE2 and C++. A number of the ideas below are universally applicable, but discussion of other libraries and other languages is out of scope.
As a prelude, let’s consider an example of how regular expressions are often
used. This snippet looks for a zone ID at the end of the zone_name
string and
extracts its value into the zone_id
integer:
int zone_id; if (RE2::FullMatch(zone_name, R"(.*\.zone(\d+))", &zone_id)) {
This tip describes several techniques for improving efficiency in situations such as this. These fall into two broad categories: improving the code that uses regular expressions; and improving the regular expressions themselves.
RE2
objectsIn order to understand why the following techniques matter, we need to talk
briefly about RE2
objects. In the initial example, we passed a pattern string
to RE2::FullMatch()
. Passing a pattern string instead of an RE2
object
implicitly constructs a temporary RE2
object. During construction, RE2 parses
the pattern string to a syntax tree and compiles the syntax tree to an
automaton. Depending on the complexity of the regular expression, construction
can require a lot of CPU time and can build an automaton that will have a large
memory footprint.
In many situations, regular expressions are unnecessary because simple string
operations will suffice. For exact matching, absl::string_view
defines
operator==()
. For substring, prefix and suffix matching, Abseil provides
absl::StrContains()
, absl::StartsWith()
and absl::EndsWith()
,
respectively, in
absl/strings/match.h.
These are much faster than regular expressions and more readable, so using them
where possible is recommended.
For example:
const RE2 re(absl::StrCat(row_key, ":.*")); for (const auto& row : rows) { if (RE2::FullMatch(row, re)) {
could be rewritten as:
const std::string prefix = absl::StrCat(row_key, ":"); for (const auto& row : rows) { if (absl::StartsWith(row, prefix)) {
RE2
object churnAs discussed above, constructing RE2
objects can be expensive, so as a rule of
thumb, they should be long-lived. Precompiling or caching them where possible is
recommended.
LazyRE2
for static or global regular expressionsThe RE2
class is unsafe for direct use with static or global regular
expressions. Use LazyRE2
instead because it lazily constructs the underlying
RE2
object and never destructs it.
The initial example could be rewritten as:
static constexpr LazyRE2 kZoneRe = {R"(.*\.zone(\d+))"}; int zone_id; if (RE2::FullMatch(zone_name, *kZoneRe, &zone_id)) {
RE2::PartialMatch()
to avoid leading or trailing .*
Using RE2::FullMatch()
with leading or trailing .*
is an antipattern.
Instead, change it to RE2::PartialMatch()
and remove the .*
.
RE2::PartialMatch()
performs an unanchored search, so it is also necessary to
anchor the regular expression (i.e. with ^
or $
) to indicate that it must
match at the start or end of the string. Let’s look at some examples.
.*
For a regular expression with a leading .*
, the leading .*
should be removed
and the regular expression should be anchored at the end with $
. For example,
.*-(?:bar|qux)-foo
should become -(?:bar|qux)-foo$
.
The leading .*
prevents RE2 from terminating reverse execution (i.e. backwards
from the end of the input string) after matching the last byte of interest. When
the remainder of the input string is relatively large, RE2 has to do a lot more
work for no benefit. More about that shortly…
The initial example could be rewritten further as:
static constexpr LazyRE2 kZoneRe = {R"(\.zone(\d+)$)"}; int zone_id; if (RE2::PartialMatch(zone_name, *kZoneRe, &zone_id)) {
.*
For a regular expression with a trailing .*
, the trailing .*
should be
removed and the regular expression should be anchored at the start with ^
. For
example, foo-(?:bar|qux)-.*
should become ^foo-(?:bar|qux)-
. RE2 will detach
the ^foo-
prefix and match it with memcmp(3)
. (Note that this optimisation
applies when the regular expression has ^
plus some literal string as its
prefix–even when using RE2::FullMatch()
!)
The trailing .*
prevents RE2 from terminating execution after matching the
last byte of interest. When the remainder of the input string is relatively
large, RE2 has to do a lot more work for no benefit. More about that very
shortly…
.*
For a regular expression with both a leading and trailing .*
, both the leading
and trailing .*
should be removed. For example, .*-(?:bar|qux)-.*
should
become -(?:bar|qux)-
.
The leading .*
prevents RE2 from using memchr(3)
to find the first byte of
interest. (Note that this optimisation applies when the regular expression has
one distinct first byte such as the f
in foo\d+
, but not when there are two
or more possible first bytes such as the \d
≡ [0-9]
in \d+bar
.)
.*
really meansThe problem with .*
is that it doesn’t mean “match anything” by default. In
fact, .
≡ [^\n]
by default, so it matches any character that isn’t newline.
RE2 defaults to UTF-8 encoding, so it builds an automaton that handles multibyte
characters. Consequently, the default meaning of .*
is “match zero or more
multibyte characters that aren’t newline”. An automaton steps over the input
string one byte at a time, so executing .*
is much slower than using
memchr(3)
(which is typically implemented using SIMD) and it is infinitely
slower than terminating execution as soon as the regular expression has been
matched.
In situations where parentheses are necessary for grouping, use a non-capturing
group (i.e. (?:
… )
) unless a submatch is being extracted. Moreover,
extract as few submatches as possible because execution engine selection depends
in part on how many submatches the caller wants.
Passing nullptr
for a no-op submatch is an antipattern. Instead, change the
capturing group to a non-capturing group.
absl::string_view
for submatchesWhen extracting a submatch, use absl::string_view
. Extracting to std::string
necessarily involves a string copy.
The same advice applies even when extracting a submatch with numeric conversion.
Extracting to a numeric type also involves a string copy because strtol(3)
et
al. require NUL-terminated strings. Extracting to absl::string_view
and
calling absl::SimpleAtoi()
et al. avoids the string copy.
The initial example could be rewritten even further as:
static constexpr LazyRE2 kZoneRe = {R"(\.zone(\d+)$)"}; absl::string_view zone_id_str; int zone_id; if (RE2::PartialMatch(zone_name, *kZoneRe, &zone_id_str) && absl::SimpleAtoi(zone_id_str, &zone_id)) {
The “cost” of executing a regular expression depends greatly on ambiguity. See the comments in re2/onepass.cc for details, but the rules can be summarised as:
For example, (.+)/
and (.+?)/
are ambiguous because .
and /
aren’t
disjoint–a '/'
byte could be consumed by either. In contrast, ([^/]+)/
is
unambiguous because [^/]
and /
are disjoint. It isn’t always possible to
eliminate ambiguity, but it is often possible to reduce it.
RE2 will factor common prefixes of alternations. For example, abacus|abalone
parses as aba(?:cus|lone)
. This optimisation permits the regular expression to
be written in a more readable manner. Note that it applies to contiguous runs,
so lexicographically sorting the subexpressions is recommended.
A subexpression of the form x{n}
, x{n,}
or x{n,m}
results in x
being
duplicated at least n
or m
times in the automaton. It’s bad enough
(particularly when using Unicode character classes) for anchored matching, but
it’s dangerous for unanchored matching because it’s easy to blow up the number
of DFA states and their sizes. Each RE2
object has a memory budget (8 MiB by
default) mostly for caching DFA states; exhausting the memory budget incurs a
considerable performance hit.
For example, constructing the entire DFA needs 22 KiB for the anchored
^[a-q][^u-z]{13}x
versus 6 MiB for the unanchored [a-q][^u-z]{13}x
. The
former isn’t useful for searching due to being anchored, but the latter isn’t
efficient due to the exponential blowup in DFA states from being unanchored.
Matching [a-q][^u-z]+x
instead (i.e. using uncounted repetition) may be
possible, but checking the length of the match in a subsequent step might not be
acceptable.
To finish, let’s touch on RE2::Set
and FilteredRE2
. These are two very
different approaches to matching multiple regular expressions, which is useful
in situations where performing multiple passes over the input string would be
prohibitively expensive. Such situations include keyword matching and logs
processing.
RE2::Set
combines the regular expressions into one “many match” automaton. It
has strengths and weaknesses similar to those of normal DFA execution and is
suited to shorter, less complex regular expressions.
FilteredRE2
identifies the literal string “atoms” within the regular
expressions; given which of those “atoms” occur within the input string, it
determines which regular expressions potentially match. It requires an initial
pass over the input string with something like Aho-Corasick or RE2::Set
and is
suited to longer, more complex regular expressions.
You might also like to read about lightgrep, Hyperscan and RE-tree, which were designed specifically for matching multiple regular expressions.