<regex>
: Fix depth-first and leftmost-longest matching rules
#5218
+199
−9
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Resolves #731 by fixing four bugs in the implementation of the ECMAScript (depth-first) and POSIX (leftmost-longest) matching rules.
Bug in ECMAScript matching rule implementation
ECMAScript prescribes depth-first matching. This effectively means: In a disjunction, try left alternative first; for a greedy quantifier, try greatest possible number of repetitions first; for a non-greedy quantifier, try smallest possible number of repetitions first. The matcher is supposed to yield the first match it finds using this strategy.
So why do we get incorrect match results in #731's test cases? Because the matcher deviates from ECMAScript evaluation rules: An optimization has been implemented in
_Matcher::_Do_rep0()
, the main purpose of which seems to be to reduce stack usage. This optimization essentially tries smallest number of repetitions first for greedy quantifiers as well, so the first matches are discovered as if a non-greedy quantifier is used. Consider the test case from #731 (comment) where "AAA BBB" is to be matched with regular expression(A+)\s*(B+)?\s*B*
. If quantifiers are processed smallest-first, then the first discovered full match is the following:(A+)
matches "AAA", first\s*
matches "" (smallest number of repetitions),(\B+)?
matches "", second\s*
matches " " andB*
matches "BBB". This is the wrong result produced byregex_match()
in the test case.However, the implementation of
_Matcher::_Do_rep0()
is somewhat aware that it is matching using the wrong strategy for greedy quantifiers: It doesn't just immediately stop matching when it encounters a first match, but keeps trying to add more repetitions until matching fails. Thus, the matcher will find the correct result as well:(A+)
matches "AAA", first\s*
matches " " (maximal number of repetitions),(\B+)?
matches "BBB" (maximal number of repetitions), second\s*
matches "" andB*
matches "".But when a new match is finally recognized when processing the
_N_end
node in_Matcher::_Match_pat()
, there is no clear way to tell whether this one should have been encountered before during depth-first search. The test in_Matcher::_Better_match()
is certainly wrong for this purpose, since it's implementing leftmost-longest rule (but incorrectly, see below), not the depth-first rule. One needs information about the path taken through the NFA to make this determination correctly, but the required information is buried somewhere in the call stack.This observation suggests the fix: In ECMAScript mode, all recursively called functions in the matcher except
_Matcher::_Do_rep0()
return immediately when a match has been encountered, so the first match is described by_Tgt_state
._Matcher::_Do_rep0()
does not return immediately for greedy quantifiers, since it's processing in smallest-first order, but it keeps track of the value of_Tgt_state
for the greatest number of successfully matching repetitions in the local variable_Final
and finally assigns_Final
to_Tgt_state
before it returns. So this function also ensures that_Tgt_state
describes the first match encountered under depth-first matching rules.So to make this long story short: The fix is that
_Matcher::_Match()
must construct the final result from_Tgt_state
in ECMAScript mode, not_Res
.Since
_Res
is effectively no longer used and_Matcher::_Better_match()
is only relevant for matching according to the leftmost-longest rule, we can also skip the assignment to_Res
and the call of_Matcher::_Better_match()
in_Matcher::_Match_pat()
when depth-first matching is applied.Two bugs in
_Matcher::_Better_match()
's implementation of leftmost-longest matching ruleInterestingly, POSIX regular expressions yield the same wrong result for the (equivalent) regular expression
(A+)[[:space:]]*(B+)?[[:space:]]*B*
: Matching "AAA BBB", the first capture group is "AAA" and the second is unmatched, even though the second capture group should have been "BBB". This is due to bugs in_Matcher::_Better_match()
which were already observed in #1547 (comment):_Better_match()
. After adding this rule, "AAA BBB" gets matched correctly by regular expression(A+)[[:space:]]*(B+)?[[:space:]]*B*
._Better_match()
kind of implemented the opposite rule, "rightmost-longest", because<
was used instead of>
. Matching "aabaac" with(aa|aabaac|ba|b|c)*
exposes the bug: The leftmost-longest match for the first capture group is "aabaac", but "c" is returned without this PR because it is the right-most choice of the first capture group when the whole string is matched. (First repetition matches "aa", second matches "b", third matches "aa", and finally the last one matches "c", which is assigned to the capture group.)Bug in
_Matcher::_Do_rep()
's implementation of leftmost-longest matching ruleWithout this PR,
_Matcher::_Do_rep()
only performs greedy matching when searching for a match under the leftmost-longest rule: When it finds a match with some number of repetitions, it does not check whether a match with fewer repetitions is better (e.g., longer) under the leftmost-longest rule. After this PR,_Matcher::_Do_rep()
now checks any possible number of repetitions.Remarks
MSVC STL's matcher uses the same interpretation of leftmost-longest as Boost (leftmost-longest capture groups, where the capture groups refer to the last matched strings), but the implementation has been quite buggy. This PR fixes the bugs in the implementation, so that the match results now agree with Boost's on the added test cases.
When applying the leftmost-longest rule, we still don't agree with libstdc++ (which sometimes seems to produce results closer to depth-first matching) and libc++ (namely for regular expression
(aa|aabaac|ba|b|c)*
as well as^[[:blank:]]*#([^\\n]*\\\\[[:space:]]+)*[^\\n]*
in awk mode and its sibling in extended mode) on some added test cases. I haven't checked, but there is a good chance that they implement some different interpretation of the leftmost-longest rule. The POSIX standard is surprisingly terse in describing leftmost-longest matching, leaving much room for interpretation.