-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix better_match check used by regex search routine #1547
Fix better_match check used by regex search routine #1547
Conversation
stl/inc/regex
Outdated
} | ||
return false; | ||
|
||
// more captured groups means a better match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the following concerns:
- Where is this mandated by ECMAScript? I am wary of adding logic to make this specific scenario work, without knowing exactly what Standardese it corresponds to.
- Can we construct test cases that stress this rule (i.e. cases where it activates and causes different choices to be made), and compare the observed behavior of libc++, libstdc++, and browsers?
- Does this affect the other grammars, and are they all specified to behave the same way?
In Boost's documentation (which is of course non-Standard, but more comprehensible than ECMAScript/POSIX/etc.) I see that the Perl/ECMAScript behavior is described as "the first match found by a depth-first-search" and "leftmost", while basic/extended behavior is "leftmost longest".
Thanks for additional input, @StephanTLavavej!
With following results: msvc (without fix):
msvc (with fix), libc++ HEAD 12.0 (Wandbox), libstdc++ HEAD 11.0.0 (Wandbox) and browser (regex101.com) produce same results:
Also, based on mentioned The Leftmost Longest Rule it seems that step 6 there covers an extension of capturing with additional groups, if I understand that statement correctly. |
b565824
to
af26bd4
Compare
The boost documentation links "leftmost" with "leftmost longest", and so I do believe that ECMAScript also follows the leftmost longest rule. However, I believe that this is an incorrect check. Leftmost longest does not consider how many groups are matched; only, from front to back, the length of each group and its location. The issue with the current check is not that it doesn't check how many groups are matched - it is that it doesn't consider "non-matching" groups to be worse than matching groups. As I was working on this, I also noticed a bug in if (_Res._Grps[_Ix]._Begin != _Tgt_state._Grps[_Ix]._Begin) {
return _STD distance(_Begin, _Res._Grps[_Ix]._Begin)
< _STD distance(_Begin, _Tgt_state._Grps[_Ix]._Begin);
} should be if (_Res._Grps[_Ix]._Begin != _Tgt_state._Grps[_Ix]._Begin) {
return _STD distance(_Begin, _Tgt_state._Grps[_Ix]._Begin)
< _STD distance(_Begin, _Res._Grps[_Ix]._Begin);
} With this bug fixed, we can find a place where the code you've written fails: regex = The correct match for this should be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tr1/tests/regex2
is failing:
Standard Output:
--
Testing ECMAScript
Testing UNIX BRE
Testing grep
Testing UNIX ERE
GOT "2 0 6 0 6" != "2 0 6 5 6"
FAIL test 1213 at line 662 in C:\a\1\s\tests\tr1\tests\regex2\test.cpp: (aa|aabaac|ba|b|c)* |aabaac|
Testing egrep
GOT "2 0 6 0 6" != "2 0 6 5 6"
FAIL test 1557 at line 662 in C:\a\1\s\tests\tr1\tests\regex2\test.cpp: (aa|aabaac|ba|b|c)* |aabaac|
Testing awk
GOT "2 0 6 0 6" != "2 0 6 5 6"
FAIL test 1902 at line 662 in C:\a\1\s\tests\tr1\tests\regex2\test.cpp: (aa|aabaac|ba|b|c)* |aabaac|
FINISHED testing <regex>, part 2
***** 3 erroneous test cases in <regex>, part 2 *****
***** 1980 successful test cases in <regex>, part 2 *****
#FAILED: <regex>, part 2
--
} | ||
if (_Tgt_grp._Begin == _Tgt_grp._End) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We conventionally add newlines between "non-chained" if
-statements.
if (_Res_grp._Begin != _Tgt_grp._Begin) { | ||
ptrdiff_t _Res_dist_from_begin = _STD distance(_Begin, _Res_grp._Begin); | ||
ptrdiff_t _Tgt_dist_from_begin = _STD distance(_Begin, _Tgt_grp._Begin); | ||
if (_Sflags & regex_constants::ECMAScript) { | ||
// If one of the matches starts before the other, in ECMAScript, it is considered better. | ||
// This is also known as "leftmost", in languages written from left to right. | ||
return _Tgt_dist_from_begin < _Res_dist_from_begin; | ||
} else { | ||
// Otherwise, Unix uses "rightmost" rules. | ||
return _Res_dist_from_begin < _Tgt_dist_from_begin; | ||
} | ||
} | ||
|
||
if (_Res_grp._End != _Tgt_grp._End) { | ||
// If one of the matches is longer than the other, and they start at the same point, | ||
// it is considered better. | ||
ptrdiff_t _Res_size = _STD distance(_Res_grp._Begin, _Res_grp._End); | ||
ptrdiff_t _Tgt_size = _STD distance(_Tgt_grp._Begin, _Tgt_grp._End); | ||
return _Tgt_size > _Res_size; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this logic is incorrect (although I don't immediately know what all of the problems are).
First, although it is not an authoritative reference, https://en.cppreference.com/w/cpp/regex/syntax_option_type describes POSIX as "leftmost longest". I'm surprised to see "rightmost" mentioned in this implementation. (From what I've read, ECMAScript is not "leftmost longest".) Boost explains the Perl/ECMAScript behavior at https://www.boost.org/doc/libs/1_80_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.what_gets_matched and the POSIX behavior at https://www.boost.org/doc/libs/1_80_0/libs/regex/doc/html/boost_regex/syntax/leftmost_longest_rule.html .
Second, the cppreference article depicts the output:
Searching for .*(a|xayy) in zzxayyzz:
ECMA (depth first search) match: zzxa
POSIX (leftmost longest) match: zzxayy
However, both main
and this PR print:
Searching for .*(a|xayy) in zzxayyzz:
ECMA (depth first search) match: zzxayy
POSIX (leftmost longest) match: zzxayy
In this example with this PR, the final if
-statement is saying "they start at the same point, so prefer the longer one".
(As an aside, this PR's logic is structured and commented in a clear, comprehensible way, which is great! My only concern is matching what the grammars specify.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, yeah, I apparently misunderstood something as I was cleaning up the logic; I think the existing logic was worse, but yeah; not sure what the best way forward is.
@@ -116,6 +116,59 @@ class regex_fixture { | |||
} | |||
} | |||
|
|||
void should_capture_groups(const std::string& subject, const std::string& pattern, | |||
const std::vector<std::string>& groups, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to include <vector>
.
} | ||
|
||
if (m[0] != subject) { | ||
printf(R"(should_capture("%s", "%s"): not a full match)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Occurs below) These messages are saying should_capture
, but this function is should_capture_groups
and takes additional arguments. If we fail, I think all of them should print out the groups, and also the value of flags
as an int
.
std::string groups_string = "|"; | ||
for (const auto& el : groups) { | ||
groups_string.append(el); | ||
groups_string.push_back('|'); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: This seems confusing, because it doesn't look like the callsites. |
means "alternations" for regexes, but here you're using it with an unofficial meaning to separate capture groups.
Would formatting the groups as {"group1","group2","group3"}
be better?
@@ -546,6 +546,17 @@ void test_VSO_226914_word_boundaries() { | |||
aWordAny.should_search_fail("aa", match_not_bow | match_not_eow); | |||
} | |||
|
|||
void test_gh_731() { | |||
std::vector<std::string> groups{"AAA", "BBB", ""}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is already using namespace std;
. Occurs below.
@@ -546,6 +546,17 @@ void test_VSO_226914_word_boundaries() { | |||
aWordAny.should_search_fail("aa", match_not_bow | match_not_eow); | |||
} | |||
|
|||
void test_gh_731() { | |||
std::vector<std::string> groups{"AAA", "BBB", ""}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably include <vector>
even if <test_regex_support.hpp>
already does so.
@@ -546,6 +546,17 @@ void test_VSO_226914_word_boundaries() { | |||
aWordAny.should_search_fail("aa", match_not_bow | match_not_eow); | |||
} | |||
|
|||
void test_gh_731() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be followed by a comment mentioning the bug's title (like test_gh_993()
immediately below).
std::vector<std::string> groups{"AAA", "BBB", ""}; | ||
g_regexTester.should_capture_groups("AAA BBB", R"((A+)\s*(B+)?\s*B*(B*)?)", groups); | ||
|
||
groups.assign({"AAA", "BBB"}); | ||
g_regexTester.should_capture_groups("AAA BBB", R"((A+)\s*(B+)?\s*B*)", groups); | ||
|
||
groups.assign({"c"}); | ||
g_regexTester.should_capture_groups("aabaac", "(aa|aabaac|ba|b|c)*", groups, std::regex_constants::extended); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need additional test coverage to exercise _Better_match
, especially scenarios where ECMAScript and POSIX Extended accept the same syntax but have different ideas about what should be captured. I think we should also manually verify agreement with Boost/libstdc++/libc++, or understand any remaining divergence. (e.g. I observed MSVC, libstdc++, and libc++ getting 3 sets of answers for the subject "WXconcatenateYZ"
with the pattern ".*(cat|concatenate)"
https://godbolt.org/z/bza5nrYcM , woof 🐶).
We talked about this at the weekly maintainer meeting. Although this PR in its current state does fix the original scenario (thanks @strega-nil-ms!), there are remaining scenarios that are unfixed, and we're uncertain about the full extent of the problem. We believe it's highly risky to merge changes that alter runtime behavior (in an old and widely-used component) without fully understanding the problems and having high confidence that the fix is both correct and complete. We'll keep the original bug #731 open, since this is a conformance bug that we'd like to fix someday (probably through a complete overhaul in the vNext timeframe), but at the moment we can't merge this PR as-is. |
That is an attempt to fix the issue #731 by taking a number of captured groups into consideration during performing better_match check.