Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix better_match check used by regex search routine #1547

Closed

Conversation

cpp11nullptr
Copy link
Member

That is an attempt to fix the issue #731 by taking a number of captured groups into consideration during performing better_match check.

@cpp11nullptr cpp11nullptr requested a review from a team as a code owner December 22, 2020 02:07
@StephanTLavavej StephanTLavavej added the bug Something isn't working label Jan 5, 2021
stl/inc/regex Outdated
}
return false;

// more captured groups means a better match
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the following concerns:

  1. Where is this mandated by ECMAScript? I am wary of adding logic to make this specific scenario work, without knowing exactly what Standardese it corresponds to.
  2. Can we construct test cases that stress this rule (i.e. cases where it activates and causes different choices to be made), and compare the observed behavior of libc++, libstdc++, and browsers?
  3. Does this affect the other grammars, and are they all specified to behave the same way?

In Boost's documentation (which is of course non-Standard, but more comprehensible than ECMAScript/POSIX/etc.) I see that the Perl/ECMAScript behavior is described as "the first match found by a depth-first-search" and "leftmost", while basic/extended behavior is "leftmost longest".

@cpp11nullptr
Copy link
Member Author

cpp11nullptr commented Jan 17, 2021

Thanks for additional input, @StephanTLavavej!

  1. I could not find precise description what behaviour should be in such scenarios but this PR should correct a capturing for cases, such as (A+)\s*(B+)?\s*B*, when for "AAA BBB" target there is firstly found (A+)..B* match with only (A+) group captured with writing to _Res with no way to overwrite it even if (A+)..(B+) match (with two captured groups) found. Perhaps, it is because of implementation details and, as alternative, a logic could be changed to make (A+)..(B+) match found firstly (but it's not so robust).
  2. Good point, I added few more examples:
#include <iostream>
#include <regex>
#include <string>

void find_match(const std::string& title, const std::string& rgstr, const std::string& str)
{
    std::cout << title << ": ";

    const std::regex rg{ rgstr };
    std::smatch match{};
    if (!std::regex_match(str, match, rg))
    {
        std::cout << "No match." << std::endl;

        return;
    }

    for (size_t i{ 1 }; i < match.size(); ++i)
    {
        std::cout << "[" << i << "] = " << match[i] << " ";
    }

    std::cout << std::endl;
}

int main()
{
    using namespace std::string_literals;

    find_match("Case 0"s, R"((A+)\s*(B+)?\s*B*)"s, "AAA BBB"s);
    find_match("Case 1"s, R"((A+)\s*B*)"s, "AAA BBB"s);
    find_match("Case 2"s, R"((A+)\s*B*(B+))"s, "AAA BBB"s);
    find_match("Case 3"s, R"((A+)\s*(B+)?\s*B*(B+))"s, "AAA BBB"s);

    return 0;
}

With following results:

msvc (without fix):

Case 0: [1] = AAA [2] =
Case 1: [1] = AAA
Case 2: [1] = AAA [2] = B
Case 3: [1] = AAA [2] =  [3] = B

msvc (with fix), libc++ HEAD 12.0 (Wandbox), libstdc++ HEAD 11.0.0 (Wandbox) and browser (regex101.com) produce same results:

Case 0: [1] = AAA [2] = BBB
Case 1: [1] = AAA
Case 2: [1] = AAA [2] = B
Case 3: [1] = AAA [2] = BB [3] = B
  1. I don't think that other grammars should be affected by this change as _Better_match() is used only in context when a decision whether new match should replace already found one should be made (after finishing lookup/sub-lookup during processing _N_end node).

Also, based on mentioned The Leftmost Longest Rule it seems that step 6 there covers an extension of capturing with additional groups, if I understand that statement correctly.

Base automatically changed from master to main January 28, 2021 00:35
@StephanTLavavej StephanTLavavej self-assigned this Apr 28, 2021
@strega-nil-ms strega-nil-ms force-pushed the fix-regex-better-match-check branch from b565824 to af26bd4 Compare August 23, 2022 15:51
@strega-nil-ms
Copy link
Contributor

The boost documentation links "leftmost" with "leftmost longest", and so I do believe that ECMAScript also follows the leftmost longest rule.

However, I believe that this is an incorrect check. Leftmost longest does not consider how many groups are matched; only, from front to back, the length of each group and its location. The issue with the current check is not that it doesn't check how many groups are matched - it is that it doesn't consider "non-matching" groups to be worse than matching groups.

As I was working on this, I also noticed a bug in _Better_match() - it considers the rightmost match to be better:

            if (_Res._Grps[_Ix]._Begin != _Tgt_state._Grps[_Ix]._Begin) {
                return _STD distance(_Begin, _Res._Grps[_Ix]._Begin)
                     < _STD distance(_Begin, _Tgt_state._Grps[_Ix]._Begin);
            }

should be

            if (_Res._Grps[_Ix]._Begin != _Tgt_state._Grps[_Ix]._Begin) {
                return _STD distance(_Begin, _Tgt_state._Grps[_Ix]._Begin)
                     < _STD distance(_Begin, _Res._Grps[_Ix]._Begin);
            }

With this bug fixed, we can find a place where the code you've written fails:

regex = /(A+)\s*(B+)?\s*B*(B*)?/
string = AAA BBB

The correct match for this should be |AAA|BBB|-|, however, your code gives |AAA|-|BBB|.

Copy link
Member

@StephanTLavavej StephanTLavavej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tr1/tests/regex2 is failing:
Standard Output:
--
Testing ECMAScript
Testing UNIX BRE
Testing grep
Testing UNIX ERE
 GOT "2 0 6 0 6" != "2 0 6 5 6"
 FAIL test 1213 at line 662 in C:\a\1\s\tests\tr1\tests\regex2\test.cpp: (aa|aabaac|ba|b|c)* |aabaac|
Testing egrep
 GOT "2 0 6 0 6" != "2 0 6 5 6"
 FAIL test 1557 at line 662 in C:\a\1\s\tests\tr1\tests\regex2\test.cpp: (aa|aabaac|ba|b|c)* |aabaac|
Testing awk
 GOT "2 0 6 0 6" != "2 0 6 5 6"
 FAIL test 1902 at line 662 in C:\a\1\s\tests\tr1\tests\regex2\test.cpp: (aa|aabaac|ba|b|c)* |aabaac|
FINISHED testing <regex>, part 2
***** 3 erroneous test cases in <regex>, part 2 *****
***** 1980 successful test cases in <regex>, part 2 *****
#FAILED: <regex>, part 2
--

ECMAScript uses leftmost longest rules,
but UNIX uses rightmost longest rules
@strega-nil-ms strega-nil-ms removed their assignment Aug 23, 2022
@StephanTLavavej StephanTLavavej self-assigned this Sep 9, 2022
stl/inc/regex Show resolved Hide resolved
Comment on lines +3413 to +3414
}
if (_Tgt_grp._Begin == _Tgt_grp._End) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We conventionally add newlines between "non-chained" if-statements.

Comment on lines +3420 to +3439
if (_Res_grp._Begin != _Tgt_grp._Begin) {
ptrdiff_t _Res_dist_from_begin = _STD distance(_Begin, _Res_grp._Begin);
ptrdiff_t _Tgt_dist_from_begin = _STD distance(_Begin, _Tgt_grp._Begin);
if (_Sflags & regex_constants::ECMAScript) {
// If one of the matches starts before the other, in ECMAScript, it is considered better.
// This is also known as "leftmost", in languages written from left to right.
return _Tgt_dist_from_begin < _Res_dist_from_begin;
} else {
// Otherwise, Unix uses "rightmost" rules.
return _Res_dist_from_begin < _Tgt_dist_from_begin;
}
}

if (_Res_grp._End != _Tgt_grp._End) {
// If one of the matches is longer than the other, and they start at the same point,
// it is considered better.
ptrdiff_t _Res_size = _STD distance(_Res_grp._Begin, _Res_grp._End);
ptrdiff_t _Tgt_size = _STD distance(_Tgt_grp._Begin, _Tgt_grp._End);
return _Tgt_size > _Res_size;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this logic is incorrect (although I don't immediately know what all of the problems are).

First, although it is not an authoritative reference, https://en.cppreference.com/w/cpp/regex/syntax_option_type describes POSIX as "leftmost longest". I'm surprised to see "rightmost" mentioned in this implementation. (From what I've read, ECMAScript is not "leftmost longest".) Boost explains the Perl/ECMAScript behavior at https://www.boost.org/doc/libs/1_80_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.what_gets_matched and the POSIX behavior at https://www.boost.org/doc/libs/1_80_0/libs/regex/doc/html/boost_regex/syntax/leftmost_longest_rule.html .

Second, the cppreference article depicts the output:

Searching for .*(a|xayy) in zzxayyzz:
 ECMA (depth first search) match: zzxa
 POSIX (leftmost longest)  match: zzxayy

However, both main and this PR print:

Searching for .*(a|xayy) in zzxayyzz:
 ECMA (depth first search) match: zzxayy
 POSIX (leftmost longest)  match: zzxayy

In this example with this PR, the final if-statement is saying "they start at the same point, so prefer the longer one".

(As an aside, this PR's logic is structured and commented in a clear, comprehensible way, which is great! My only concern is matching what the grammars specify.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yeah, I apparently misunderstood something as I was cleaning up the logic; I think the existing logic was worse, but yeah; not sure what the best way forward is.

@@ -116,6 +116,59 @@ class regex_fixture {
}
}

void should_capture_groups(const std::string& subject, const std::string& pattern,
const std::vector<std::string>& groups,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to include <vector>.

}

if (m[0] != subject) {
printf(R"(should_capture("%s", "%s"): not a full match)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Occurs below) These messages are saying should_capture, but this function is should_capture_groups and takes additional arguments. If we fail, I think all of them should print out the groups, and also the value of flags as an int.

Comment on lines +159 to +163
std::string groups_string = "|";
for (const auto& el : groups) {
groups_string.append(el);
groups_string.push_back('|');
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This seems confusing, because it doesn't look like the callsites. | means "alternations" for regexes, but here you're using it with an unofficial meaning to separate capture groups.

Would formatting the groups as {"group1","group2","group3"} be better?

@@ -546,6 +546,17 @@ void test_VSO_226914_word_boundaries() {
aWordAny.should_search_fail("aa", match_not_bow | match_not_eow);
}

void test_gh_731() {
std::vector<std::string> groups{"AAA", "BBB", ""};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is already using namespace std;. Occurs below.

@@ -546,6 +546,17 @@ void test_VSO_226914_word_boundaries() {
aWordAny.should_search_fail("aa", match_not_bow | match_not_eow);
}

void test_gh_731() {
std::vector<std::string> groups{"AAA", "BBB", ""};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably include <vector> even if <test_regex_support.hpp> already does so.

@@ -546,6 +546,17 @@ void test_VSO_226914_word_boundaries() {
aWordAny.should_search_fail("aa", match_not_bow | match_not_eow);
}

void test_gh_731() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be followed by a comment mentioning the bug's title (like test_gh_993() immediately below).

Comment on lines +550 to +557
std::vector<std::string> groups{"AAA", "BBB", ""};
g_regexTester.should_capture_groups("AAA BBB", R"((A+)\s*(B+)?\s*B*(B*)?)", groups);

groups.assign({"AAA", "BBB"});
g_regexTester.should_capture_groups("AAA BBB", R"((A+)\s*(B+)?\s*B*)", groups);

groups.assign({"c"});
g_regexTester.should_capture_groups("aabaac", "(aa|aabaac|ba|b|c)*", groups, std::regex_constants::extended);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need additional test coverage to exercise _Better_match, especially scenarios where ECMAScript and POSIX Extended accept the same syntax but have different ideas about what should be captured. I think we should also manually verify agreement with Boost/libstdc++/libc++, or understand any remaining divergence. (e.g. I observed MSVC, libstdc++, and libc++ getting 3 sets of answers for the subject "WXconcatenateYZ" with the pattern ".*(cat|concatenate)" https://godbolt.org/z/bza5nrYcM , woof 🐶).

@StephanTLavavej StephanTLavavej removed their assignment Nov 18, 2022
@StephanTLavavej
Copy link
Member

We talked about this at the weekly maintainer meeting. Although this PR in its current state does fix the original scenario (thanks @strega-nil-ms!), there are remaining scenarios that are unfixed, and we're uncertain about the full extent of the problem. We believe it's highly risky to merge changes that alter runtime behavior (in an old and widely-used component) without fully understanding the problems and having high confidence that the fix is both correct and complete. We'll keep the original bug #731 open, since this is a conformance bug that we'd like to fix someday (probably through a complete overhaul in the vNext timeframe), but at the moment we can't merge this PR as-is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants