Fix better_match check used by regex search routine #1547

cpp11nullptr · 2020-12-22T02:07:50Z

That is an attempt to fix the issue #731 by taking a number of captured groups into consideration during performing better_match check.

StephanTLavavej · 2021-01-09T03:55:22Z

stl/inc/regex

    }
-    return false;
+
+    // more captured groups means a better match


I have the following concerns:

Where is this mandated by ECMAScript? I am wary of adding logic to make this specific scenario work, without knowing exactly what Standardese it corresponds to.

Can we construct test cases that stress this rule (i.e. cases where it activates and causes different choices to be made), and compare the observed behavior of libc++, libstdc++, and browsers?

Does this affect the other grammars, and are they all specified to behave the same way?

In Boost's documentation (which is of course non-Standard, but more comprehensible than ECMAScript/POSIX/etc.) I see that the Perl/ECMAScript behavior is described as "the first match found by a depth-first-search" and "leftmost", while basic/extended behavior is "leftmost longest".

cpp11nullptr · 2021-01-17T12:03:48Z

Thanks for additional input, @StephanTLavavej!

I could not find precise description what behaviour should be in such scenarios but this PR should correct a capturing for cases, such as (A+)\s*(B+)?\s*B*, when for "AAA BBB" target there is firstly found (A+)..B* match with only (A+) group captured with writing to _Res with no way to overwrite it even if (A+)..(B+) match (with two captured groups) found. Perhaps, it is because of implementation details and, as alternative, a logic could be changed to make (A+)..(B+) match found firstly (but it's not so robust).
Good point, I added few more examples:

#include <iostream>
#include <regex>
#include <string>

void find_match(const std::string& title, const std::string& rgstr, const std::string& str)
{
    std::cout << title << ": ";

    const std::regex rg{ rgstr };
    std::smatch match{};
    if (!std::regex_match(str, match, rg))
    {
        std::cout << "No match." << std::endl;

        return;
    }

    for (size_t i{ 1 }; i < match.size(); ++i)
    {
        std::cout << "[" << i << "] = " << match[i] << " ";
    }

    std::cout << std::endl;
}

int main()
{
    using namespace std::string_literals;

    find_match("Case 0"s, R"((A+)\s*(B+)?\s*B*)"s, "AAA BBB"s);
    find_match("Case 1"s, R"((A+)\s*B*)"s, "AAA BBB"s);
    find_match("Case 2"s, R"((A+)\s*B*(B+))"s, "AAA BBB"s);
    find_match("Case 3"s, R"((A+)\s*(B+)?\s*B*(B+))"s, "AAA BBB"s);

    return 0;
}

With following results:

msvc (without fix):

Case 0: [1] = AAA [2] =
Case 1: [1] = AAA
Case 2: [1] = AAA [2] = B
Case 3: [1] = AAA [2] =  [3] = B

msvc (with fix), libc++ HEAD 12.0 (Wandbox), libstdc++ HEAD 11.0.0 (Wandbox) and browser (regex101.com) produce same results:

Case 0: [1] = AAA [2] = BBB
Case 1: [1] = AAA
Case 2: [1] = AAA [2] = B
Case 3: [1] = AAA [2] = BB [3] = B

I don't think that other grammars should be affected by this change as _Better_match() is used only in context when a decision whether new match should replace already found one should be made (after finishing lookup/sub-lookup during processing _N_end node).

Also, based on mentioned The Leftmost Longest Rule it seems that step 6 there covers an extension of capturing with additional groups, if I understand that statement correctly.

strega-nil-ms · 2022-08-23T18:21:48Z

The boost documentation links "leftmost" with "leftmost longest", and so I do believe that ECMAScript also follows the leftmost longest rule.

However, I believe that this is an incorrect check. Leftmost longest does not consider how many groups are matched; only, from front to back, the length of each group and its location. The issue with the current check is not that it doesn't check how many groups are matched - it is that it doesn't consider "non-matching" groups to be worse than matching groups.

As I was working on this, I also noticed a bug in _Better_match() - it considers the rightmost match to be better:

            if (_Res._Grps[_Ix]._Begin != _Tgt_state._Grps[_Ix]._Begin) {
                return _STD distance(_Begin, _Res._Grps[_Ix]._Begin)
                     < _STD distance(_Begin, _Tgt_state._Grps[_Ix]._Begin);
            }

should be

            if (_Res._Grps[_Ix]._Begin != _Tgt_state._Grps[_Ix]._Begin) {
                return _STD distance(_Begin, _Tgt_state._Grps[_Ix]._Begin)
                     < _STD distance(_Begin, _Res._Grps[_Ix]._Begin);
            }

With this bug fixed, we can find a place where the code you've written fails:

regex = /(A+)\s*(B+)?\s*B*(B*)?/
string = AAA BBB

The correct match for this should be |AAA|BBB|-|, however, your code gives |AAA|-|BBB|.

StephanTLavavej

tr1/tests/regex2 is failing:

Standard Output:
--
Testing ECMAScript
Testing UNIX BRE
Testing grep
Testing UNIX ERE
 GOT "2 0 6 0 6" != "2 0 6 5 6"
 FAIL test 1213 at line 662 in C:\a\1\s\tests\tr1\tests\regex2\test.cpp: (aa|aabaac|ba|b|c)* |aabaac|
Testing egrep
 GOT "2 0 6 0 6" != "2 0 6 5 6"
 FAIL test 1557 at line 662 in C:\a\1\s\tests\tr1\tests\regex2\test.cpp: (aa|aabaac|ba|b|c)* |aabaac|
Testing awk
 GOT "2 0 6 0 6" != "2 0 6 5 6"
 FAIL test 1902 at line 662 in C:\a\1\s\tests\tr1\tests\regex2\test.cpp: (aa|aabaac|ba|b|c)* |aabaac|
FINISHED testing <regex>, part 2
***** 3 erroneous test cases in <regex>, part 2 *****
***** 1980 successful test cases in <regex>, part 2 *****
#FAILED: <regex>, part 2
--

ECMAScript uses leftmost longest rules, but UNIX uses rightmost longest rules

stl/inc/regex

StephanTLavavej · 2022-11-18T03:35:11Z

stl/inc/regex

+        }
+        if (_Tgt_grp._Begin == _Tgt_grp._End) {


We conventionally add newlines between "non-chained" if-statements.

StephanTLavavej · 2022-11-18T04:00:12Z

stl/inc/regex

+        if (_Res_grp._Begin != _Tgt_grp._Begin) {
+            ptrdiff_t _Res_dist_from_begin = _STD distance(_Begin, _Res_grp._Begin);
+            ptrdiff_t _Tgt_dist_from_begin = _STD distance(_Begin, _Tgt_grp._Begin);
+            if (_Sflags & regex_constants::ECMAScript) {
+                // If one of the matches starts before the other, in ECMAScript, it is considered better.
+                // This is also known as "leftmost", in languages written from left to right.
+                return _Tgt_dist_from_begin < _Res_dist_from_begin;
+            } else {
+                // Otherwise, Unix uses "rightmost" rules.
+                return _Res_dist_from_begin < _Tgt_dist_from_begin;
            }
        }
+
+        if (_Res_grp._End != _Tgt_grp._End) {
+            // If one of the matches is longer than the other, and they start at the same point,
+            // it is considered better.
+            ptrdiff_t _Res_size = _STD distance(_Res_grp._Begin, _Res_grp._End);
+            ptrdiff_t _Tgt_size = _STD distance(_Tgt_grp._Begin, _Tgt_grp._End);
+            return _Tgt_size > _Res_size;
+        }


I believe this logic is incorrect (although I don't immediately know what all of the problems are).

First, although it is not an authoritative reference, https://en.cppreference.com/w/cpp/regex/syntax_option_type describes POSIX as "leftmost longest". I'm surprised to see "rightmost" mentioned in this implementation. (From what I've read, ECMAScript is not "leftmost longest".) Boost explains the Perl/ECMAScript behavior at https://www.boost.org/doc/libs/1_80_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.what_gets_matched and the POSIX behavior at https://www.boost.org/doc/libs/1_80_0/libs/regex/doc/html/boost_regex/syntax/leftmost_longest_rule.html .

Second, the cppreference article depicts the output:

Searching for .*(a|xayy) in zzxayyzz: ECMA (depth first search) match: zzxa POSIX (leftmost longest) match: zzxayy

However, both main and this PR print:

Searching for .*(a|xayy) in zzxayyzz: ECMA (depth first search) match: zzxayy POSIX (leftmost longest) match: zzxayy

In this example with this PR, the final if-statement is saying "they start at the same point, so prefer the longer one".

(As an aside, this PR's logic is structured and commented in a clear, comprehensible way, which is great! My only concern is matching what the grammars specify.)

hmm, yeah, I apparently misunderstood something as I was cleaning up the logic; I think the existing logic was worse, but yeah; not sure what the best way forward is.

StephanTLavavej · 2022-11-18T04:01:29Z

tests/std/include/test_regex_support.hpp

@@ -116,6 +116,59 @@ class regex_fixture {
        }
    }

+    void should_capture_groups(const std::string& subject, const std::string& pattern,
+        const std::vector<std::string>& groups,


Need to include <vector>.

StephanTLavavej · 2022-11-18T04:06:32Z

tests/std/include/test_regex_support.hpp

+            }
+
+            if (m[0] != subject) {
+                printf(R"(should_capture("%s", "%s"): not a full match)"


(Occurs below) These messages are saying should_capture, but this function is should_capture_groups and takes additional arguments. If we fail, I think all of them should print out the groups, and also the value of flags as an int.

StephanTLavavej · 2022-11-18T04:14:42Z

tests/std/include/test_regex_support.hpp

+            std::string groups_string = "|";
+            for (const auto& el : groups) {
+                groups_string.append(el);
+                groups_string.push_back('|');
+            }


Suggestion: This seems confusing, because it doesn't look like the callsites. | means "alternations" for regexes, but here you're using it with an unofficial meaning to separate capture groups.

Would formatting the groups as {"group1","group2","group3"} be better?

StephanTLavavej · 2022-11-18T04:15:28Z

tests/std/tests/VSO_0000000_regex_use/test.cpp

@@ -546,6 +546,17 @@ void test_VSO_226914_word_boundaries() {
    aWordAny.should_search_fail("aa", match_not_bow | match_not_eow);
 }

+void test_gh_731() {
+    std::vector<std::string> groups{"AAA", "BBB", ""};


This test is already using namespace std;. Occurs below.

StephanTLavavej · 2022-11-18T04:15:56Z

tests/std/tests/VSO_0000000_regex_use/test.cpp

@@ -546,6 +546,17 @@ void test_VSO_226914_word_boundaries() {
    aWordAny.should_search_fail("aa", match_not_bow | match_not_eow);
 }

+void test_gh_731() {
+    std::vector<std::string> groups{"AAA", "BBB", ""};


Should probably include <vector> even if <test_regex_support.hpp> already does so.

StephanTLavavej · 2022-11-18T04:17:50Z

tests/std/tests/VSO_0000000_regex_use/test.cpp

@@ -546,6 +546,17 @@ void test_VSO_226914_word_boundaries() {
    aWordAny.should_search_fail("aa", match_not_bow | match_not_eow);
 }

+void test_gh_731() {


This should be followed by a comment mentioning the bug's title (like test_gh_993() immediately below).

StephanTLavavej · 2022-11-18T04:21:45Z

tests/std/tests/VSO_0000000_regex_use/test.cpp

+    std::vector<std::string> groups{"AAA", "BBB", ""};
+    g_regexTester.should_capture_groups("AAA BBB", R"((A+)\s*(B+)?\s*B*(B*)?)", groups);
+
+    groups.assign({"AAA", "BBB"});
+    g_regexTester.should_capture_groups("AAA BBB", R"((A+)\s*(B+)?\s*B*)", groups);
+
+    groups.assign({"c"});
+    g_regexTester.should_capture_groups("aabaac", "(aa|aabaac|ba|b|c)*", groups, std::regex_constants::extended);


I think we need additional test coverage to exercise _Better_match, especially scenarios where ECMAScript and POSIX Extended accept the same syntax but have different ideas about what should be captured. I think we should also manually verify agreement with Boost/libstdc++/libc++, or understand any remaining divergence. (e.g. I observed MSVC, libstdc++, and libc++ getting 3 sets of answers for the subject "WXconcatenateYZ" with the pattern ".*(cat|concatenate)" https://godbolt.org/z/bza5nrYcM , woof 🐶).

StephanTLavavej · 2022-11-30T22:37:42Z

We talked about this at the weekly maintainer meeting. Although this PR in its current state does fix the original scenario (thanks @strega-nil-ms!), there are remaining scenarios that are unfixed, and we're uncertain about the full extent of the problem. We believe it's highly risky to merge changes that alter runtime behavior (in an old and widely-used component) without fully understanding the problems and having high confidence that the fix is both correct and complete. We'll keep the original bug #731 open, since this is a conformance bug that we'd like to fix someday (probably through a complete overhaul in the vNext timeframe), but at the moment we can't merge this PR as-is.

cpp11nullptr requested a review from a team as a code owner December 22, 2020 02:07

StephanTLavavej added the bug Something isn't working label Jan 5, 2021

StephanTLavavej reviewed Jan 9, 2021

View reviewed changes

StephanTLavavej assigned cbezault Jan 9, 2021

Base automatically changed from master to main January 28, 2021 00:35

StephanTLavavej unassigned cbezault Feb 17, 2021

StephanTLavavej self-assigned this Apr 28, 2021

StephanTLavavej assigned strega-nil-ms and unassigned StephanTLavavej Aug 17, 2022

cpp11nullptr added 3 commits August 23, 2022 08:30

Fix better_match check used by regex search routine

f45ed70

Add unit tests and fix code style

e7bec21

Refer egrep mode with multiline captures

af26bd4

strega-nil-ms force-pushed the fix-regex-better-match-check branch from b565824 to af26bd4 Compare August 23, 2022 15:51

use leftmost longest rule

7cf7f37

StephanTLavavej requested changes Aug 23, 2022

View reviewed changes

oh, oops

de9c799

ECMAScript uses leftmost longest rules, but UNIX uses rightmost longest rules

strega-nil-ms requested a review from StephanTLavavej August 23, 2022 21:09

remove includes

2f37a47

strega-nil-ms removed their assignment Aug 23, 2022

StephanTLavavej self-assigned this Sep 9, 2022

strega-nil-ms reviewed Sep 12, 2022

View reviewed changes

stl/inc/regex Show resolved Hide resolved

StephanTLavavej requested changes Nov 18, 2022

View reviewed changes

StephanTLavavej removed their assignment Nov 18, 2022

StephanTLavavej closed this Nov 30, 2022

muellerj2 mentioned this pull request Jan 2, 2025

<regex>: Fix depth-first and leftmost-longest matching rules #5218

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix better_match check used by regex search routine #1547

Fix better_match check used by regex search routine #1547

cpp11nullptr commented Dec 22, 2020

StephanTLavavej Jan 9, 2021

cpp11nullptr commented Jan 17, 2021 •

edited

Loading

strega-nil-ms commented Aug 23, 2022

StephanTLavavej left a comment

StephanTLavavej Nov 18, 2022

StephanTLavavej Nov 18, 2022

strega-nil-ms Nov 21, 2022

StephanTLavavej Nov 18, 2022

StephanTLavavej Nov 18, 2022

StephanTLavavej Nov 18, 2022

StephanTLavavej Nov 18, 2022

StephanTLavavej Nov 18, 2022

StephanTLavavej Nov 18, 2022

StephanTLavavej Nov 18, 2022

StephanTLavavej commented Nov 30, 2022

Fix better_match check used by regex search routine #1547

Fix better_match check used by regex search routine #1547

Conversation

cpp11nullptr commented Dec 22, 2020

Choose a reason for hiding this comment

cpp11nullptr commented Jan 17, 2021 • edited Loading

strega-nil-ms commented Aug 23, 2022

StephanTLavavej left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StephanTLavavej commented Nov 30, 2022

cpp11nullptr commented Jan 17, 2021 •

edited

Loading