-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<regex>
: Revise caret parsing in basic and grep mode
#5165
base: main
Are you sure you want to change the base?
<regex>
: Revise caret parsing in basic and grep mode
#5165
Conversation
<regex>
: Circumflex ^ should negate character classes in POSIX basic regexes<regex>
: Circumflex ^ should negate character classes in basic regular expressions
<regex>
: Circumflex ^ should negate character classes in basic regular expressions<regex>
: Caret ^
should negate character classes in basic regular expressions
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
Thanks @muellerj2! 😻 After looking at this, I believe I found a much simpler way to fix this. Please double-check and meow if you think this is worse than your original fix. |
@StephanTLavavej I think your fix is a suitable solution for But because of your remarks I noticed that Edit: Filed llvm/llvm-project#119432 about the mislanding of |
This comment was marked as outdated.
This comment was marked as outdated.
This is introducing a behavior change that's causing failures in the internal
#include <print>
#include <regex>
using namespace std;
void test(const char* const pattern, const char* const str) {
const regex r{pattern, regex::basic};
println(" pattern: {}", pattern);
println(" str: {}", str);
println("regex_match: {}", regex_match(str, r));
}
int main() {
test(R"(0\(^a\))", "0^a");
}
Before:
After:
|
Something very strange is going on here: https://godbolt.org/z/GsfWEfbhE All standard libraries disagree whether I think this PR is correct, it just exposed some other bug in |
I think I have grasped the underlying issue now, but I need maintainer direction to proceed because there are three possible solutions with different pros and cons.
Given that the POSIX standard does not mandate how a caret at the beginning of a subexpression is to be handled, we have three options how to proceed:
Which option do you consider appropriate? I'm willing to make the changes for any of them. (Even if we go for option 2, we should add tests to make sure we don't regress this behavior in the future. And I think there is still a minor bug in the handling of carets at the beginning of subexpressions, but it can be fixed by changing a single line.) |
I suppose our current behavior is the result of someone thinking that caret should be an anchor when it could possibly match at the beginning of a string (e.g., How does Boost handle caret at the beginning of a subexpression? If Boost and libc++ both agree, it seems like a no-brainer for us to be consistent. Even if they don't, I feel like expressions are more likely to migrate between our implementation and Boost's than ours and another standard library's. Consistency with Boost - which we've been recommending over our ABI-locked-in-a-very-poor-state implementation for a while - is probably the way to go. I think I've convinced myself enough to remove the
decision needed
|
How bad could it be? (Does a quick search of Perennial.) SIXTY occurrences??!? Oh my. |
I agree with Casey that we should figure out what Boost does and if it's moderately sensible, align with it. |
Boost treats carets as anchors. But it seems both libstdc++ and Boost treat carets (and dollar signs) as anchors everywhere in a basic regular expression, contrary to the POSIX standard. Example: https://godbolt.org/z/b1q5Pefqn I think we should comply with the POSIX standard when users are explicitly asking for POSIX regex rules, so I believe MSVC STL shouldn't treat carets as anchors everywhere like libstdc++ and Boost do. But we come closest to Boost's behavior while being standard-compliant if carets are at least interpreted as anchors at the beginning of subexpressions. So I suggest we implement this behavior (and do the same for dollar signs in a follow-up). After STL's changes, this behavior is already implemented by this PR anyway, except for a (pre-existing) miscompilation of double carets. But it's easy to fix this miscompilation. |
…e beginning of alternatives bonus: eliminates order-dependency between lexer tokenization and NFA additions
…g of expressions, subexpressions and grep-mode newline alternatives
<regex>
: Caret ^
should negate character classes in basic regular expressions<regex>
: Revise caret parsing in basic and grep mode
I implemented the necessary changes to consistently treat carets as anchors at the beginning of subexpressions. I also eliminated that weird and brittle dependency of the lexer and the parser on the NFA that they used to guess the parsing context of the caret (and the star). This dependency was central for the behavior change we failed to notice. Now lexer and parser operations no longer depend on how they are sequenced compared to operations performed on the NFA. I repaired the double caret miscompilation, too. Finally, I noticed that the regex parser fails to treat carets after new lines as anchors in grep mode, so I fixed this as well. I adjusted the title of the PR a bit because it includes much more now than just a fix for the negation of character classes. |
+100 yes, our intent is that we should first conform to the spec, and only where the spec leaves things up to interpretation should we try to do as Boost does. |
This PR revises how carets are parsed in regular expressions in basic and grep mode. Specifically:
[^a]
should be the character class that does not contain a, but it is incorrectly parsed as the class that contains a and ^. This PR fixes this.As a bonus,
_Parser::_Next()
and_Parser::_Trans()
no longer try to guess the parsing context from the state of the NFA to adjust the tokenization to the current context. This eliminates some brittle coupling between lexer, parser and NFA builder, such as the order dependency between NFA changes and tokenization results. This order dependency was also the subtle cause of the unexpected behavior change that previously led to this PR being dropped from a merge batch.Correctness of recognizing caret anchor positions in basic regular expressions and in grep mode
The POSIX standard says that carets are to be treated as anchors at the beginning of a regular expression and may be treated so at the beginning of any subexpressions (which this PR opts to do in accordance with libstdc++ and Boost). But strictly speaking, this PR seems to actually implement something subtly different: The changed parser interprets carets as anchors when they appear at the beginning of an alternative in a disjunction. However, it turns out that this is actually equivalent:
pattern_list
argument in the POSIX standard), so a caret at the beginning of a line is to be treated as an anchor. Thus, carets at the beginning of a second, third or n-th alternative are anchors in grep mode. Hence, carets at the beginnings of any alternative should be treated as anchors in grep mode, which is what this PR implements.The preceding argument essentially also applies to the treatment of initial stars as ordinary characters, but with the caveat that an optional caret anchor is allowed between the beginning of the alternative/expression/subexpression and the star.