Allow manual rule definition for complex rules #600

Open
tadman opened this issue May 27, 2022 · 3 comments

Comments

@tadman

tadman commented May 27, 2022

While the .pest grammar format is quite flexible, there are circumstances under which it's incapable of expressing what's required. Writing the rule manually can solve the problem, but it seems like Pest either supports automatic generation for every function, no exceptions, or you must manually define everything.

I'm facing a situation where a single rule out of hundreds cannot be expressed in the grammar.

Allowing for manual function definition in addition to automatic definition would solve this.

For example, imagine a grammar like:

char = _{ '\u{01}'..'\u{7f}' }
number = { ('0'..'9')+ }
literal = { "{" ~ number ~ "}" ~ char* }

Here the parser needs to handle elements like "{4}testX..." by parsing them as ( 4, "test" ), with the X... part not consumed but left for the next element. In order for this to work, number needs to be converted and employed to consume a fixed number of char. This is easily handled with a manual parser.

I'd like to propose an alternative syntax for situations like this:

char = _{ '\u{01}'..'\u{7f}' }
number = { ('0'..'9')+ }
literal = fn

Where that indicates literal is a manually defined function and is called as Self::literal(...) instead.

I've been working on a fork which implements this where it should be able to handle this with some very minor alterations, but would appreciate some feedback and assistance.

@HoloTheDrunk
Contributor

HoloTheDrunk commented Oct 27, 2022

I'm not sure I understand your example. There is no indication of what the rule for splitting the characters that come after the closing bracket should be. Logical inconsistencies, as in this sentence

In order for this to work, number needs to be converted and employed to consume a fixed number of char.

where number comes in despite not containing a call to the char rule, together with the aforementioned lack of information, make it very hard to provide help.

As far as I can tell, what you're trying to do is already easily possible, but I'm not even sure of what you're actually trying to do.

Things that could help:

  • Fixed repetition: char{4} matches precisely 4 instances of char
  • Look-ahead: { (!"}" ~ ANY)* } for example matches anything up until a closing bracket without consuming the closing bracket

Given that you opened this issue quite a long time ago, I hope you've since found the answer; I'm mostly answering this in case other people bump into this issue. Have a nice day :)

@tadman
Author

tadman commented Oct 29, 2022

This came about due to a very bizarre feature of the IMAP specification where a "synchronizing literal" is delimited this way. What's needed is for the sequence {4}ABCDEF... to be parsed as the tokens 4, ABCD with the EF... part not consumed, as it's another sequence. For {2}ABCDEF..., it parses as 2, AB with CDEF... left alone.

The {n} part means the following n octets are part of the token, then the remainder reverts to regular parsing.

I might be understanding Pest incorrectly, but I need to extract the number, convert it to an integer, then step through the string n characters exactly. It's nice you can repeat using a similar notation in a grammar, but this length is unknown until user input is processed. It could be anything. I can't match a precise number because that number is run-time generated.

Additionally, there's no delimiter that can be used to capture the end; it's just a series of arbitrary octets, with no context given other than the length identifier.
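To make the requirement concrete, here is a minimal sketch of the kind of hand-written function this would take. It is plain Rust and does not use any Pest API; the names `Literal` and `parse_literal` are hypothetical, and it follows the simplified `{4}ABCDEF` form used in this thread rather than the full IMAP wire format.

```rust
/// Hypothetical result type: the declared length, the literal's octets,
/// and the unconsumed remainder of the input.
#[derive(Debug, PartialEq)]
struct Literal<'a> {
    len: usize,
    data: &'a [u8],
    rest: &'a [u8],
}

/// Hypothetical hand-written parser for `{n}` followed by exactly n octets.
/// Returns `None` if the input is malformed or shorter than declared.
fn parse_literal(input: &[u8]) -> Option<Literal<'_>> {
    // The literal must start with an opening brace.
    let (&first, after_brace) = input.split_first()?;
    if first != b'{' {
        return None;
    }

    // The length field runs up to the first closing brace.
    let close = after_brace.iter().position(|&b| b == b'}')?;
    let digits = &after_brace[..close];

    // Require at least one character, all of them ASCII digits.
    if digits.is_empty() || !digits.iter().all(u8::is_ascii_digit) {
        return None;
    }

    // Convert the digits to an integer; this length is only known at run time.
    let len: usize = std::str::from_utf8(digits).ok()?.parse().ok()?;

    // Consume exactly `len` octets and leave everything else untouched.
    let body = &after_brace[close + 1..];
    if body.len() < len {
        return None;
    }
    let (data, rest) = body.split_at(len);

    Some(Literal { len, data, rest })
}
```

With this sketch, `parse_literal(b"{4}ABCDEF")` yields a length of 4, the data `ABCD`, and the remainder `EF`, while `parse_literal(b"{2}ABCDEF")` yields 2, `AB`, and `CDEF`. The point is only that the number of octets consumed depends on a value parsed from the input itself.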

Hope that explains better. Even if this was (somehow?) accommodated by Pest in the grammar itself, being able to drop down and implement it in very specific detail still seems like it could be useful from time to time. Right now it seems like you can either hand-assemble your entire grammar, or have it all auto-generated, with no opportunity to selectively switch. This fork allows you to define functions that are mapped into your auto-generated grammar, which I think could be helpful.

@tadman
Author

tadman commented Nov 23, 2024

Is there an updated way to tackle this particular problem? The PR I contributed a while ago probably won't merge in cleanly, but it shows a way to handle out of scope parsing issues without overly complicating things, or so I hope.

This is to cover cases like the above, where the length of the token is specified by parsing and interpreting some of the earlier data. The IMAP protocol has some very strange features like this, which make it impossible to implement in regular Pest.

If there is a way to cover this using built-in features, I'd love to know, as that would save a lot of hassle.
