`one or more` rules consume trailing whitespace when exactly matched once
Given the following grammar:
a = { "a" ~ b+ | a }
b = { "b" }
root = { SOI ~ a* ~ EOI }
Then, parsing the following text:
a b
Will associate the trailing newline after b with the rule for a, instead of the rule for root.
That seems to work properly with zero-or-more expressions, where the whitespace will be associated with the root rule.
Also, it works properly for more than one b. The following example:
a b b
Puts the whitespace correctly inside root.
After looking through the code, I'd say this is because Expr::RepOnce is translated to a sequence.
E.g. for pest, b+ is equal to b ~ b*.
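To illustrate why that desugaring would explain the behaviour, here is a toy model in Rust. It is not pest's actual code: the function names, the whitespace set, and the position-based matcher are all assumptions made for the sketch. It assumes that `~` always skips whitespace between its two operands, while a repetition only keeps skipped whitespace when another item follows.

```rust
// Toy model only: none of these functions exist in pest.

/// Skip implicit whitespace (assumed here to be ' ' and '\n').
fn skip_ws(input: &str, pos: usize) -> usize {
    let bytes = input.as_bytes();
    let mut p = pos;
    while p < bytes.len() && (bytes[p] == b' ' || bytes[p] == b'\n') {
        p += 1;
    }
    p
}

/// Match the literal "b" at `pos`, returning the position after it.
fn match_b(input: &str, pos: usize) -> Option<usize> {
    if input[pos..].starts_with('b') { Some(pos + 1) } else { None }
}

/// `b*`: whitespace between items is only kept if another `b` follows.
fn rep_zero_or_more(input: &str, mut pos: usize) -> usize {
    match match_b(input, pos) {
        Some(next) => pos = next,
        None => return pos,
    }
    loop {
        let after_ws = skip_ws(input, pos);
        match match_b(input, after_ws) {
            Some(next) => pos = next,
            None => return pos, // backtrack over the skipped whitespace
        }
    }
}

/// `b+` desugared to `b ~ b*`: the `~` skips whitespace unconditionally,
/// so it stays consumed even when `b*` then matches nothing.
fn rep_once_desugared(input: &str, pos: usize) -> Option<usize> {
    let after_first = match_b(input, pos)?;
    let after_ws = skip_ws(input, after_first);
    Some(rep_zero_or_more(input, after_ws))
}

/// `b+` handled as a real repetition that requires at least one match.
fn rep_once_direct(input: &str, pos: usize) -> Option<usize> {
    match_b(input, pos)?;
    Some(rep_zero_or_more(input, pos))
}
```

In this model, `rep_once_desugared("b\n", 0)` stops after the newline, while `rep_once_direct("b\n", 0)` stops right after the `b`. For `"b b\n"` both stop before the newline, which matches the observation that two or more b's behave correctly.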
Proposed fix in #397.
Ah, finally. I knew this day was going to come. Unfortunately, this implementation bug is probably something that people rely on in their logic. Fixing it might actually break their code, but I'm happy to give it a try regardless. Maybe no dependent crate actually relies on it.
I have a fix for 3.0, but I haven't had that much time to work on it.
This just bit us. Are there still plans to resolve this? Is there evidence that people rely on this bug?
And lastly, is there a good workaround? The obvious idea would be to strip the whitespace from everything pest returns, but I suspect there is a more elegant way, since otherwise there would probably be more people commenting here.
In our case we are having trouble parsing data in the format "x <- foo" (see this page).
Also, otherwise we are having a great time using pest, thank you!
cc @siccegge @chrisbrzuska
Update:
I found a satisfying workaround, documented here for future visitors. Instead of
identifier = { (ASCII_ALPHA | "_")+ }
I now use
identifier = @{ (ASCII_ALPHA | "_") ~ (ASCII_ALPHA | "_")* }
where @ marks the rule as atomic: no implicit whitespace is allowed inside it and all subrules are muted (which is not important here since there are no inner rules).