
`one or more` rules consume trailing whitespace when exactly matched once

Open ejoebstl opened this issue 6 years ago • 4 comments

Given the following grammar:

a = { "a" ~ b+ | a }
b = { "b" }
root = { SOI ~ a* ~ EOI }

Then, parsing the following text:

a b

Will associate the trailing newline after b with the rule for a, instead of the rule for root.

That seems to work properly with zero-or-more expressions, where the whitespace will be associated with the root rule.

Also, it works properly for more than one b. The following example:

a b b

Puts the whitespace correctly inside root.

ejoebstl, Jun 15 '19 11:06

After looking through the code, I'd say this is because Expr::RepOnce is translated to a sequence.

E.g. for pest, b+ is equal to b ~ b*.
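To make the effect of that translation concrete, here is a minimal hand-rolled sketch in Rust. It does not use pest and is not pest's generated code; `skip_ws`, `match_b`, and the two `rep_once_*` functions are hypothetical names, and the model assumes that `~` skips implicit whitespace unconditionally while a repetition only keeps skipped whitespace if another item follows.

```rust
/// Skip implicit whitespace (spaces and newlines) starting at `pos`.
fn skip_ws(input: &[u8], mut pos: usize) -> usize {
    while pos < input.len() && (input[pos] == b' ' || input[pos] == b'\n') {
        pos += 1;
    }
    pos
}

/// Match the literal "b" at `pos`, returning the position after it.
fn match_b(input: &[u8], pos: usize) -> Option<usize> {
    if input.get(pos) == Some(&b'b') {
        Some(pos + 1)
    } else {
        None
    }
}

/// Model of `b*`: whitespace between items is kept only if another b follows.
fn rep_zero_or_more(input: &[u8], pos: usize) -> usize {
    let mut end = match match_b(input, pos) {
        Some(p) => p,
        None => return pos,
    };
    while let Some(p) = match_b(input, skip_ws(input, end)) {
        end = p;
    }
    end
}

/// Model of `b+` translated to the sequence `b ~ b*` (assumed behavior):
/// the `~` skips whitespace before `b*` is tried, and `b*` may match nothing.
fn rep_once_as_sequence(input: &[u8], pos: usize) -> Option<usize> {
    let after_b = match_b(input, pos)?;
    let after_ws = skip_ws(input, after_b);
    Some(rep_zero_or_more(input, after_ws))
}

/// Model of `b+` treated as a repetition in its own right:
/// trailing whitespace is left for the enclosing rule.
fn rep_once_direct(input: &[u8], pos: usize) -> Option<usize> {
    let mut end = match_b(input, pos)?;
    while let Some(p) = match_b(input, skip_ws(input, end)) {
        end = p;
    }
    Some(end)
}

fn main() {
    for text in ["b\n", "b b\n"].iter() {
        let bytes = text.as_bytes();
        println!(
            "{:?}: as sequence ends at {:?}, direct ends at {:?}",
            text,
            rep_once_as_sequence(bytes, 0),
            rep_once_direct(bytes, 0)
        );
    }
}
```

In this model, a single b followed by a newline ends at offset 2 under the sequence translation (the newline is swallowed) but at offset 1 under the direct repetition. With two b's, both variants end at offset 3, which matches the observation that the problem only appears when b is matched exactly once.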

ejoebstl, Jun 15 '19 12:06

Proposed fix in #397.

ejoebstl, Jun 15 '19 12:06

Ah, finally. I knew this day was going to come. Unfortunately, this implementation bug is probably something that people make use of in their logic. Fixing this might actually break them, but I'm happy to give it a try regardless; maybe there's no actual dependent that uses it.

I have a fix for 3.0, but I haven't had that much time to work on it.

dragostis, Jun 15 '19 14:06

This just bit us. Are there still plans to resolve this? Is there evidence that people rely on this bug?

And lastly, is there a good workaround? The obvious idea would be to strip the white space off of everything pest returns, but I suspect there is a more elegant way, since otherwise there probably would be more people commenting here.

In our case we are having trouble parsing data in the format "x <- foo" (see this page).

Also, otherwise we are having a great time using pest, thank you!

cc @siccegge @chrisbrzuska

Update:

I found a satisfying workaround, documented here for future visitors. Instead of

identifier = { (ASCII_ALPHA | "_")+ }

I now use

identifier = @{ (ASCII_ALPHA | "_") ~ (ASCII_ALPHA | "_")* }

where @ means that no internal white space is allowed and all subrules are muted (which is not important here since there are no internal rules).
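As a rough illustration of why the atomic form leaves the whitespace alone, here is a small hand-written Rust model of that rule. It is not pest's generated code; `is_ident_char` and `identifier_atomic` are hypothetical names, and the only assumption is that inside an atomic rule no implicit whitespace is skipped between the first character and the repetition.

```rust
/// Character class from the workaround: ASCII_ALPHA | "_".
fn is_ident_char(c: u8) -> bool {
    c.is_ascii_alphabetic() || c == b'_'
}

/// Model of `@{ (ASCII_ALPHA | "_") ~ (ASCII_ALPHA | "_")* }`.
/// No whitespace is skipped anywhere, so the match ends directly
/// after the last identifier character.
fn identifier_atomic(input: &[u8], pos: usize) -> Option<usize> {
    // The first character is mandatory.
    if !input.get(pos).copied().map_or(false, is_ident_char) {
        return None;
    }
    let mut end = pos + 1;
    // Zero or more further identifier characters, strictly adjacent.
    while input.get(end).copied().map_or(false, is_ident_char) {
        end += 1;
    }
    Some(end)
}

fn main() {
    let text = "x <- foo";
    let bytes = text.as_bytes();
    // "x" starts at offset 0; the space at offset 1 is not consumed.
    println!("{:?}", identifier_atomic(bytes, 0));
    // "foo" starts at offset 5 and runs to the end of the input.
    println!("{:?}", identifier_atomic(bytes, 5));
}
```

In this model the identifier "x" in "x <- foo" ends at offset 1, so the following space stays outside the identifier and is left for the enclosing rule to handle.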

keks, Apr 06 '22 22:04