openCypher icon indicating copy to clipboard operation
openCypher copied to clipboard

Ignore whitespaces in the grammar

Open jmarton opened this issue 8 years ago • 6 comments

Currently, the openCypher grammar has whitespaces modeled like the snippet below describing that ReadPart is a sequence of zero or more ReadingClause separated by whitespaces.

https://github.com/opencypher/openCypher/blob/8fa5658e433060a208971845a2b301d140461535/grammar/cypher.xml#L151-L155

To be more precise, the excerpt above also states that a ReadPart ends with a whitespace unless empty.

However, the next excerpt from ReadUpdateEnd is formalized more strictly to describe a sequence of one ore more of ReadingClauses that have whitespace separators strictly in between them, but not at the end.

https://github.com/opencypher/openCypher/blob/8fa5658e433060a208971845a2b301d140461535/grammar/cypher.xml#L111-L114

The reason behind the difference between the two excerpts is to make the grammar unambiguous also for whitespace matches.

If we could switch to whitespace-independent grammar, then the excerpt above could be rewritten in a more straightforward way like:

<repeat min="1">
  <non-terminal ref="ReadingClause"/>
</repeat>

I think it would be beneficial to switch to whitespace-independent formalization of the grammar. Also the patches in #300 could be rewritten in a more intuitive way then.

  • the railroad diagram generator already seems to skip whitespaces and does the hard work to illustrate one or more ReadingClause sequence of ReadUpdateEnd in its natural form despite the grammar says it is one followed by zero or more.
  • the generated ANTLR4 grammar could be rewritten to skip whitespaces (at least in theory, but I didn't try to do so)
  • is there a use for whitespaces in the original XML grammar or in the EBNF grammar artifact or anywhere else?

What do you think?

jmarton avatar Feb 13 '18 09:02 jmarton

Note: related issues were discussed earlier in #56. Essentially, handling whitespaces is very easy in CF grammar-based parsers such as plain ANTLR and Xtext (which uses ANTLR 3 internally).

However, it is not trivial in PEG-based tools such as Parboiled. The Handling Whitespace wikipage of the parboiled project says this:

One disadvantage of PEGs over lexer based CFGs can be the handling of white space. In a traditional CFG based parser with a separate lexer (scanner) phase this lexer might simply skip all white space and only generate tokens for the actual parser to operate on. This can free the actual parser grammar from all white space treatment. Since PEGs do not have a lexer but directly operate on the raw input they have to deal with white space in the grammar itself. Language designers with little experience in PEGs can sometime be unsure of how to best handle white space in their grammar.

szarnyasg avatar Feb 14 '18 18:02 szarnyasg

Thank you @szarnyasg for the reference and raising the point of PEGs.

However, PEGs usually do a greedy matching, so keeping whitespaces in the grammar and dropping (hiding) them in the ANTLR4 artifact:

  • might hide ambiguity for whitespace parsing, but greedy matching on whitespaces in PEGs will handle, i.e. for the ambiguous pattern SP* SP*, its first part will consume all whitespaces
  • the ANTLR4 tests will still raise ambiguity exceptions in case non-whitespace ambiguity is introduced
  • still allow for clarified repetition syntax similar to the one given in the issue description:
<repeat min="1">
  <non-terminal ref="ReadingClause"/> &WS;
</repeat>

jmarton avatar Feb 16 '18 21:02 jmarton

I think this makes a lot of sense, and will certainly simplify the grammar as a whole. I don't think I can foresee what exact consequences (if any) this will have for the language, but I can't come up with any real problem, if the rule was formulated that whitespace is allowed between tokens in any position of the query.

Some examples that I've considered that don't seem to be a problem:

  • function invocations: RETURN toInt ( '1' )
  • multi-word operators: RETURN 1 IS NOT NULL
  • patterns: MATCH ( a : A ) - [ r : TYPE | TYPE2 ] - > ( b : B { b : 1 } )

(EDIT: GitHub unhelpfully renders my multi-whitespace separations as single-whitespace in the above. You'll have to imagine wider separation, although it doesn't alter the point really.)

As correctly identified by @jmarton in the above, the specification for whitespace as currently done is to model this very rule, but it turns out it's pretty tricky to get it right.

@thobe Do you have any input for this discussion?

Mats-SX avatar Mar 01 '18 08:03 Mats-SX

Looks like I'm seeing this quite late.

I think what we should do is add a classification of non-terminals whether they are expected to be handled on a lexer level or on a parser level. A processing tool can then use that information to inject whitespace in appropriate places for output formats that needs to be explicit about it.

thobe avatar Dec 20 '18 14:12 thobe

I agree, and we already have started such a scheme:

https://github.com/opencypher/openCypher/blob/64ec11f72be9e49b07ca21930a93b53f35ad57e8/grammar/basic-grammar.xml#L546-L559

Mats-SX avatar Dec 20 '18 17:12 Mats-SX

Yes. I'm not sure I'm a fan of the particular naming, but it does the trick. "All" we need to do is use that for some sort of "whitespace injection", and we're set.

thobe avatar Dec 21 '18 00:12 thobe