BNFC should reject 'bad' filenames instead of silently sanitizing them
BNFC generates various names (files, directories, classes, ...) based on the filename of the input grammar (e.g. Lang.cf). Not all backends will be able to deal with all legal choices of Lang.cf, however. In those cases, BNFC chooses other names (see e.g. #325 and 22b548381a5d0afb700a983f1066cf72629f0361).
I think that BNFC should reject all choices of Lang.cf that are somehow incompatible with the backend. This behavior would be less surprising, and more in line with how BNFC deals with unsupported names in other situations: for instance, Foo. Foo ::= Bar ; is not legal when using the Java backend, so BNFC rejects it.
I agree, BNFC's filename sanitization is a bit odd. For instance, Foo1.cf is shortened to the stem Foo in some backends (Haskell/OCaml/Java), dropping the number.
This can, however, be exploited: successive versions of a grammar named Foo1.cf, Foo2.cf, etc. all end up generating the same files.
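To make the collision concrete, here is a minimal Python sketch. It assumes a sanitizer that does nothing except strip trailing digits from the stem, which is a simplification of what the backends actually do; `sanitize_stem` and `collisions` are hypothetical names, not BNFC functions.

```python
import os
import re

def sanitize_stem(path):
    """Hypothetical stand-in for the sanitizer: drop trailing digits from the stem."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return re.sub(r"\d+$", "", stem)

def collisions(paths):
    """Group grammar files by sanitized stem and keep only the groups that clash."""
    groups = {}
    for p in paths:
        groups.setdefault(sanitize_stem(p), []).append(p)
    return {stem: ps for stem, ps in groups.items() if len(ps) > 1}
```

Under this assumption, `collisions(["Foo1.cf", "Foo2.cf", "Bar.cf"])` reports that both Foo grammars map to the stem Foo, so their generated files would overwrite each other.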
The main question is how backwards compatible we should be.
In the present case, breaking backwards compatibility is probably harmless, if it is reported with a message like
ERROR: Foo1 isn't a legal name for a grammar in the Haskell backend (since BNFC version x.y.z).
Legal names are...
However, in case someone really wants one grammar compiled in several backends, the conditions on legal file identifiers could be mutually exclusive (e.g. Haskell modules start with a capital letter, while Java packages should be all lowercase). Concretely, it could hit our parameterized testsuite hard.
A new regime on filenames could be implemented under a flag like --strict-filename (and its opposite, the current behavior, under --sanitize-filename).
Out of curiosity, what is case 0 for this bug report?
I came across this when I found that some of our code apparently relied on a one-to-one correspondence between Lang.cf and the names of directories generated by BNFC: clearly not best practice.
I think it makes sense to fail early as the default (i.e., the --strict-filename case being applied by default), while supporting the old behavior via options. However, if this is not possible, --strict-filename is still a useful addition.
Another useful addition could be a flag --name NAME which would tell BNFC to generate names as if NAME.cf was the filename of the supplied grammar. In this mode, BNFC should always fail unless NAME is compatible with the chosen backend.
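The interplay of the proposed modes could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the `LEGAL` rule is a placeholder (uppercase letter, then letters, digits, or underscores), the sanitizer is loosely modelled on the digit-dropping behavior described above, and `resolve_name` with its `strict` and `name` parameters only stands in for the proposed `--strict-filename`, `--sanitize-filename`, and `--name NAME` options.

```python
import os
import re

# Placeholder legality rule for this sketch only.
LEGAL = re.compile(r"[A-Z][A-Za-z0-9_]*\Z")

def sanitize(stem):
    """Hypothetical sanitizer: keep ASCII letters only, capitalize the first one."""
    letters = re.sub(r"[^A-Za-z]", "", stem)
    return letters[:1].upper() + letters[1:]

def resolve_name(grammar_path, strict=True, name=None):
    """Choose the base name used for generated files.

    strict=True  models the proposed --strict-filename (fail early).
    strict=False models the proposed --sanitize-filename (current behavior).
    name         models the proposed --name NAME; an illegal NAME always fails.
    """
    if name is not None:
        if not LEGAL.match(name):
            raise ValueError(f"{name} isn't a legal name for a grammar")
        return name
    stem = os.path.splitext(os.path.basename(grammar_path))[0]
    if LEGAL.match(stem):
        return stem
    if strict:
        raise ValueError(f"{stem} isn't a legal name for a grammar")
    result = sanitize(stem)
    if not LEGAL.match(result):
        # Nothing usable is left after sanitizing (e.g. a stem made of digits only).
        raise ValueError(f"{stem} cannot be sanitized into a legal name")
    return result
```

With these assumptions, `resolve_name("lab1.cf")` raises an error, `resolve_name("lab1.cf", strict=False)` silently yields `Lab`, and `resolve_name("lab1.cf", name="Lab1")` yields `Lab1` regardless of the filename.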
I now see that our PLT students shoot themselves in the foot unknowingly by naming their grammar lab1.cf.
So I am leaning towards enforcing a discipline on grammar file names (that could be overridden by a flag).
What are legal module/package names?
- Haskell module: Haskell identifiers starting with an uppercase letter (letters, digits, underscores, primes): https://www.haskell.org/onlinereport/lexemes.html
- OCaml module: same as Haskell: https://ocaml.org/manual/names.html#module-name
- Java package: Java identifiers, separated by dot. Should start with lowercase letter, best practice: all lowercase.
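As a rough illustration, the rules listed above can be written as checks like the following Python sketch. It is ASCII-only, does not check reserved words, and treats the all-lowercase Java convention as if it were a hard requirement; `legal_for` and `legal_backends` are hypothetical helper names, not part of BNFC.

```python
import re

# Haskell/OCaml module name: uppercase letter, then letters, digits, underscores, primes.
MODULE_NAME = re.compile(r"[A-Z][A-Za-z0-9_']*\Z")
# One Java package component under the all-lowercase convention (assumed strict here).
JAVA_COMPONENT = re.compile(r"[a-z][a-z0-9_]*\Z")

def legal_for(backend, name):
    """Return True if name is acceptable for the given backend under the rules above."""
    if backend in ("haskell", "ocaml"):
        return bool(MODULE_NAME.match(name))
    if backend == "java":
        # A package name is a dot-separated sequence of components.
        return all(JAVA_COMPONENT.match(part) for part in name.split("."))
    raise ValueError(f"unknown backend: {backend}")

def legal_backends(name, backends=("haskell", "ocaml", "java")):
    """List the backends that would accept name without sanitization."""
    return [b for b in backends if legal_for(b, name)]
```

Under these simplified rules no single name is accepted by all three backends: `Lab1` passes for Haskell and OCaml only, while `lab1` passes for Java only, which is the mutual-exclusivity concern raised earlier in the thread.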
Where does sanitization happen?
- Haskell: CamelCase https://github.com/BNFC/bnfc/blob/07b97729dbbac2c5f2c1fb69256f02078fa0aacd/source/src/BNFC/Backend/Haskell/HsOpts.hs#L70-L71 https://github.com/BNFC/bnfc/blob/07b97729dbbac2c5f2c1fb69256f02078fa0aacd/source/src/BNFC/Backend/Haskell/HsOpts.hs#L78-L87 https://github.com/BNFC/bnfc/blob/07b97729dbbac2c5f2c1fb69256f02078fa0aacd/source/src/BNFC/Backend/Haskell/HsOpts.hs#L120-L128
- OCaml: CamelCase https://github.com/BNFC/bnfc/blob/07b97729dbbac2c5f2c1fb69256f02078fa0aacd/source/src/BNFC/Backend/OCaml.hs#L52-L54
- Java: snake_case https://github.com/BNFC/bnfc/blob/07b97729dbbac2c5f2c1fb69256f02078fa0aacd/source/src/BNFC/Backend/Java.hs#L59-L62
- C: no sanitization needed, as there is no concept of module in C.
- C++: ditto, uses just files and include.