AVRO-3825: [Java] Disallow invalid namespaces
What is the purpose of the change
According to the specification, each portion of a namespace separated by dot should be [a-zA-Z_][a-zA-Z0-9_].
https://avro.apache.org/docs/1.11.1/specification/#names
The name portion of the fullname of named types, record field names, and enum symbols must:
start with [A-Za-z_]
subsequently contain only [A-Za-z0-9_]
A namespace is a dot-separated sequence of such names. The empty string may also be used as a namespace to indicate the null namespace. Equality of names (including field names and enum symbols) as well as fullnames is case-sensitive.
The null namespace may not be used in a dot-separated sequence of names. So the grammar for a namespace is:
<empty> | <name>[(<dot><name>)*]
But the current Java binding accept invalid namespaces. This PR aims to fix this issue.
Verifying this change
Added new test and it passed on my laptop.
Documentation
No new features added.
Hmm, the Java binding seems to generate namespaces which contain $ for inner classes/enums.
[INFO] Running org.apache.avro.protobuf.TestProtobuf
Error: Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.372 s <<< FAILURE! - in org.apache.avro.protobuf.TestProtobuf
Error: org.apache.avro.protobuf.TestProtobuf.nestedEnum Time elapsed: 0.018 s <<< ERROR!
java.lang.RuntimeException: org.apache.avro.SchemaParseException: Illegal character in: Test$M
at org.apache.avro.protobuf.ProtobufData.getSchema(ProtobufData.java:187)
at org.apache.avro.protobuf.TestProtobuf.nestedEnum(TestProtobuf.java:115)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.util.ArrayList.forEach(ArrayList.java:1259)
at java.util.ArrayList.forEach(ArrayList.java:1259)
Caused by: org.apache.avro.SchemaParseException: Illegal character in: Test$M
at org.apache.avro.Schema.validateName(Schema.java:1641)
at org.apache.avro.Schema.validateSpace(Schema.java:1652)
at org.apache.avro.Schema.access$700(Schema.java:96)
at org.apache.avro.Schema$Name.<init>(Schema.java:720)
at org.apache.avro.Schema.createEnum(Schema.java:233)
at org.apache.avro.protobuf.ProtobufData.getSchema(ProtobufData.java:322)
at org.apache.avro.protobuf.ProtobufData.getSchema(ProtobufData.java:183)
Pending discussion. https://lists.apache.org/thread/pxlcrvd24sdhfz5fy3xn307jnd7c29zp
This use of the dollar sign is apparently defined in Java language specification §13.1 (The Form of a Binary), so all Java compilers should generate the same binary name Test$M. However, for local or anonymous classes, the JLS requires "a non-empty sequence of digits", and I think different compilers could choose different digits. Application developers then should not use such classes for Avro data, as it would be difficult from them to ensure that the name stays the same.
I'd prefer keeping dollar signs disallowed in the Avro specification. Schemas that use dollar signs won't be fully portable to other languages, even if the Java implementation of Avro allows them. It might be useful to have a strict mode that flags them in the Java implementation too, but I don't know how that should be configured.
This use of the dollar sign is apparently defined in Java language specification §13.1 (The Form of a Binary), so all Java compilers should generate the same binary name Test$M. However, for local or anonymous classes, the JLS requires "a non-empty sequence of digits", and I think different compilers could choose different digits. Application developers then should not use such classes for Avro data, as it would be difficult from them to ensure that the name stays the same.
I know that $ is automatically generated for inner classes and it's ideal that users don't use such classes. But protobuf generates such classes and the Java binding supports a feature that generates a schema from protobuf-generated classes.
On a side note, this change may lead to more previously accepted schemata to fail parsing.
PR #2448 includes an addition to the specification that describes how to fix this.
cc: @RyanSkraba
Hi Folks - Glad we have a PR to address this issue, it surfaced recently, and was wondering if there are plans to land this fix sometime in the near future. Thanks! 🙂
I'll update this PR after this PR is merged because this PR modifies a .proto file and need to regenerate .class files for test.