avro icon indicating copy to clipboard operation
avro copied to clipboard

AVRO-3825: [Java] Disallow invalid namespaces

Open sarutak opened this issue 2 years ago • 8 comments

What is the purpose of the change

According to the specification, each portion of a namespace separated by dot should be [a-zA-Z_][a-zA-Z0-9_]. https://avro.apache.org/docs/1.11.1/specification/#names

The name portion of the fullname of named types, record field names, and enum symbols must:

    start with [A-Za-z_]
    subsequently contain only [A-Za-z0-9_]

A namespace is a dot-separated sequence of such names. The empty string may also be used as a namespace to indicate the null namespace. Equality of names (including field names and enum symbols) as well as fullnames is case-sensitive.

The null namespace may not be used in a dot-separated sequence of names. So the grammar for a namespace is:

  <empty> | <name>[(<dot><name>)*]

But the current Java binding accept invalid namespaces. This PR aims to fix this issue.

Verifying this change

Added new test and it passed on my laptop.

Documentation

No new features added.

sarutak avatar Aug 09 '23 09:08 sarutak

Hmm, the Java binding seems to generate namespaces which contain $ for inner classes/enums.

[INFO] Running org.apache.avro.protobuf.TestProtobuf
Error:  Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.372 s <<< FAILURE! - in org.apache.avro.protobuf.TestProtobuf
Error:  org.apache.avro.protobuf.TestProtobuf.nestedEnum  Time elapsed: 0.018 s  <<< ERROR!
java.lang.RuntimeException: org.apache.avro.SchemaParseException: Illegal character in: Test$M
	at org.apache.avro.protobuf.ProtobufData.getSchema(ProtobufData.java:187)
	at org.apache.avro.protobuf.TestProtobuf.nestedEnum(TestProtobuf.java:115)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.util.ArrayList.forEach(ArrayList.java:1259)
	at java.util.ArrayList.forEach(ArrayList.java:1259)
Caused by: org.apache.avro.SchemaParseException: Illegal character in: Test$M
	at org.apache.avro.Schema.validateName(Schema.java:1641)
	at org.apache.avro.Schema.validateSpace(Schema.java:1652)
	at org.apache.avro.Schema.access$700(Schema.java:96)
	at org.apache.avro.Schema$Name.<init>(Schema.java:720)
	at org.apache.avro.Schema.createEnum(Schema.java:233)
	at org.apache.avro.protobuf.ProtobufData.getSchema(ProtobufData.java:322)
	at org.apache.avro.protobuf.ProtobufData.getSchema(ProtobufData.java:183)

sarutak avatar Aug 09 '23 14:08 sarutak

Pending discussion. https://lists.apache.org/thread/pxlcrvd24sdhfz5fy3xn307jnd7c29zp

sarutak avatar Aug 09 '23 15:08 sarutak

This use of the dollar sign is apparently defined in Java language specification §13.1 (The Form of a Binary), so all Java compilers should generate the same binary name Test$M. However, for local or anonymous classes, the JLS requires "a non-empty sequence of digits", and I think different compilers could choose different digits. Application developers then should not use such classes for Avro data, as it would be difficult from them to ensure that the name stays the same.

I'd prefer keeping dollar signs disallowed in the Avro specification. Schemas that use dollar signs won't be fully portable to other languages, even if the Java implementation of Avro allows them. It might be useful to have a strict mode that flags them in the Java implementation too, but I don't know how that should be configured.

KalleOlaviNiemitalo avatar Aug 09 '23 16:08 KalleOlaviNiemitalo

This use of the dollar sign is apparently defined in Java language specification §13.1 (The Form of a Binary), so all Java compilers should generate the same binary name Test$M. However, for local or anonymous classes, the JLS requires "a non-empty sequence of digits", and I think different compilers could choose different digits. Application developers then should not use such classes for Avro data, as it would be difficult from them to ensure that the name stays the same.

I know that $ is automatically generated for inner classes and it's ideal that users don't use such classes. But protobuf generates such classes and the Java binding supports a feature that generates a schema from protobuf-generated classes.

sarutak avatar Aug 09 '23 16:08 sarutak

On a side note, this change may lead to more previously accepted schemata to fail parsing.

PR #2448 includes an addition to the specification that describes how to fix this.

opwvhk avatar Aug 17 '23 09:08 opwvhk

cc: @RyanSkraba

sarutak avatar Aug 19 '23 08:08 sarutak

Hi Folks - Glad we have a PR to address this issue, it surfaced recently, and was wondering if there are plans to land this fix sometime in the near future. Thanks! 🙂

locomotif avatar Oct 07 '23 17:10 locomotif

I'll update this PR after this PR is merged because this PR modifies a .proto file and need to regenerate .class files for test.

sarutak avatar Oct 09 '23 00:10 sarutak