avro AVRO-3560: throw SchemaParseException on dangling content in avsc beyond end of schema

Make sure you have checked all steps below.

Jira

[x] My PR addresses the following Avro Jira issues and references them in the PR title. For example, "AVRO-1234: My Avro PR"
- https://issues.apache.org/jira/browse/AVRO-3560

Tests

[x] My PR adds the following unit test:
- TestSchema.testContentAfterAvsc()
- TestSchema.testContentAfterAvscInInputStream()

Commits

[x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

[x] In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does

Jul 01 '22 06:07 radai-rosenblatt

ended up finding a few bad schemas in other unit tests as well ...

Jul 01 '22 06:07 radai-rosenblatt

Perhaps this change should not apply to parse(InputStream in) where the caller might expect to be able to read content after the schema.

Jul 01 '22 07:07 KalleOlaviNiemitalo

updated the PR to allow dangling content when parsing from InputStream

Jul 01 '22 07:07 radai-rosenblatt

Could you add tests for trailing content in a UTF-8 file parsed using Schema.parse(File file)? JsonFactory.createParser(File f) apparently creates an InputStream for that, and JsonFactory._createParser(InputStream in, IOContext ctxt) calls ByteSourceJsonBootstrapper.constructParser, which creates an UTF8StreamJsonParser in that case. UTF8StreamJsonParser is a "byte-based" parser and inherits JsonParser.releaseBuffered(Writer w), which just returns -1, but UTF8StreamJsonParser.releaseBuffered(OutputStream out) can write something to the OutputStream. I think this means that, to correctly detect trailing content in a UTF-8 file, Schema.parse would have to call not only JsonParser.releaseBuffered(Writer) but also JsonParser.releaseBuffered(OutputStream), or first call JsonParser.getInputSource() and then use the type of the result to guess whether the parser is byte-based or char-based.

If parse(InputStream) parses a stream that has trailing content, and JsonParser buffers that content, it would be best to return that content to the InputStream so that the caller can then read it. However I don't see how to do that.

Is it possible that JsonParser buffers some content from a Reader, and the buffered content is all space characters and thus ignored by Avro.Schema, but the Reader has more content that JsonContent did not even read because its buffer filled up? In which case, Avro.Schema would have to check the Reader as well.

Jul 01 '22 07:07 KalleOlaviNiemitalo

i'll add a check for the File method. as for the case of a buffer full of whitespace (and presumably content beyond that) - I'm ok with missing that validation (as i expect it to be extremely uncommon). otherwise i'd need to read till EOF?

Jul 01 '22 07:07 radai-rosenblatt

that was an excellent catch @KalleOlaviNiemitalo. i've added a test for File input and adapted the code to work for all cases of dangling input.

except for the case of a buffer full of whitespace with more content hiding in the underlying input (either a reader or an inputstream). I'm ok with not solving for that case as I expect it to be rare (and the code changes required more invasive)

Jul 01 '22 08:07 radai-rosenblatt

@RyanSkraba - any chance of this making it in ?

Aug 09 '22 20:08 radai-rosenblatt

Hello! We released 1.11.1 the other day. It was kind of in a crunch, and I wasn't looking at any of the non-1.11.1 tagged issues :/ My apologies for not getting to this earlier -- it should be in the next release.

To confirm, there's really little or no security issues here right? We strictly just never look farther than the JSON beginning of File.

I like the branch name :D

Aug 11 '22 08:08 RyanSkraba

Ooops, it looks like this broke on master due to the unit test migration to JUnit 5. I'm taking a look.

Jun 14 '23 17:06 RyanSkraba