PARQUET-1408: Make parquet-tools display fields with missing values
When using parquet-tools on a parquet file with null records the null columns are omitted from the output.
Example:
scala> case class Foo(a: Int, b: String)
defined class Foo
scala> org.apache.spark.sql.SparkSession.builder.getOrCreate.createDataset((0 to 1000).map(x => Foo(1,null))).write.parquet("/tmp/foobar/")
Pre-patch:
☁ parquet-tools [master] ⚡ java -jar target/parquet-tools-1.10.1-SNAPSHOT.jar cat -j /tmp/foobar/part-00000-436a4d37-d82a-4771-8e7e-e4d428464675-c000.snappy.parquet | head -n5
{"a":1}
{"a":1}
{"a":1}
{"a":1}
{"a":1}
Post-patch:
☁ parquet-tools [master] ⚡ java -jar target/parquet-tools-1.10.1-SNAPSHOT.jar cat -j /tmp/foobar/part-00000-436a4d37-d82a-4771-8e7e-e4d428464675-c000.snappy.parquet | head -n5
{"a":1,"b":null}
{"a":1,"b":null}
{"a":1,"b":null}
{"a":1,"b":null}
{"a":1,"b":null}
thanks for the suggestions @zivanfi. Can definitely make 1 happen. I believe 2 is already covered because SimpleRecordConverter is used to convert nested types through SimpleMapRecordConverter and SimpleListRecordConverter which extend from SimpleRecordConverter. Did a sanity check here:
case class Bar(x: Int, y: String);case class Foo(x: Bar, y: Int);
org.apache.spark.sql.SparkSession.builder.getOrCreate.createDataset((0 to 1000).map(x => Foo(Bar(1,null), 23))).write.parquet("/tmp/foobar/")
java -jar target/parquet-tools-1.10.1-SNAPSHOT.jar cat -json /tmp/foobar/part-00000-35370411-c333-4eb9-9709-64b9d0abe657-c000.snappy.parquet | head
{"x":{"x":1,"y":null},"y":23}
{"x":{"x":1,"y":null},"y":23}
{"x":{"x":1,"y":null},"y":23}
I will formalize this into a unit test.
@zivanfi updated, let me know if you have any further concerns. Thank you!
I'll take another look, thanks.
For future reviews In general I would suggest you to address review comments in a separate commit instead of amending and force-pushing. A separate commit makes it easier to see the changes in the PR and when a committer merges the PR, they will squash the whole PR into a single commit anyway.
It is also recommended to not auto-format whole files, because the resulting whitespace changes add clutter to the PR diff and the git blame history and more importantly can lead to conflicts in the future when cherry-picking, backporting or merging in general. Fortunately in this PR this has only happened to a few comment lines, so it's not a problem here, but for more substantial unrelated whitespace changes we usually ask the author to revert affected lines, which is a lot more work than not changing them in the first place.
I just want to notice that this PR works well for cat command, but i guess that it's not covering other commands like head.
Follow the error that i'm facing with head command:
hadoop jar parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar head --show-nulls --debug example.parquet
java.lang.NumberFormatException: null
at java.lang.Long.parseLong(Long.java:552)
at java.lang.Long.parseLong(Long.java:631)
at org.apache.parquet.tools.command.HeadCommand.execute(HeadCommand.java:82)
at org.apache.parquet.tools.Main.main(Main.java:223)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
java.lang.NumberFormatException: null