parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

PARQUET-1408: Make parquet-tools display fields with missing values

Open rushton opened this issue 7 years ago • 4 comments

When using parquet-tools on a parquet file with null records the null columns are omitted from the output.

Example:

scala> case class Foo(a: Int, b: String)
defined class Foo

scala> org.apache.spark.sql.SparkSession.builder.getOrCreate.createDataset((0 to 1000).map(x => Foo(1,null))).write.parquet("/tmp/foobar/")

Pre-patch:

☁  parquet-tools [master] ⚡  java -jar target/parquet-tools-1.10.1-SNAPSHOT.jar cat -j /tmp/foobar/part-00000-436a4d37-d82a-4771-8e7e-e4d428464675-c000.snappy.parquet | head -n5
{"a":1}
{"a":1}
{"a":1}
{"a":1}
{"a":1}

Post-patch:

☁  parquet-tools [master] ⚡  java -jar target/parquet-tools-1.10.1-SNAPSHOT.jar cat -j /tmp/foobar/part-00000-436a4d37-d82a-4771-8e7e-e4d428464675-c000.snappy.parquet | head -n5
{"a":1,"b":null}
{"a":1,"b":null}
{"a":1,"b":null}
{"a":1,"b":null}
{"a":1,"b":null}

rushton avatar Aug 30 '18 21:08 rushton

thanks for the suggestions @zivanfi. Can definitely make 1 happen. I believe 2 is already covered because SimpleRecordConverter is used to convert nested types through SimpleMapRecordConverter and SimpleListRecordConverter which extend from SimpleRecordConverter. Did a sanity check here:

case class Bar(x: Int, y: String);case class Foo(x: Bar, y: Int);
org.apache.spark.sql.SparkSession.builder.getOrCreate.createDataset((0 to 1000).map(x => Foo(Bar(1,null), 23))).write.parquet("/tmp/foobar/")
java -jar target/parquet-tools-1.10.1-SNAPSHOT.jar  cat -json /tmp/foobar/part-00000-35370411-c333-4eb9-9709-64b9d0abe657-c000.snappy.parquet  | head                               
{"x":{"x":1,"y":null},"y":23}          
{"x":{"x":1,"y":null},"y":23}          
{"x":{"x":1,"y":null},"y":23}

I will formalize this into a unit test.

rushton avatar Sep 11 '18 14:09 rushton

@zivanfi updated, let me know if you have any further concerns. Thank you!

rushton avatar Sep 17 '18 21:09 rushton

I'll take another look, thanks.

For future reviews In general I would suggest you to address review comments in a separate commit instead of amending and force-pushing. A separate commit makes it easier to see the changes in the PR and when a committer merges the PR, they will squash the whole PR into a single commit anyway.

It is also recommended to not auto-format whole files, because the resulting whitespace changes add clutter to the PR diff and the git blame history and more importantly can lead to conflicts in the future when cherry-picking, backporting or merging in general. Fortunately in this PR this has only happened to a few comment lines, so it's not a problem here, but for more substantial unrelated whitespace changes we usually ask the author to revert affected lines, which is a lot more work than not changing them in the first place.

zivanfi avatar Sep 18 '18 16:09 zivanfi

I just want to notice that this PR works well for cat command, but i guess that it's not covering other commands like head.

Follow the error that i'm facing with head command:

hadoop jar parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar head --show-nulls --debug example.parquet
java.lang.NumberFormatException: null
	at java.lang.Long.parseLong(Long.java:552)
	at java.lang.Long.parseLong(Long.java:631)
	at org.apache.parquet.tools.command.HeadCommand.execute(HeadCommand.java:82)
	at org.apache.parquet.tools.Main.main(Main.java:223)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
java.lang.NumberFormatException: null

addomafi avatar Nov 29 '18 10:11 addomafi