Logical types from Avro input are broken.

Open ryota717 opened this issue 4 years ago • 0 comments

Hi, I'm using columnify with Avro input records and found that values of the date/time logical types (date, timemillis, timemicros, timestampmillis, timestampmicros) come out broken.

For example, the sample data produces the results below.

# jsonl input(OK)
$ ./columnify -schemaType avro -schemaFile columnifier/testdata/schema/logicals.avsc -recordType jsonl columnifier/testdata/record/logicals.jsonl > jsonl.parquet
$ parquet-tools cat -json jsonl.parquet
{"date":1,"timemillis":1000,"timemicros":1000000,"timestampmillis":1000,"timestampmicros":1000000}
{"date":2,"timemillis":2000,"timemicros":2000000,"timestampmillis":2000,"timestampmicros":2000000}
{"date":3,"timemillis":3000,"timemicros":3000000,"timestampmillis":3000,"timestampmicros":3000000}
{"date":4,"timemillis":4000,"timemicros":4000000,"timestampmillis":4000,"timestampmicros":4000000}
{"date":5,"timemillis":5000,"timemicros":5000000,"timestampmillis":5000,"timestampmicros":5000000}
{"date":6,"timemillis":6000,"timemicros":6000000,"timestampmillis":6000,"timestampmicros":6000000}
{"date":7,"timemillis":7000,"timemicros":7000000,"timestampmillis":7000,"timestampmicros":7000000}
{"date":8,"timemillis":8000,"timemicros":8000000,"timestampmillis":8000,"timestampmicros":8000000}
{"date":9,"timemillis":9000,"timemicros":9000000,"timestampmillis":9000,"timestampmicros":9000000}
{"date":10,"timemillis":10000,"timemicros":10000000,"timestampmillis":10000,"timestampmicros":10000000}

# avro input(NG)
$ ./columnify -schemaType avro -schemaFile columnifier/testdata/schema/logicals.avsc -recordType avro columnifier/testdata/record/logicals.avro > avro.parquet
$ parquet-tools cat -json avro.parquet
{"date":1970,"timemillis":1000000000,"timemicros":1000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":2000000000,"timemicros":2000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":3000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":4000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":5000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":6000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":7000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":8000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":9000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":10000000000,"timestampmillis":1970,"timestampmicros":1970}

This behavior seems to come from goavro, which decodes logical types into Go native types (from the time package). However, I don't have a good idea for how to convert those Go native types back to Parquet primitive types before writing :(

ryota717, Jan 28 '22 08:01