Auto-detect all input formats for files

Open nwt opened this issue 4 years ago • 3 comments

zed doesn't try to detect CSV, Parquet, JSON, or ZST inputs, but for file inputs, it should.

It doesn't because zio/detector.NewReaderWithOpts wraps its io.Reader parameter with Track and Recorder, which don't implement io.ReadSeeker and so aren't compatible with zio/parquetio.NewReader or zio/zstio.NewReader. But if the io.Reader parameter does implement io.ReadSeeker, NewReaderWithOpts can try Parquet, JSON, and ZST first, using Seek to rewind for the next format after each try.
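The rewind-and-retry idea above could look roughly like the following Go sketch. This is not the zio/detector code: `probe`, `detect`, `hasPrefix`, and `detectString` are hypothetical names used only for illustration, and the prefix checks are toy stand-ins for real format readers. The only format fact relied on is that Parquet files begin with the magic bytes `PAR1`.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"strings"
)

// probe is a hypothetical per-format check: try returns nil if the
// input looks like the named format.
type probe struct {
	name string
	try  func(r io.Reader) error
}

// hasPrefix builds a toy probe that only compares the leading bytes.
// A real reader would parse far more than this.
func hasPrefix(prefix string) func(io.Reader) error {
	return func(r io.Reader) error {
		buf := make([]byte, len(prefix))
		if _, err := io.ReadFull(r, buf); err != nil {
			return err
		}
		if !bytes.Equal(buf, []byte(prefix)) {
			return fmt.Errorf("missing prefix %q", prefix)
		}
		return nil
	}
}

// detect tries each probe in order, using Seek to rewind before every
// attempt. Readers that do not implement io.ReadSeeker are rejected.
func detect(r io.Reader, probes []probe) (string, error) {
	rs, ok := r.(io.ReadSeeker)
	if !ok {
		return "", errors.New("auto-detection requires seekable input")
	}
	for _, p := range probes {
		if _, err := rs.Seek(0, io.SeekStart); err != nil {
			return "", err
		}
		if p.try(rs) == nil {
			// Rewind once more so the caller can read from the start.
			if _, err := rs.Seek(0, io.SeekStart); err != nil {
				return "", err
			}
			return p.name, nil
		}
	}
	return "", errors.New("format detection error")
}

// detectString is a small demo harness. With seekable=false it hides
// the Seek method by wrapping the reader in an anonymous struct.
func detectString(s string, seekable bool) (string, error) {
	probes := []probe{
		{"parquet", hasPrefix("PAR1")},
		{"json", hasPrefix("{")},
	}
	var r io.Reader = strings.NewReader(s)
	if !seekable {
		r = struct{ io.Reader }{r}
	}
	return detect(r, probes)
}

func main() {
	name, err := detectString(`{"hello": "world", "pi": 3.14}`, true)
	fmt.Println(name, err)
	name, err = detectString("PAR1 pretend parquet bytes", false)
	fmt.Println(name, err)
}
```

The type assertion to `io.ReadSeeker` is the key step: a wrapper that only exposes `Read` fails it even when the underlying file could seek, which is the incompatibility described above.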

nwt, Apr 08 '21 22:04

Note to self: For now I've added a comment in the "Custom Brimcap Configuration" article that links to this open issue, so the reader understands the current limitation is temporary. If/when we address this issue, I should update the wiki article to remove the workaround and the comment. The same goes for the "Importing CSV, JSON, Parquet, and ZST (v0.25.0+)" article in the Brim wiki.

philrz, Jun 08 '21 23:06

I'm pleased to report that JSON auto-detect support was added via #3124.

philrz, Oct 13 '21 18:10

Another update: CSV auto-detect support was added via #3277.

philrz, Apr 05 '22 17:04

Verified in Zed commit e569877.

Here's an example of reading each of the supported formats (with the exception of line) using auto-detect.

$ zq -version
Version: v1.3.0-42-ge5698777

$ for format in arrows zng vng json zeek zjson csv parquet zson; do   echo '{"hello": "world", "pi": 3.14}' | zq -f $format -o sample.$format - ; done

$ for file in *; do echo -n "$file: ";   zq 'count()' $file; done
sample.arrows: {count:1(uint64)}
sample.csv: {count:1(uint64)}
sample.json: {count:1(uint64)}
sample.parquet: {count:1(uint64)}
sample.vng: {count:1(uint64)}
sample.zeek: {count:1(uint64)}
sample.zjson: {count:1(uint64)}
sample.zng: {count:1(uint64)}
sample.zson: {count:1(uint64)}

The line format still requires an explicit -i, since there's no way to automatically determine that the user intends the input to be treated this way.

$ zq -i line sample.zson
"{"
"    hello: \"world\","
"    pi: 3.14"
"}"

$ zq -i line sample.csv
"hello,pi"
"world,3.14"

And, as mentioned in #4270, the formats that require seekable input, like Parquet and VNG, aren't readable if compressed, whether via auto-detect or an explicit -i.

$ gzip *

$ for file in *; do echo -n "$file: ";   zq 'count()' $file; done
sample.arrows.gz: {count:1(uint64)}
sample.csv.gz: {count:1(uint64)}
sample.json.gz: {count:1(uint64)}
sample.parquet.gz: sample.parquet.gz: format detection error
	arrows: schema message length exceeds 1 MiB
	zeek: line 1: bad types/fields definition in zeek header
	zjson: line 1: invalid character 'P' looking for beginning of value
	zson: ZSON syntax error
	zng: malformed zng record
	csv: record on line 2: wrong number of fields
	json: invalid character 'P' looking for beginning of value
	parquet: auto-detection requires seekable input
	vng: auto-detection requires seekable input
	line: auto-detection not supported
sample.vng.gz: sample.vng.gz: format detection error
	arrows: schema message length exceeds 1 MiB
	zeek: line 1: bad types/fields definition in zeek header
	zjson: line 1: invalid character '\x06' looking for beginning of value
	zson: ZSON syntax error
	zng: truncated input
	csv: line 1: no comma found
	json: invalid character '\x06' looking for beginning of value
	parquet: auto-detection requires seekable input
	vng: auto-detection requires seekable input
	line: auto-detection not supported
sample.zeek.gz: {count:1(uint64)}
sample.zjson.gz: {count:1(uint64)}
sample.zng.gz: {count:1(uint64)}
sample.zson.gz: {count:1(uint64)}

$ zq -i parquet sample.parquet.gz 
sample.parquet.gz: reader cannot seek

$ zq -i vng sample.vng.gz 
sample.vng.gz: VNG must be used with a seekable input
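For context on why the compressed files fail: in Go, the standard library's `*gzip.Reader` has no `Seek` method, so a decompressed gzip stream can't satisfy a reader that needs to seek. The sketch below demonstrates that, plus one possible workaround (buffering the whole decompressed stream in memory). It is only an illustration and does not reflect what zq does; `gzipBytes`, `openGzip`, `isSeekable`, and `bufferDecompressed` are hypothetical helper names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipBytes compresses data in memory so the example needs no files.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// openGzip returns a streaming decompressor over the compressed bytes.
func openGzip(compressed []byte) (*gzip.Reader, error) {
	return gzip.NewReader(bytes.NewReader(compressed))
}

// isSeekable reports whether r also implements io.Seeker.
func isSeekable(r io.Reader) bool {
	_, ok := r.(io.Seeker)
	return ok
}

// bufferDecompressed reads the entire decompressed stream into memory
// and returns a seekable reader over it. Memory use grows with the
// decompressed size, so this only suits inputs that fit in RAM.
func bufferDecompressed(compressed []byte) (*bytes.Reader, error) {
	zr, err := openGzip(compressed)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	data, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	return bytes.NewReader(data), nil
}

func main() {
	compressed, err := gzipBytes([]byte("PAR1 pretend parquet bytes PAR1"))
	if err != nil {
		panic(err)
	}
	zr, err := openGzip(compressed)
	if err != nil {
		panic(err)
	}
	fmt.Println("gzip stream seekable:", isSeekable(zr))

	rs, err := bufferDecompressed(compressed)
	if err != nil {
		panic(err)
	}
	fmt.Println("buffered copy seekable:", isSeekable(rs))
}
```

In practice, the simpler route for the files above is to decompress them (for example with gunzip) before handing them to zq.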

Thanks @nwt!

philrz, Dec 24 '22 22:12