Passing a gzipfile into pycapnp for io doesn't work because gzipfile.fileno does not reference uncompressed data

Open xilvar opened this issue 10 years ago • 1 comments

Capnp message files are frequently gzip-compressed to make them much smaller. For example, I happen to be working with capnproto log files that are ~200MB/minute gzipped and ~800MB/minute uncompressed.

As it happens, the GzipFile implementation in both Python 2 and 3 exposes the fileno method in a weird, abstraction-breaking way: fileno for a GzipFile is actually the fd of the underlying, still-compressed file. Thus, when pycapnp is given a GzipFile, it ends up trying to read gzipped data as if it were plain capnp data and dies with a fairly mysterious error.
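The fileno behaviour can be checked without pycapnp at all. The following is a minimal sketch using only the standard library; the helper name first_bytes_via_fileno and the demo file name are made up for illustration. It reads the first two bytes straight from the fd that a GzipFile reports and shows that they are the gzip magic number rather than the payload.

```python
import gzip
import os
import tempfile


def first_bytes_via_fileno(path, n=2):
    """Return the first n bytes seen by a consumer that reads the raw fd.

    Hypothetical helper for illustration; it mimics any code that calls
    fileno() on the object and then reads the descriptor directly.
    """
    with gzip.open(path, "rb") as gz:
        fd = gz.fileno()              # fd of the underlying compressed file
        os.lseek(fd, 0, os.SEEK_SET)  # make sure we read from the start
        return os.read(fd, n)


payload = b"plain capnp bytes would go here"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.log.gz")
    with gzip.open(path, "wb") as out:
        out.write(payload)

    raw = first_bytes_via_fileno(path)   # what an fd-based reader sees
    with gzip.open(path, "rb") as gz:
        decoded = gz.read()              # what read() on the GzipFile returns

print(raw)                 # the gzip magic number, not the payload
print(decoded == payload)
```

Anything that reads through the descriptor starts parsing at the gzip header, which would be consistent with the segment-count error described in this report.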

For example, something like this:

import capnp, gzip
fi = gzip.open("log/logs-2015_08_13-00_42_00.log.gz")
schema = capnp.load("capnp/message.capnp")
g = schema.Message.read_multiple(fi)
for message in g:
  pass

results in an error of: capnp.lib.capnp.KjException: src/capnp/serialize.c++:159: failed: expected segmentCount < 512; Message has too many segments.

Whereas doing this works fine:

import capnp, os
fi = os.popen("cat log/logs-2015_08_13-00_42_00.log.gz | gzip -d")
schema = capnp.load("capnp/message.capnp")
g = schema.Message.read_multiple(fi)
for message in g:
  pass

Any thoughts on a fitting way to alter the library to allow use of gzipfiles? And some way to make the error less confusing?

Aug 19 '15 21:08 xilvar

As you observed, pycapnp duck-types based on a fileno method. We should probably filter to known working classes instead, but that makes me a bit sad.
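To make the "filter to known working classes" idea concrete, here is a rough sketch of what such a check could look like. This is not pycapnp's actual code; the function name fileno_is_trustworthy and the choice of accepted classes are assumptions for illustration only.

```python
import gzip
import io
import os
import tempfile


def fileno_is_trustworthy(f):
    """Hypothetical allowlist: accept only plain binary files opened from disk.

    This is an illustrative sketch, not what pycapnp does today.
    """
    if isinstance(f, io.FileIO):
        return True
    if isinstance(f, io.BufferedReader):
        # open(path, "rb") returns a BufferedReader wrapping a FileIO.
        return isinstance(f.raw, io.FileIO)
    return False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.log.gz")
    with gzip.open(path, "wb") as out:
        out.write(b"demo")

    with open(path, "rb") as plain:
        plain_ok = fileno_is_trustworthy(plain)
    with gzip.open(path, "rb") as gz:
        gzip_ok = fileno_is_trustworthy(gz)

print(plain_ok, gzip_ok)
```

A real allowlist would also have to admit pipes and sockets, which is part of why this approach is unattractive.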

To answer your question, reading actually occurs in the C++ libcapnp, and pycapnp just wraps everything that comes out. We'd probably have to duck-type Python objects that have a read method and wrap them in a MessageReader implementation similar to InputStreamMessageReader (see https://github.com/sandstorm-io/capnproto/blob/master/c%2B%2B/src/capnp/serialize.h for how this looks). I don't have a lot of time to work on this, but I'm happy to accept a PR :)

Alternatively, your method of calling out to gzip isn't that bad, and is probably going to be faster than the aforementioned method anyway.
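If spawning a shell is undesirable, the same effect can be approximated in pure Python by decompressing into an OS pipe from a background thread, so that the object handed over has a fileno that really does refer to uncompressed bytes. The sketch below uses only the standard library. The helper name open_gzip_as_pipe is hypothetical, and whether pycapnp accepts the resulting object is an assumption that has not been verified here. It is also likely slower than an external gzip process.

```python
import gzip
import os
import shutil
import tempfile
import threading


def open_gzip_as_pipe(path):
    """Return a binary file object whose fileno() yields decompressed bytes.

    Hypothetical helper: a daemon thread decompresses `path` into a pipe.
    """
    read_fd, write_fd = os.pipe()

    def pump():
        try:
            # Open the pipe end first so it is always closed, even if
            # opening the gzip file fails; closing it signals EOF.
            with os.fdopen(write_fd, "wb") as dst, gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, dst)
        except BrokenPipeError:
            pass  # the reader closed its end early

    threading.Thread(target=pump, daemon=True).start()
    return os.fdopen(read_fd, "rb")


# Demo: read through the raw descriptor, as an fd-based consumer would.
payload = b"x" * 200_000

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.log.gz")
    with gzip.open(path, "wb") as out:
        out.write(payload)

    chunks = []
    with open_gzip_as_pipe(path) as plain:
        while True:
            chunk = os.read(plain.fileno(), 65536)
            if not chunk:
                break
            chunks.append(chunk)

roundtrip = b"".join(chunks)
print(roundtrip == payload)
```

In the original example, the object returned by open_gzip_as_pipe would take the place of fi in the call to schema.Message.read_multiple(fi).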

Aug 20 '15 05:08 jparyani