Passing a GzipFile to pycapnp for I/O doesn't work because GzipFile.fileno() does not reference the uncompressed data
Capnp message files are frequently gzip-compressed to make them much smaller. For example, I happen to be working with capnproto log files that are ~200MB/minute gzipped and ~800MB/minute uncompressed.
As it happens, the GzipFile implementation in both Python 2 and 3 exposes the fileno method in a strange, abstraction-breaking way: fileno() for a GzipFile returns the fd of the underlying, still-compressed file. So when pycapnp is given a GzipFile, it ends up trying to read gzipped data as if it were plain capnp data and dies with a fairly mysterious error.
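The fileno() behaviour can be checked with the standard library alone. The snippet below is a minimal sketch: the file name and payload are made up for illustration, and no pycapnp call is involved. It reads the same GzipFile once through the object and once through its file descriptor.

```python
import gzip
import os
import tempfile

# Stand-in payload; a real file would hold serialized capnp messages.
payload = b"plain capnp bytes would go here " * 4

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.log.gz")  # hypothetical file name
    with gzip.open(path, "wb") as out:
        out.write(payload)

    with gzip.open(path, "rb") as fi:
        # Reading through the GzipFile object yields decompressed bytes.
        via_object = fi.read()

        # Reading through fileno() bypasses GzipFile entirely and hits
        # the compressed file on disk.
        raw_fd = fi.fileno()
        os.lseek(raw_fd, 0, os.SEEK_SET)
        via_fd = os.read(raw_fd, 2)

print(via_object == payload)  # the object view is the original data
print(via_fd)                 # the fd view starts with the gzip magic bytes
```

Any consumer that reads from the descriptor, as libcapnp does here, sees the gzip header first, which is why the failure shows up as a nonsensical segment count rather than a type error.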
For example, something like this:
import capnp, gzip
fi = gzip.open("log/logs-2015_08_13-00_42_00.log.gz")
schema = capnp.load("capnp/message.capnp")
g = schema.Message.read_multiple(fi)
for message in g:
    pass
results in an error of:
capnp.lib.capnp.KjException: src/capnp/serialize.c++:159: failed: expected segmentCount < 512; Message has too many segments.
Whereas doing this works fine:
import capnp, os
fi = os.popen("cat log/logs-2015_08_13-00_42_00.log.gz | gzip -d")
schema = capnp.load("capnp/message.capnp")
g = schema.Message.read_multiple(fi)
for message in g:
    pass
Any thoughts on a fitting way to alter the library to allow the use of GzipFile objects? And is there some way to make the error less confusing?
As you observed, pycapnp duck types based on a fileno method. We should probably filter to known working classes instead, but it makes me a bit sad.
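One cheap way to make the error less confusing would be a check before the descriptor is handed to the C++ side. The following is only a sketch of that idea, not pycapnp code: `require_raw_fd` and the tuple of rejected classes are hypothetical, and it rejects known-bad wrappers rather than allowing only known-good classes, which is the looser of the two options.

```python
import bz2
import gzip
import lzma

# Standard-library wrappers whose fileno() refers to the still-compressed
# file underneath. This list is an assumption for illustration; it is not
# something pycapnp ships.
_COMPRESSED_WRAPPERS = (gzip.GzipFile, bz2.BZ2File, lzma.LZMAFile)


def require_raw_fd(file_obj):
    """Return file_obj.fileno(), or raise a descriptive TypeError.

    Hypothetical helper: refuses objects whose descriptor is known not to
    expose the bytes that the object's own read() would return.
    """
    if isinstance(file_obj, _COMPRESSED_WRAPPERS):
        raise TypeError(
            "%s.fileno() refers to the compressed file; decompress the "
            "data first or pass a pipe/plain file instead"
            % type(file_obj).__name__
        )
    try:
        return file_obj.fileno()
    except (AttributeError, OSError) as exc:
        # io.UnsupportedOperation (e.g. from BytesIO) is an OSError subclass.
        raise TypeError(
            "%s has no usable file descriptor" % type(file_obj).__name__
        ) from exc
```

A check like this would turn the "too many segments" failure into an immediate, readable exception, at the cost of maintaining the list of rejected types.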
To answer your question: reading actually occurs in the C++ libcapnp, and pycapnp just wraps everything that comes out. We'd probably have to duck type Python objects that have a read method and wrap them in a MessageReader implementation similar to InputStreamMessageReader (see https://github.com/sandstorm-io/capnproto/blob/master/c%2B%2B/src/capnp/serialize.h for how this looks). I don't have a lot of time to work on this, but I'm happy to accept a PR :)
Alternatively, your method of calling out to gzip isn't that bad, and is probably going to be faster than wrapping a Python read method anyway.
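If shelling out is undesirable, the same effect (a real file descriptor that carries uncompressed bytes) can be had in-process with an OS pipe. This is a rough sketch under stated assumptions: `open_decompressed_pipe` is a hypothetical helper, it has not been tested against pycapnp, and it makes no claim about speed relative to the external gzip process.

```python
import gzip
import os
import shutil
import threading


def open_decompressed_pipe(path):
    """Return a binary file object whose fileno() yields uncompressed bytes.

    Hypothetical helper: a background thread decompresses `path` with the
    gzip module and writes the result into an OS pipe; the caller gets the
    read end of that pipe.
    """
    read_fd, write_fd = os.pipe()

    def pump():
        try:
            # The write end is opened first so it is always closed on exit,
            # which is what signals EOF to the reader.
            with os.fdopen(write_fd, "wb") as sink, \
                    gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, sink)
        except BrokenPipeError:
            pass  # the reader closed its end before consuming everything

    threading.Thread(target=pump, daemon=True).start()
    return os.fdopen(read_fd, "rb")
```

The returned object could then presumably be passed where the os.popen result is used above, e.g. `schema.Message.read_multiple(open_decompressed_pipe(path))`. The pipe is not seekable, so this only suits sequential reads.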