The open() API is inconsistent between text and binary modes
Iterating over text data works as expected:
In [2]: for line in zstandard.open('123.zst', 'rt'):
   ...:     print(line)
   ...:
123
Iterating over binary data raises an exception:
In [4]: for line in zstandard.open('123.zst', 'rb'):
   ...:     print(line)
   ...:
---------------------------------------------------------------------------
UnsupportedOperation                      Traceback (most recent call last)
<ipython-input-4-42009a852259> in <module>
----> 1 for line in zstandard.open('123.zst', 'rb'):
      2     print(line)

UnsupportedOperation:
I think it makes sense to wrap the binary reader/writer in io.BufferedReader/io.BufferedWriter. Most libraries provide the same interface for text and binary streams. Since builtins.open(..., "rb") returns an io.BufferedReader, this proposed change makes sense.
We'll likely want to add an argument to zstandard.open() to control the buffer size, including an option to disable buffering entirely, since buffering does have performance implications.
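For concreteness, here is a rough sketch of what that could look like. This is not the library's current implementation: open_zst and its buffering parameter are invented names, and it assumes the objects returned by stream_reader()/stream_writer() expose enough of the raw-stream interface (readable()/readinto(), writable()/write()) for the io wrappers to accept them.

```python
import io
import zstandard

def open_zst(path, mode="rb", buffering=-1):
    """Hypothetical helper showing what zstandard.open() could do for
    binary modes: wrap the (de)compression stream in an io buffer so
    callers get readline() and line iteration for free."""
    buffer_size = io.DEFAULT_BUFFER_SIZE if buffering < 0 else buffering

    if mode == "rb":
        fh = open(path, "rb")
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        if buffering == 0:
            return reader  # caller explicitly opted out of buffering
        # Assumes the reader exposes readable()/readinto(), which is what
        # io.BufferedReader requires of the raw stream it wraps.
        return io.BufferedReader(reader, buffer_size)

    if mode == "wb":
        fh = open(path, "wb")
        writer = zstandard.ZstdCompressor().stream_writer(fh)
        if buffering == 0:
            return writer
        # Same assumption on the write side: writable()/write() suffice
        # for io.BufferedWriter.
        return io.BufferedWriter(writer, buffer_size)

    raise ValueError(f"unsupported mode: {mode!r}")
```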
Thinking about this some more, the compression streams are already buffered. The original bug report uses for line in zstandard.open(..., "rb"). So the missing feature here is iteration support over binary streams, not buffering control.
We certainly could implement a buffering argument for parity with open(). However, it will probably be ignored in many cases.
There is also ambiguity as to where the buffer should go. Is it:
- original → buffer → (de)compressor
- original → (de)compressor → buffer

Either of these is a valid place to insert buffering. Since the location is ambiguous, it is probably best to require power users who need buffering control to pass an already-buffered stream into zstandard.open() or to wrap its return value with a buffering wrapper themselves.
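For example, a caller who wants line iteration over the binary stream today could do the wrapping themselves, roughly like this (a sketch that assumes the binary reader returned by zstandard.open() is accepted by io.BufferedReader):

```python
import io
import zstandard

# Workaround: wrap the binary reader yourself to get readline() and
# line iteration, assuming it exposes readable()/readinto().
with zstandard.open("123.zst", "rb") as raw:
    for line in io.BufferedReader(raw):
        print(line)
```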
What do others think?