The open() API is inconsistent between text and binary modes
Iterating over text data works as expected:
In [2]: for line in zstandard.open('123.zst', 'rt'):
   ...:     print(line)
   ...:
123
Iterating over binary data raises an exception:
In [4]: for line in zstandard.open('123.zst', 'rb'):
   ...:     print(line)
   ...:
---------------------------------------------------------------------------
UnsupportedOperation                      Traceback (most recent call last)
<ipython-input-4-42009a852259> in <module>
----> 1 for line in zstandard.open('123.zst', 'rb'):
      2     print(line)

UnsupportedOperation:
I think it makes sense to wrap the binary reader/writer in io.BufferedReader/io.BufferedWriter. Most libraries provide the same interface for text and binary streams. Since builtins.open(..., "rb") returns an io.BufferedReader, this proposed change makes sense.
We'll likely want to add an argument to zstandard.open() to control the buffer size, including an option to disable buffering entirely, since buffering does have performance implications.
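For concreteness, here is a rough sketch of what that could look like. This is not the library's current implementation: open_zst and its buffering parameter are invented names, and it assumes the objects returned by stream_reader()/stream_writer() expose enough of the raw-stream interface (readable()/readinto(), writable()/write()) for the io wrappers to accept them.

```python
import io
import zstandard

def open_zst(path, mode="rb", buffering=-1):
    """Hypothetical helper showing what zstandard.open() could do for
    binary modes: wrap the (de)compression stream in an io buffer so
    callers get readline() and line iteration for free."""
    buffer_size = io.DEFAULT_BUFFER_SIZE if buffering < 0 else buffering

    if mode == "rb":
        fh = open(path, "rb")
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        if buffering == 0:
            return reader  # caller explicitly opted out of buffering
        # Assumes the reader exposes readable()/readinto(), which is what
        # io.BufferedReader requires of the raw stream it wraps.
        return io.BufferedReader(reader, buffer_size)

    if mode == "wb":
        fh = open(path, "wb")
        writer = zstandard.ZstdCompressor().stream_writer(fh)
        if buffering == 0:
            return writer
        # Same assumption on the write side: writable()/write() suffice
        # for io.BufferedWriter.
        return io.BufferedWriter(writer, buffer_size)

    raise ValueError(f"unsupported mode: {mode!r}")
```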
Thinking about this some more, the compression streams are already buffered. The original bug report uses for line in zstandard.open(..., "rb"). So the missing feature here is iteration support over binary streams, not buffering control.
We certainly could implement a buffering argument for parity with open(). However, it will probably be ignored in many cases.
There is also ambiguity as to where the buffer should go. Is it:
- original → buffer → (de)compressor
- original → (de)compressor → buffer

Either of these is a valid place to insert buffering. Since the location is ambiguous, it is probably best to require power users who need buffering control to pass an already-buffered stream into zstandard.open() or to wrap its return value with a buffering wrapper themselves.
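For example, a caller who wants line iteration over the binary stream today could do the wrapping themselves, roughly like this (a sketch that assumes the binary reader returned by zstandard.open() is accepted by io.BufferedReader):

```python
import io
import zstandard

# Workaround: wrap the binary reader yourself to get readline() and
# line iteration, assuming it exposes readable()/readinto().
with zstandard.open("123.zst", "rb") as raw:
    for line in io.BufferedReader(raw):
        print(line)
```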
What do others think?