b2-sdk-python icon indicating copy to clipboard operation
b2-sdk-python copied to clipboard

disable decompression on download (with content-encoding=gzip header)

Open tobiathan opened this issue 5 years ago • 19 comments

Create a B2Response subclass that inherits from Response and override iter_content() to set decode_content=False in call to raw.stream(). Use B2Response for downloads so that when files are uploaded with b2-content-encoding=gzip, the client doesn't decompress the uploaded data.

tobiathan avatar Jan 05 '21 05:01 tobiathan

Indeed, someone could rely on this behavior already, so changing this could break their code.

I suggest the following plan:

  1. In apiver v1 expose a new parameter b2ContentEncoding=None for download_file_by_id and download_file_by_name. This will let users of b2sdk apiver v1 (specifically b2cli) change the behavior by using "identity" value for this parameter, so that the downloaded file is retrieved exactly the same as it was uploaded.
  2. In apiver v2 change the default value so that b2ContentEncoding="identity". Any clients migrating from apiver v1 to apiver v2 will then have to decide whether they want to keep automatic decompression (then they just need to set the new parameter to None), or maybe they'd like to use the new default in order to get the exact same data that has been uploaded.

This way we'll change the default behavior of b2sdk while not breaking anything for the existing users of b2sdk. We'll also get to fix it for b2cli users right away... But unfortunately b2cli users may also depend on that odd behavior (which I personally consider to be an ancient bug, but nevertheless I don't want to break backup/restore scripts for anyone), so in order to release this fix, we'll have to release a major version of b2cli.

Does this make sense? @tobiathan @eallrich @mlech-reef

ppolewicz avatar Jan 06 '21 17:01 ppolewicz

These are interesting ideas, but there are some questions I'd raise here for your consideration.

First: What does it mean when you say that we're 'changing' the default behavior of b2sdk? Currently, using the b2-python-sdk through the b2cli, if a user has uploaded with b2-content-encoding=gzip, then the download actually breaks, e.g.,

Traceback (most recent call last): File "b2/console_tool.py", line 1500, in run_command File "b2/console_tool.py", line 645, in run File "logfury/v0_1/trace_call.py", line 84, in wrapper File "b2sdk/api.py", line 230, in download_file_by_id File "logfury/v0_1/trace_call.py", line 84, in wrapper File "b2sdk/transfer/inbound/download_manager.py", line 125, in download_file_from_url File "b2sdk/transfer/inbound/download_manager.py", line 132, in _validate_download b2sdk.exception.TruncatedOutput: only 1442460 of 530585 bytes read ERROR: only 1442460 of 530585 bytes read

So at first glance, it seems to me that this PR (or any other that aims to address the same issue) isn't breaking anything for existing users of b2sdk; it's fixing something that wasn't working properly to begin with. Fixing that doesn't somehow break a currently working behavior; it addresses a currently broken behavior. Am I missing something here (which is very possible)?

Secondly: I think the idea of overriding content-encoding (passing identity on the download) is really clever! But it would also mean that the headers which are sent on download aren't in fact what the client originally specified at upload time. Does that matter? I think it probably does; we want to preserve the headers that the user specified, but also expect to receive the content as the user uploaded it.

ericjding avatar Jan 06 '21 17:01 ericjding

I think it might have worked correctly for files that had a .gz extension, but were actually nearly non-compressible and ended up with the same size as before compression. That's a bit of a stretch though, anyone relying on this behavior would encounter massive reliability issues. b2sdk.exception.TruncatedOutput goes way back before transferer, so that's not a new thing either.

As for your second remark, I'm not really sure. The user specified the header he wants to use and any client when seeing that header should behave in a certain way that is described by RFC. This is actually a nice feature, you can compress things for storage in the cloud, set a header and then expect any client (not just one based on b2-sdk) to decompress the file upon retrieval. How would a browser behave in such case, or wget, or curl? Should b2sdk behave differently? Should b2cli behave differently? Why? Maybe this behavior should be decided by a parameter (command line flag or method parameter), but if that, what should be the default?

Personally I think b2cli should not decompress anything as it is a special kind of client that should work with raw data. It should also preserve headers, as you mentioned, though if b2cli/b2sdk would need to have a limitation of overriding this one field in headers during download operation, so if someone would like to download the file and check the original value (provided during upload) of this field at the same time, they'd need to perform a separate get_file_info (for ease of implementation (like not copying a part of requests) and potentially for performance reasons), then I could live with that. It is not really a crippling limitation. Should b2sdk decompress data for the user? I think it should do no such thing by default. Should there be an option to enable decompression? Maybe. If someone one day requests such an option and contributes the code for it or convinces others to contribute the code for it, then why not. I don't think this option would be used by a wide range of people, so having no such option would be very acceptable (though if during implementation we'll end up adding it, that's also fine). We should check what curl and wget will do, specifically we should also check what would a new b2sdk b2http backend based on pycurl do. This might impact our solution, if we are thinking about forward compatibility.

ppolewicz avatar Jan 06 '21 23:01 ppolewicz

Wget apparently doesn't support Content-Encoding and in the recent versions sends 'Accept-encoding: identity' Curl does similar but has --compressed flag to automatically decompress the data. PyCurl has this option customizable, but it's disabled by default - https://curl.se/libcurl/c/CURLOPT_ACCEPT_ENCODING.html

mlech-reef avatar Jan 07 '21 14:01 mlech-reef

What about the browser?

ppolewicz avatar Jan 07 '21 17:01 ppolewicz

What about the browser?

All the browsers that I have installed accept Content-Encoding. I would be very surprised if they don't :)

mlech-reef avatar Jan 07 '21 19:01 mlech-reef

Firefox 64 and 84 on linux did not decompress it, everything else I have decompressed it.

I'm starting to think we should fix the decompression and we should think about adding compression support to sync and upload-file (and maybe copy-file, but that's more thinking), though an option to not use it should be provided for dealing with damaged file (so that the user doesn't need to resort to wget with an authorized url in such case).

ppolewicz avatar Jan 08 '21 01:01 ppolewicz

(sorry, I accidentally posted this at first under @tobiathan's account)

In case you feel it's relevant: we are definitely fixing the behavior on the b2-sdk-java so that it does not decompress the file in download, since that breaks the checksum/size comparison in that SDK too. In the case of the Java SDK, the underlying library we use gives us the flexibility to make it an option if we so choose; that's less the case here, though I suppose it could be if we wanted.

I'll probably work with @tobiathan to refine this PR somewhat and add unit tests over the next few days unless you absolutely object and think we should go another way altogether. I'm happy for the constructive conversation to continue.

ericjding avatar Jan 08 '21 04:01 ericjding

Ok, after further discussion we came to a conclusion, that this feature of automatic decompression of content using the content-encoding header should not be fixed. Yes, chrome and curl with --compressed may respect that header (and it is in HTTP/1.1 since 1997), but the way it is implemented right now (by the cloud side, by only having/offering one version of the file) is making it unsuitable for implementation of compression the way we would want it. In other words, even if we wanted to implement compression, this is not the way we would do it (as it only supports "easy" compression like gzip or brotli anyway, and not bz2 / lzma which has better ratios).

Therefore the way to fix this issue is to disable the decompression behavior, but that can be done in several ways and it seems that the easiest and cleanest way for b2sdk is to use ~Accept-encoding: identity header~ (server ignores it!) b2ContentEncoding=identity parameter for download requests.

@ericjding @tobiathan will you remake the PR in such manner?

ppolewicz avatar Jan 08 '21 15:01 ppolewicz

Sure, we can remake the PR, but I'm a little confused by what you said by way of explanation. In the first part, you said, "this feature of automatic decompression of content using the content-encoding header should not be fixed. Then below, you write, "the way to fix this issue is to disable the decompression behavior..."

Is this another way of saying you don't want to change the way iter_content() behaves, but instead keep the content-encoding: gzip header from being sent?

ericjding avatar Jan 08 '21 17:01 ericjding

Ok so the problem is that downloading a file with content-encoding=gzip doesn't work, but there are two possible solutions:

  1. Fix it so that it works like Chrome does and HTTP/1.1 says (download the file and decompress it for the user)
  2. Fix it so that it works like wget does (download the file in its raw form and leave it like that, violating HTTP/1.1 spec)

Now having a cross-implementation support for compression, whereas the user could choose to gzip a file when uploading it, where it would be ungzipped automatically upon retrieval, seems like a very nice feature to have. Unfortunately, there are two problems with it:

  1. It doesn't support other compression formats across those many platforms, specifically lzma2
  2. Wget doesn't support it at all
  3. Curl only supports it with an option
  4. Firefox on linux doesn't support it (despite some sources claiming that it does - but my test showed an output still compressed)
  5. We'd have to separately store a post-decompression checksum if we'd like to verify the integrity of the file somewhere.

So it looks that if we would be implemeting compression support (which by the way we might want to discuss, since we are deep in the subject already), we'd also need to use another type of a header... though when I think about it now, the user might be warned when selecting possible compression algorithms (lzma2 specifically) that not all clients will respect it as it is not officially registered by the RFC, so he'd need a client from a shorter list to download and curl --compress would not be enough.

There was a curl bug somewhere where a user complained content-encoding is not supported, where curl maintainers argued it will only be supported if curl sends an accept-encoding request first, but it will not be supported if the server suddenly provides a header on its own. I suspect this might be why it doesn't work in Firefox.

I'm still a little bit on the fence with this. If we decide today to ignore content-encoding (which is supported by the b2 server, I think so that we can support compression, because why would it support it?), then how are we going to add this support in the future?

If b2-sdk-java and b2-sdk-python would agree to support compression somehow (through RFC7231 content-encoding or in any other manner), perhaps extending the header list with non-standard formats (bz2, lzma2), we could also work with other implementations to introduce the support (if it's not there yet) and get it working everywhere.

ppolewicz avatar Jan 08 '21 22:01 ppolewicz

Thanks for going into more depth. I'd like to get some more input from other B2 developers and consultants internally before making a final decision, both to get a sense of the history/rationale for this future (and thus its originally intended use cases) and of a way forward that makes sense and is consistent between (at least) b2-sdk-java and b2-sdk-python. I know that a fix/change to make b2-java-sdk ignore content-encoding: gzip was submitted last week, but I appreciate the thoughtfulness you guys are bringing to the table as we think through alternatives. I'll reach out next week after I've had a chance to gather some more information!

ericjding avatar Jan 09 '21 03:01 ericjding

Also, in initial testing with the suggested approach of overriding b2ContentEncoding, I'm running into problems because, as the b2 docs say, "Requests with this specified must also have an authorization token." So the integration tests fail on b2_download_file_by_id (no auth) when I add the b2ContentEncoding=identity URL query param. That may turn out to be a showstopper for that approach.

ericjding avatar Jan 09 '21 05:01 ericjding

Indeed, that's a problem.

As for b2-sdk-java, I think it has not released a version with the fix yet, so I think it's not too late. I mean we can always release a major version to break compatibility, but here even that will not be needed.

I'm awaiting the results of the internal discussion, please keep us posted.

ppolewicz avatar Jan 10 '21 06:01 ppolewicz

After a next round of discussions, it was decided that at least in "default configuration", the automatic decompression will NOT be supported by either b2-sdk-java nor b2-sdk-python. As b2-sdk-python is going to switch http backends soon, we'll pause #177 for a couple of weeks until we'll know if requests will still be in play, and only then we'll decide whether to close the PR completely, or fix it (and how).

ppolewicz avatar Jan 11 '21 21:01 ppolewicz

This was resolved in another PR a long while ago

ppolewicz avatar May 20 '21 15:05 ppolewicz

@ppolewicz do you know which PR it was resolved in? I'd like to note it in our internal issue tracker too. thx

ericjding avatar May 20 '21 17:05 ericjding

I can't find it in changelog. I'm afraid it wasn't actually deployed.

ppolewicz avatar May 20 '21 22:05 ppolewicz

after I closed the ticket I realized it would be nice to say in which version this was done and I marked a task for myself to find it, but it's not in the changelog. Will look at this again tomorrow.

ppolewicz avatar May 20 '21 22:05 ppolewicz

resolved in https://github.com/Backblaze/b2-sdk-python/releases/tag/v1.17.0 , i.e. long time ago

mjurbanski-reef avatar Mar 25 '24 11:03 mjurbanski-reef