Content-Type for files uploaded via S3 automatically set to application/xml
Describe the bug
When I upload a file to S3 (using a multipart upload request), the content-type of the file is set to application/xml unless I specify otherwise. This seems incorrect: a content-type should be omitted if unknown or, at worst, default to application/octet-stream. Per RFC 7231 (3.1.1.5):
A sender that generates a message containing a payload body SHOULD generate a Content-Type header field in that message unless the intended media type of the enclosed representation is unknown to the sender. If a Content-Type header field is not present, the recipient MAY either assume a media type of "application/octet-stream" ([RFC2046], Section 4.5.1) or examine the data to determine its type.
This ended up causing a bit of confusion in https://github.com/apache/arrow/issues/11934. An S3 client was trying to be intelligent by inspecting the contents of any file marked as XML, and this default caused the client to inspect files it shouldn't have.
Expected behavior
If the content type of a file is not set, the file should either have no content-type or have its content-type set to application/octet-stream.
Current behavior
The file's content-type is set to application/xml
Steps to Reproduce
Reproducible Gist: https://gist.github.com/westonpace/9c3a0baa48083f33aa4880c0cb6a602b
Possible Solution
When the user does not specify a content-type, either leave it unset or default to application/octet-stream.
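As a rough illustration of the second option, here is a minimal sketch of the fallback logic. `EffectiveContentType` is a hypothetical helper, not part of the AWS SDK, and treating an empty string as "not specified" is an assumption made for the example.

```cpp
#include <string>

// Hypothetical helper (not an AWS SDK function): chooses the content type to
// send with an upload when the caller may not have provided one.
// Assumption: an empty string means "the user did not specify a content type".
inline std::string EffectiveContentType(const std::string& requested) {
    if (!requested.empty()) {
        // The user gave an explicit type, so pass it through unchanged.
        return requested;
    }
    // RFC 7231 (3.1.1.5) allows a recipient to assume this type when none is
    // given, so it is used here as the explicit fallback.
    return "application/octet-stream";
}
```

On the caller's side, the result could then be set on the upload request before it is sent, for example with `SetContentType` on `Aws::S3::Model::CreateMultipartUploadRequest`, which is the same way a user can work around the current default.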
AWS CPP SDK version used
1.8.185
Compiler and Version used
GCC 9.3.0
Operating System and version
Ubuntu 20.04.3
Hi @westonpace, quick question before I dig too deep into this: have you tried the TransferManager for multipart uploads, or is there a reason why you can't use it? I just tried it and didn't get the same behavior, so it might be a good workaround to get you unblocked.
@KaibaLopez Thanks for the suggestion. I was working on the Apache Arrow S3 filesystem adapter, which currently does not use the transfer manager (https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc). Although that may be an interesting experiment someday, it would add an extra dependency and be a bit more of a change.
I'm not really blocked by this. It was simple enough to ensure we always specify the content type. Perhaps the main issue is simply that this default isn't documented anywhere, so it came as a surprise and took a little while to isolate as the root cause.