merge fails with small or large files
Sorry if there is already an issue for this; the word "merge" is difficult to search for due to its use in git.
The merge function throws an error if any parts are outside the valid part size range for an S3 multipart upload.
Large input files can be split into multiple parts using the CopySourceRange parameter of upload_part_copy. Smaller parts will need either at least 5MiB of data to be downloaded or temporary files on S3.
Here are some failing tests:
def test_merge_small_part(s3):
    with s3.open(a, "wb") as f:
        f.write(b"a" * 4 * 2**20)
    with s3.open(b, "wb") as f:
        f.write(b"a" * 10 * 2**20)
    s3.merge(test_bucket_name + "/joined", [a, b])
    assert s3.info(test_bucket_name + "/joined")["size"] == (4 * 2**20) + (10 * 2**20)

def test_merge_large_part(s3):
    with s3.open(a, "wb") as f:
        f.write(b"a" * 6 * 2**30)
    with s3.open(b, "wb") as f:
        f.write(b"a" * 10 * 2**20)
    s3.merge(test_bucket_name + "/joined", [a, b])
    assert s3.info(test_bucket_name + "/joined")["size"] == (6 * 2**30) + (10 * 2**20)
You are quite right, and this is a limitation of the AWS API. We could implement something which downloads and re-writes smaller chunks. That is basically what open(..., mode="a") does when the existing file is smaller than the 5MB limit. It might get complicated, however, for something like:
file 1: 5GB chunk, 1MB chunk
file 2: 5GB chunk
would require downloading and reuploading the whole of 1MB + 5GB. We are probably better off not supporting this.
That example isn't overly complicated, as file 1 can be split into 2 compliant chunks. This one is a little trickier:
file 1: 5GB chunk
file 2: 1MB chunk
file 3: 5GB chunk
For this it wouldn't be necessary to fully download file 3, only 4MiB from its start.
There is also a route that avoids downloading data entirely: merge file 1 and file 2 into a temp file, then merge the temp file with file 3.
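For concreteness, the first route could look roughly like the sketch below. file1, file2, file3 and out_path are placeholder bucket/key paths; it leans on s3fs's split_path/call_s3 internals, and it assumes upload_part_copy will accept an arbitrary CopySourceRange and that the 5GB chunks fit within the 5GiB part limit.
MIN_PART = 5 * 2**20  # 5MiB minimum for every part except the last

bucket, key, _ = s3.split_path(out_path)
mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)
parts = []

# part 1: file 1 is within the part-size limits, so copy it server-side in one go
p = s3.call_s3(
    "upload_part_copy", Bucket=bucket, Key=key, PartNumber=1,
    UploadId=mpu["UploadId"], CopySource=file1,
)
parts.append({"PartNumber": 1, "ETag": p["CopyPartResult"]["ETag"]})

# part 2: file 2 (1MB) is too small on its own, so pad it by downloading the
# first 4MiB of file 3 and uploading the combined 5MiB as a regular part
with s3.open(file2, "rb") as f:
    buf = f.read()
pad = MIN_PART - len(buf)
with s3.open(file3, "rb") as f:
    buf += f.read(pad)
p = s3.call_s3(
    "upload_part", Bucket=bucket, Key=key, PartNumber=2,
    UploadId=mpu["UploadId"], Body=buf,
)
parts.append({"PartNumber": 2, "ETag": p["ETag"]})

# part 3: copy the remainder of file 3 server-side
size3 = s3.info(file3)["size"]
p = s3.call_s3(
    "upload_part_copy", Bucket=bucket, Key=key, PartNumber=3,
    UploadId=mpu["UploadId"], CopySource=file3,
    CopySourceRange=f"bytes={pad}-{size3 - 1}",
)
parts.append({"PartNumber": 3, "ETag": p["CopyPartResult"]["ETag"]})

s3.call_s3(
    "complete_multipart_upload", Bucket=bucket, Key=key,
    UploadId=mpu["UploadId"], MultipartUpload={"Parts": parts},
)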
I may have to do one of these for some work. I will report back if I develop something generally useful.
I believe the whole point of the limitations in S3 merge is that the remote service will not split chunks at all. So in your example, you can read 4MB of file3, but you cannot copy the remaining (5GB - 4MB) server-side - you must copy the whole chunk or not at all.
That example isn't overly complicated as file 1 can be split into 2 compliant chunks.
I don't think it can, you cannot split any chunk.
If this were the case, then _copy_managed would have plenty of issues with files written with unusual part sizes, but it doesn't.
A little experimentation tells me this isn't an issue as long as parts are within the API size limits.
out_path = f"{test_bucket_name}/test/mpu_parts.tif"
# specify some files and byte ranges wanted from those files
# these are not part aligned in the source files
copy_parts = [
    (a, 5, (10 * 2**20) + 5),
    (b, 0, 100),
]
bucket, key, _ = s3.split_path(out_path)
mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)
mpu_parts = []
for part_number, (source_path, brange_first, brange_last) in enumerate(copy_parts, start=1):
    part_mpu = s3.call_s3(
        "upload_part_copy",
        Bucket=mpu["Bucket"],
        Key=mpu["Key"],
        PartNumber=part_number,
        UploadId=mpu["UploadId"],
        CopySource=source_path,
        # brange_last - 1 to match standard python slice semantics
        CopySourceRange=f"bytes={brange_first}-{brange_last - 1}",
    )
    mpu_parts.append({"PartNumber": part_number, "ETag": part_mpu["CopyPartResult"]["ETag"]})
fin = s3.call_s3(
    "complete_multipart_upload",
    Bucket=bucket,
    Key=key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": mpu_parts},
)
assert s3.info(out_path)["size"] == (10 * 2**20) + 100
As to AWS's intent, I suspect it is because parts are not merged internally on S3 and storing a small file for each part is inefficient. There is supposedly some performance benefit from making requests that align to part boundaries, but I've never tested this.
x-amz-copy-source-range: The range of bytes to copy from the source object. The range value must use the form bytes=first-last, where the first and last are the zero-based byte offsets to copy. For example, bytes=0-9 indicates that you want to copy the first 10 bytes of the source. You can copy a range only if the source object is greater than 5 MB.
The restriction on the minimum chunk size does not apply to the very last chunk of the output file, however. The following should fail:
copy_parts = [
    (a, 5, (10 * 2**20) + 5),
    (b, 0, 100),
    (a, 5, (10 * 2**20) + 5),
]
However, I now see where you are coming from with regard to CopySourceRange, and it should be possible to merge 5MB minimum chunks by download and everything else by remote copy, but the logic to do this seems a little complex.
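Roughly, the planning step of that logic might look something like the sketch below. This is only an illustration (plan_merge is a made-up helper, sizes would come from s3.info, and none of the actual download/upload/copy calls are shown): it walks the sources in order, batching anything under the 5MiB minimum into a "download" part and emitting "copy" parts for everything that can be range-copied server-side.
MIN_PART = 5 * 2**20       # 5MiB minimum for every part except the last
MAX_PART = 5 * 2**30       # 5GiB maximum for any part

def plan_merge(sources):
    """sources: list of (path, size) tuples in output order."""
    plan = []               # ("copy", path, start, stop) or ("download", [ranges])
    pending = []            # small ranges waiting to be combined into one part
    pending_bytes = 0
    for path, size in sources:
        offset = 0
        remaining = size
        # first, top up any pending small ranges to the 5MiB minimum
        if pending_bytes:
            take = min(remaining, MIN_PART - pending_bytes)
            pending.append((path, offset, offset + take))
            pending_bytes += take
            offset += take
            remaining -= take
            if pending_bytes >= MIN_PART:
                plan.append(("download", pending))
                pending, pending_bytes = [], 0
        # copy the rest server-side in pieces of at most 5GiB, taking care
        # never to leave a leftover smaller than the minimum part size
        while remaining >= MIN_PART:
            take = min(remaining, MAX_PART)
            if 0 < remaining - take < MIN_PART:
                take = remaining - MIN_PART
            plan.append(("copy", path, offset, offset + take))
            offset += take
            remaining -= take
        # anything left is under 5MiB and must be downloaded later
        if remaining:
            pending.append((path, offset, offset + remaining))
            pending_bytes += remaining
    if pending:
        # the final part of a multipart upload is allowed to be under 5MiB
        plan.append(("download", pending))
    return plan
Each "copy" entry would become an upload_part_copy with a CopySourceRange, and each "download" entry a regular upload_part whose body is the concatenation of the downloaded ranges.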
@artttt, do you think you would like to work towards a solution?
Yes, I would like to. I'm about to be away from the computer for a while, so I will pencil it into my schedule to have a look in late Oct.
ping - this got lost
@martindurant apologies for the delay. This didn't get lost, I've been unwell.
I've done some exploring towards a solution. It seems to work well. As you identified, the logic gets a little complex, but I think I've managed to handle most of the edge cases that I could identify.
It's not very pretty at this stage, so I've dropped it into a gist rather than a PR for the moment.
https://gist.github.com/artttt/a5e9a080936615d9519104db85d03ecb
Thanks for having a go. That does look rather long! Please ping me when pydata is over and I should have time to look.
Do you think you will have the chance to clean up that code and contribute a PR?
Hi Martin, unfortunately I'm unwell again and not able to do any work on this.