
merge files fails with small or large files

Open artttt opened this issue 2 years ago • 12 comments

Sorry if there is already an issue for this. The word "merge" is difficult to search for due to its use in git.

The merge function throws an error if any part falls outside the valid part size range for an S3 multipart upload.

Large input files can be split into multiple parts using the CopySourceRange parameter of upload_part_copy. Smaller parts will need either at least 5 MiB of data to be downloaded or temporary files on S3.

Here are some failing tests

def test_merge_small_part(s3):
    with s3.open(a, "wb") as f:
        f.write(b"a" * 4 * 2**20)

    with s3.open(b, "wb") as f:
        f.write(b"a" * 10 * 2**20)
    s3.merge(test_bucket_name + "/joined", [a, b])
    assert s3.info(test_bucket_name + "/joined")["size"] == (4 * 2**20) + (10 * 2**20)

def test_merge_large_part(s3):
    with s3.open(a, "wb") as f:
        f.write(b"a" * 6 * 2**30)

    with s3.open(b, "wb") as f:
        f.write(b"a" * 10 * 2**20)
    s3.merge(test_bucket_name + "/joined", [a, b])
    assert s3.info(test_bucket_name + "/joined")["size"] == (6 * 2**30) + (10 * 2**20)

artttt avatar Sep 04 '23 03:09 artttt

You are quite right, and this is a limitation of the AWS API. We could implement something which downloads and re-writes smaller chunks. That is basically what open(..., mode="a") does when the existing file is smaller than the 5MB limit. It might get complicated, however, for something like:

file 1: 5GB chunk, 1MB chunk
file 2: 5GB chunk

This would require downloading and re-uploading the whole of the 1MB chunk plus the following 5GB chunk. We are probably better off not supporting this.

martindurant avatar Sep 05 '23 14:09 martindurant

That example isn't overly complicated as file 1 can be split into 2 compliant chunks. This one is a little trickier:

file 1: 5GB chunk
file 2: 1MB chunk
file 3: 5GB chunk

For this it wouldn't need to fully download file 3, only 4 MiB from the start.
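A rough sketch of that route (hypothetical code, not existing s3fs API; file1, file2, file3 and out_path are stand-in paths for the 5GB / 1MB / 5GB example):

# sketch only: merge file1 (5GB) + file2 (1MB) + file3 (5GB) into out_path
# while downloading no more than ~4 MiB of data
MIN_PART = 5 * 2**20  # 5 MiB minimum for every part except the last

bucket, key, _ = s3.split_path(out_path)
mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)
parts = []

# part 1: file1 fits within the 5 MiB - 5 GiB window, so copy it server-side
r = s3.call_s3("upload_part_copy", Bucket=bucket, Key=key, PartNumber=1,
               UploadId=mpu["UploadId"], CopySource=file1)
parts.append({"PartNumber": 1, "ETag": r["CopyPartResult"]["ETag"]})

# part 2: file2 alone is too small, so pad it with the first ~4 MiB of file3,
# downloaded and re-uploaded together as a single part of at least 5 MiB
pad = MIN_PART - s3.size(file2)
body = s3.cat_file(file2) + s3.cat_file(file3, start=0, end=pad)
r = s3.call_s3("upload_part", Bucket=bucket, Key=key, PartNumber=2,
               UploadId=mpu["UploadId"], Body=body)
parts.append({"PartNumber": 2, "ETag": r["ETag"]})

# part 3: the remainder of file3 is still well over 5 MiB, so copy it server-side
r = s3.call_s3("upload_part_copy", Bucket=bucket, Key=key, PartNumber=3,
               UploadId=mpu["UploadId"], CopySource=file3,
               CopySourceRange=f"bytes={pad}-{s3.size(file3) - 1}")
parts.append({"PartNumber": 3, "ETag": r["CopyPartResult"]["ETag"]})

s3.call_s3("complete_multipart_upload", Bucket=bucket, Key=key,
           UploadId=mpu["UploadId"], MultipartUpload={"Parts": parts})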

There is also a route that avoids downloading data: merge file 1 and file 2 into a temporary file on S3, then merge the temporary file and file 3.

I may have to do one of these for some work. I will report back if I develop something generally useful.

artttt avatar Sep 06 '23 04:09 artttt

I believe the whole point of the limitations in S3 merge is that the remote service will not split chunks at all. So in your example, you can read 4MB of file3, but you cannot copy the remaining (5GB - 4MB) server-side - you must copy the whole chunk or not at all.

That example isn't overly complicated as file 1 can be split into 2 compliant chunks.

I don't think it can; you cannot split any chunk.

martindurant avatar Sep 06 '23 14:09 martindurant

If this were the case then _copy_managed would have plenty of issues with files written with unusual part sizes, but it doesn't.

A little experimentation tells me this isn't an issue as long as parts are within the API size limits.

out_path = f"{test_bucket_name}/test/mpu_parts.tif"

# specify some files and byte ranges wanted from those files
# these are not part aligned in the source files
copy_parts = [(a,5,(10 * 2**20)+5),
              (b,0,100),
            ]
bucket, key, _ = s3.split_path(out_path)

mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)

mpu_parts = []
for part_number, (source_path, brange_first, brange_last) in enumerate(copy_parts, start=1):
    # brange_last - 1: the tuples use python slice semantics, but HTTP ranges are inclusive
    part_mpu = s3.call_s3(
        "upload_part_copy",
        Bucket=mpu["Bucket"],
        Key=mpu["Key"],
        PartNumber=part_number,
        UploadId=mpu["UploadId"],
        CopySource=source_path,
        CopySourceRange=f"bytes={brange_first}-{brange_last - 1}",
    )
    mpu_parts.append({"PartNumber": part_number, "ETag": part_mpu["CopyPartResult"]["ETag"]})

fin = s3.call_s3(
    "complete_multipart_upload",
    Bucket=bucket,
    Key=key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": mpu_parts},
)
assert s3.info(out_path)["size"] == (10 * 2**20) + 100

As to AWS's intent, I suspect it is because parts are not merged internally on S3, and storing a small file for each part would be inefficient. There is supposedly some performance benefit from making requests that align to part boundaries, but I've never tested this.

artttt avatar Sep 07 '23 03:09 artttt

x-amz-copy-source-range: The range of bytes to copy from the source object. The range value must use the form bytes=first-last, where the first and last are the zero-based byte offsets to copy. For example, bytes=0-9 indicates that you want to copy the first 10 bytes of the source. You can copy a range only if the source object is greater than 5 MB.

The restriction on the minimum chunk size does not apply to the very last chunk of the output file, however. The following should fail:

copy_parts = [
    (a,5,(10 * 2**20)+5),
    (b,0,100),
    (a,5,(10 * 2**20)+5)
]
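A hedged aside: S3 cannot know which part will be the last until the upload is completed, so the individual upload_part_copy calls should still go through; the failure would be expected from complete_multipart_upload, as an EntityTooSmall error. A minimal sketch of checking that, reusing the loop above with this copy_parts list:

import pytest
from botocore.exceptions import ClientError

# expected: every part except the last must be at least 5 MB,
# and the 100-byte middle part violates that at completion time
with pytest.raises(ClientError, match="EntityTooSmall"):
    s3.call_s3(
        "complete_multipart_upload",
        Bucket=bucket,
        Key=key,
        UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": mpu_parts},
    )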

However, I now see where you are coming from with regard to CopySourceRange. It should be possible to merge chunks below the 5 MB minimum by download and everything else by remote copy, but the logic to do this seems a little complex.
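For the record, a rough sketch of how that logic could be split (names like plan_parts, PART_MIN and PART_MAX are made up here; this is not the eventual implementation): each source byte range either becomes a server-side upload_part_copy, or is grouped with its neighbours into a buffer that must be downloaded and re-uploaded because it would otherwise fall below the 5 MiB minimum for non-final parts.

PART_MIN = 5 * 2**20   # minimum size of any part except the last
PART_MAX = 5 * 2**30   # maximum size of a single part

def plan_parts(files):
    """files: list of (path, size) pairs, e.g. taken from s3.info.
    Returns part actions in output order:
      ("copy", path, start, stop)            - server-side upload_part_copy
      ("upload", [(path, start, stop), ...]) - download these ranges and
                                               re-upload them as one part
    """
    parts = []
    pending = []        # small ranges waiting to reach PART_MIN
    pending_size = 0

    for path, size in files:
        start = 0
        # top up a pending small buffer from the front of this file
        if pending and pending_size < PART_MIN:
            take = min(size, PART_MIN - pending_size)
            pending.append((path, 0, take))
            pending_size += take
            start = take
        if pending_size >= PART_MIN:
            parts.append(("upload", pending))
            pending, pending_size = [], 0

        remaining = size - start
        if remaining == 0:
            continue
        if remaining < PART_MIN:
            # too small to copy server-side; wait for the next file
            pending, pending_size = [(path, start, size)], remaining
        else:
            # copy server-side in ranges of at most PART_MAX, making sure
            # not to leave a tail shorter than PART_MIN
            while start < size:
                stop = min(start + PART_MAX, size)
                if stop < size and size - stop < PART_MIN:
                    stop = size - PART_MIN
                parts.append(("copy", path, start, stop))
                start = stop

    if pending:
        # the final part is allowed to be smaller than PART_MIN
        parts.append(("upload", pending))
    return parts

For the 5GB / 1MB / 5GB example above, this plan copies file 1 and most of file 3 server-side, and only downloads file 2 plus the first ~4 MiB of file 3.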

martindurant avatar Sep 07 '23 13:09 martindurant

@artttt , do you think you would like to work towards a solution?

martindurant avatar Oct 04 '23 15:10 martindurant

Yes I would like to. I'm about to be away from the computer for a while, so I will pencil it into my schedule to have a look in late October.

artttt avatar Oct 05 '23 01:10 artttt

ping - this got lost

martindurant avatar Nov 20 '23 19:11 martindurant

@martindurant apologies for the delay. This didn't get lost; I've been unwell.

I've done some exploring towards a solution, and it seems to work well. As you identified, the logic gets a little complex, but I think I've managed to handle most of the edge cases I could identify.

It's not very pretty at this stage, so I've dropped it into a gist rather than a PR for the moment.

https://gist.github.com/artttt/a5e9a080936615d9519104db85d03ecb

artttt avatar Dec 05 '23 10:12 artttt

Thanks for having a go. That does look rather long! Please ping me when PyData is over and I should have time to look.

martindurant avatar Dec 05 '23 15:12 martindurant

Do you think you will have the chance to clean up that code and contribute a PR?

martindurant avatar Apr 22 '24 19:04 martindurant

Hi Martin, unfortunately I'm unwell again and not able to do any work on this.

artttt avatar Apr 24 '24 00:04 artttt