s5cmd

Known file corruption risk when copying between s3-compatible and s3-incompatible filesystems

Open kucukaslan opened this issue 3 years ago • 1 comments

Background

The file/object concepts of S3-compatible services and of ordinary filesystems are different, so their respective hierarchies (layouts) do not have a one-to-one correspondence.

For example[^bonus], in S3 both ke/y and ke//y are valid, distinct object keys. However, in many filesystems (especially on Unix-like OSs) the slash is the path separator, and both keys map to ke/y, that is, file y in directory ke. So in sync and cp operations from S3 to such a filesystem, both objects will be written to the same file.
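The key collision described above can be illustrated with a minimal sketch. It is not s5cmd code: the function names are hypothetical, and `posixpath.normpath` is used only as a stand-in for how a POSIX filesystem resolves repeated slashes (it also resolves `.` and `..` components, which a real tool would need to treat separately).

```python
import posixpath
from collections import defaultdict


def local_path_for_key(key):
    # Stand-in for filesystem path resolution: collapses "ke//y" to "ke/y".
    # Assumption: keys are relative and do not start with a slash.
    return posixpath.normpath(key)


def find_collisions(keys):
    # Group object keys by the local path they would be written to and
    # keep only the paths that more than one key maps to.
    by_path = defaultdict(list)
    for key in keys:
        by_path[local_path_for_key(key)].append(key)
    return {path: ks for path, ks in by_path.items() if len(ks) > 1}


if __name__ == "__main__":
    print(find_collisions(["ke/y", "ke//y", "other"]))
```

A check like this, run over a listing before any download starts, would be one way to report which keys are going to land on the same local file.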

What happens

s5cmd allows concurrent download of different files and of different parts of the same file, so the ke/y and ke//y objects may be downloaded concurrently. If both download operations have the destination file open at the same time (which can happen), both will write to the same file and the downloaded file will be corrupted: the contents of ke/y and ke//y will be arbitrarily interleaved and overwritten by one another.
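The interleaving can be reproduced deterministically with a small simulation. This is only a sketch under assumptions: the object contents, the part size, and the fixed write order are made up for illustration, and the code does not reflect how s5cmd schedules its writers. A real run would interleave in an unpredictable order.

```python
import os
import tempfile


def write_part(path, offset, data):
    # Each part writer opens the destination on its own and writes at its
    # own offset, as a multipart downloader might.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def simulate_interleaved_download():
    obj_a = b"AAAAAAAA"  # stand-in contents of object ke/y
    obj_b = b"BBBBBBBB"  # stand-in contents of object ke//y
    part_size = 4
    with tempfile.TemporaryDirectory() as tmp:
        dest = os.path.join(tmp, "y")  # both keys resolve to this file
        open(dest, "wb").close()
        # One possible ordering of the four part writes (fixed here so the
        # outcome is reproducible).
        schedule = [(obj_a, 0), (obj_b, 0), (obj_b, 4), (obj_a, 4)]
        for obj, offset in schedule:
            write_part(dest, offset, obj[offset:offset + part_size])
        with open(dest, "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(simulate_interleaved_download())
```

With this particular ordering the resulting file is half of one object and half of the other, so it matches neither ke/y nor ke//y.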

Conclusion

This problem arises from the fundamental incompatibility between S3 and other filesystems, so in a sense it does not have a solution.

As a mitigation, one might propose blocking concurrent writes to the same file, so that the resulting file is a copy of either ke/y or ke//y instead of being corrupted. However, this would also block valid concurrent downloads of different parts of the same file, so it should not be done.

Nevertheless, a warning that emphasizes these limitations could be added to the readme.

P.S. We (with @seruman) noticed this corruption problem while discussing how "/" and "\" can be handled, given the incompatibilities among S3, Unix-like systems and Windows.

[^bonus]: There is another, similar incompatibility: in S3 both ke and ke/y can coexist. But in Unix-like filesystems ke/y implies the existence of a directory ke, which conflicts with a file named ke.
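The file-versus-directory conflict from the footnote can be detected from a key listing alone. The following is a rough sketch with a hypothetical function name, not something s5cmd provides: it reports every pair where one key is also a proper path prefix of another key.

```python
def find_file_dir_conflicts(keys):
    # A key such as "ke" conflicts with "ke/y" because the local filesystem
    # would need "ke" to be both a regular file and a directory.
    key_set = set(keys)
    conflicts = set()
    for key in keys:
        parts = key.split("/")
        # Check every proper prefix of the key, one component at a time.
        for i in range(1, len(parts)):
            prefix = "/".join(parts[:i])
            if prefix in key_set:
                conflicts.add((prefix, key))
    return sorted(conflicts)


if __name__ == "__main__":
    print(find_file_dir_conflicts(["ke", "ke/y", "a/b"]))
```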

kucukaslan avatar Aug 16 '22 12:08 kucukaslan

A possible workaround: the user could provide an option, with a regex or other mapping, to convert file paths when syncing between a filesystem and S3. This would be very useful for uploading files for static sites. For instance, the mapping could strip the html extension when uploading a built static Next.js site for "/path" and "/path/subpath", which consists of the files "path.html" and "path/subpath.html" and a "path" folder, while in the bucket the objects should be named "path" and "path/subpath".
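As a rough illustration of what such a mapping could look like (s5cmd has no such option as far as this thread states; the rule format and function name below are purely hypothetical), a list of regex substitutions could be applied to each local path to derive the object key:

```python
import re

# Hypothetical user-supplied rules: (pattern, replacement) pairs applied in
# order. This single rule strips a trailing ".html" extension.
RULES = [(re.compile(r"\.html$"), "")]


def map_path_to_key(path, rules=RULES):
    # Apply each substitution in turn to turn a local path into a key.
    key = path
    for pattern, replacement in rules:
        key = pattern.sub(replacement, key)
    return key


if __name__ == "__main__":
    for p in ["path.html", "path/subpath.html", "style.css"]:
        print(p, "->", map_path_to_key(p))
```

Note that a mapping like this can itself create collisions (for example, if both "path" and "path.html" exist locally), so it would need the same kind of care as the download case discussed in the issue.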

alex-kowalczyk avatar Dec 20 '23 06:12 alex-kowalczyk