Large memory utilisation when syncing many files
Problem
I have a problem when I try to sync a large number of files, approximately 400,000. When I do this, around 11 GB of memory is used. It works for now, but it will become problematic when I try to move this process to a weaker VM.
I did some debugging and concluded that the problem is that the metadata of the files (type Object) is fetched and stored in memory in the function getSourceAndDestinationObjects.
Proposed solution
The solution I propose is to implement a new flag that switches to external sorting instead of in-memory sorting. It should be optional, since performance will be lower in this mode. I do not have any experience with external sorting, but I would like to help by working on this after we have decided how it should be solved. I am, of course, open to other ideas for solving this problem.
Version: v2.0.0-beta.2
Since the order is only used to determine whether an object is only in the source, only in the destination, or in both, maybe partitioning it into files is easier and more performant than external sorting. Then you only need to read the contents of the right partition (say, apply some hash and use the result): if an object is present only in the right set, it will be present in the file that matches its hash for the right set and not in the file that matches its hash for the left set. You can build a set in memory per file and discard it after you're done with it. It would need to be a different hash than whatever Go uses for maps though, otherwise your set will be just a list 🤔
You can of course create a list with the contents of each file and sort that, and after that use the same old algorithm.
In any case, I see that many temporary allocations could come from the appends used willy-nilly in that part of the code, and from the fact that it gathers everything together (which is not completely avoidable).
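To make the partitioning idea a bit more concrete, here is a minimal sketch under a few assumptions I'm making up for illustration (FNV-1a as the partition hash, a fixed bucket count, a key-per-line spill format, and plain strings instead of the real Object type):

```go
// Sketch of hash-partitioning keys into per-bucket files for each side,
// then comparing one bucket at a time so only one bucket lives in memory.
// Error handling is elided for brevity.
package main

import (
	"bufio"
	"fmt"
	"hash/fnv"
	"os"
	"path/filepath"
)

const numBuckets = 8 // tune so that a single bucket fits comfortably in memory

func bucketOf(key string) int {
	h := fnv.New32a() // any stable hash shared by both sides works
	h.Write([]byte(key))
	return int(h.Sum32() % numBuckets)
}

// spill appends one key per line to the partition file for the given side.
func spill(dir, side, key string) {
	name := filepath.Join(dir, fmt.Sprintf("%s-%d.txt", side, bucketOf(key)))
	f, _ := os.OpenFile(name, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	defer f.Close()
	fmt.Fprintln(f, key)
}

// loadSet reads one partition file into a set; a missing file is an empty bucket.
func loadSet(path string) map[string]struct{} {
	set := make(map[string]struct{})
	f, err := os.Open(path)
	if err != nil {
		return set
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		set[sc.Text()] = struct{}{}
	}
	return set
}

func main() {
	dir, _ := os.MkdirTemp("", "sync-partition")
	defer os.RemoveAll(dir)

	for _, k := range []string{"a.b", "a/b", "ab"} { // pretend source listing
		spill(dir, "src", k)
	}
	for _, k := range []string{"a/b", "c"} { // pretend destination listing
		spill(dir, "dst", k)
	}

	// Compare bucket by bucket; each source set is discarded after use.
	for b := 0; b < numBuckets; b++ {
		src := loadSet(filepath.Join(dir, fmt.Sprintf("src-%d.txt", b)))
		for k := range loadSet(filepath.Join(dir, fmt.Sprintf("dst-%d.txt", b))) {
			if _, ok := src[k]; ok {
				fmt.Println("in both:", k)
				delete(src, k)
			} else {
				fmt.Println("only in destination:", k)
			}
		}
		for k := range src {
			fmt.Println("only in source:", k)
		}
	}
}
```

The hash only has to be stable and shared by both sides; peak memory is then bounded by the largest bucket rather than by the full listing.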
Thanks a lot to both of you!
Current logic of sync
To determine which files are to be copied, the sync command needs to identify which files are present only in the source, only in the destination, and in both. The straightforward (and good) solution is:
- obtain the lists of files in both the source and the destination:
https://github.com/peak/s5cmd/blob/123b1c7fc9c614aa214a6795468fa140a38ad05e/command/sync.go#L280
- sort both lists:
https://github.com/peak/s5cmd/blob/123b1c7fc9c614aa214a6795468fa140a38ad05e/command/sync.go#L223-L228
- then compare them, taking advantage of the order.
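For reference, the comparison step boils down to a single merge-style pass over the two sorted lists. A simplified sketch, with plain strings standing in for the real object type:

```go
// Simplified sketch of the merge-style comparison over two sorted key lists.
// The real sync code operates on storage objects, not plain strings.
func compareSorted(src, dst []string) (onlySrc, onlyDst, both []string) {
	i, j := 0, 0
	for i < len(src) && j < len(dst) {
		switch {
		case src[i] == dst[j]: // present on both sides
			both = append(both, src[i])
			i++
			j++
		case src[i] < dst[j]: // missing from destination
			onlySrc = append(onlySrc, src[i])
			i++
		default: // missing from source
			onlyDst = append(onlyDst, dst[j])
			j++
		}
	}
	// Whatever remains on either side has no counterpart on the other.
	onlySrc = append(onlySrc, src[i:]...)
	onlyDst = append(onlyDst, dst[j:]...)
	return onlySrc, onlyDst, both
}
```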
Problem
To sort, we need the full list of objects, so we need to bring all of them into memory. This might be tolerable if there is enough memory to keep them (in my tests, 1 million objects needed around 7 GB of RAM at peak). Once the number of files becomes huge, the process is killed with `fatal error: runtime: out of memory`.
Solution Candidates
Obtain the keys in a sorted order rather than sorting after obtaining
The goal is to ensure that storage.List sends objects in order, so that we don't need to get the full list and sort it[^2]. Then we can just read from the two channels until both are closed, and in the process send the objects to the appropriate channels[^3].
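Put differently, the same merge pass can consume the two listings as streams, so only a handful of objects are resident at any time. A hedged sketch (plain string channels stand in for the real Object channels; the three output channels are illustrative):

```go
// Sketch of comparing two already-sorted streams of keys without buffering
// the full listings. srcOnly, dstOnly and both are the channels the rest of
// the sync logic would consume.
func compareStreams(src, dst <-chan string, srcOnly, dstOnly, both chan<- string) {
	defer close(srcOnly)
	defer close(dstOnly)
	defer close(both)

	s, sOK := <-src
	d, dOK := <-dst
	for sOK && dOK {
		switch {
		case s == d:
			both <- s
			s, sOK = <-src
			d, dOK = <-dst
		case s < d:
			srcOnly <- s
			s, sOK = <-src
		default:
			dstOnly <- d
			d, dOK = <-dst
		}
	}
	// One side is exhausted; drain the other.
	for ; sOK; s, sOK = <-src {
		srcOnly <- s
	}
	for ; dOK; d, dOK = <-dst {
		dstOnly <- d
	}
}
```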
S3 Server
Objects are returned sorted in an ascending order of the respective key names in the list[^1].
Local
The List method (with wildcards) returns results that are almost sorted, since filepath.Glob eventually sorts them. So does (?) karrick/godirwalk.
Drawbacks
- S3: Not all S3-compatible servers return objects in sorted order.
- Local: It is almost sorted. Consider three files (where `/` is the directory separator):
  - `a.b`
  - `a/b`
  - `ab`

  Normally `.` < `/` < `b`, so we expect them to come out in the order I wrote them. But when the aforementioned sort operation is called, the strings being sorted are:
  - `a.b`
  - `a`
  - `ab`

  since the name of the directory entry is `a`[^5]. So the output order becomes `a/b` < `a.b` < `ab`. To fix it we might need to write our own walker.
External Sort
Use external sorting as @misuto proposed. There are some libraries on GitHub that might be worth trying; lanrat/extsort in particular seems promising, since it uses channels to receive input and return output.
Drawback (?)
I've personally never implemented an external sorting algorithm, and from my point of view there are a lot of unclear points. But it still might be worth trying.
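For reference, the mechanics are fairly standard: sort fixed-size chunks in memory, spill each sorted chunk to a temporary file, then merge the chunk files. A rough sketch that is not tied to any particular library (the chunk size, the key-per-line format, and the helper names are all illustrative, and keys stand in for the real object type):

```go
// Rough sketch of an external sort for object keys: sort chunks in memory,
// spill them to temporary files, then merge the files into one sorted stream.
package extsortsketch

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

const chunkSize = 100_000 // number of keys kept in memory at once

// spillChunks consumes keys from in and writes each sorted chunk to its own file.
func spillChunks(in <-chan string, dir string) ([]string, error) {
	var files []string
	chunk := make([]string, 0, chunkSize)
	flush := func() error {
		if len(chunk) == 0 {
			return nil
		}
		sort.Strings(chunk)
		name := filepath.Join(dir, fmt.Sprintf("chunk-%d.txt", len(files)))
		f, err := os.Create(name)
		if err != nil {
			return err
		}
		w := bufio.NewWriter(f)
		for _, k := range chunk {
			fmt.Fprintln(w, k)
		}
		if err := w.Flush(); err != nil {
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
		files = append(files, name)
		chunk = chunk[:0]
		return nil
	}
	for k := range in {
		chunk = append(chunk, k)
		if len(chunk) == chunkSize {
			if err := flush(); err != nil {
				return nil, err
			}
		}
	}
	if err := flush(); err != nil {
		return nil, err
	}
	return files, nil
}

// mergeChunks emits the globally sorted keys by repeatedly picking the
// smallest head among the chunk files (a heap would replace the linear
// scan when there are many chunks).
func mergeChunks(files []string, out chan<- string) error {
	defer close(out)
	scanners := make([]*bufio.Scanner, len(files))
	current := make([]string, len(files))
	alive := make([]bool, len(files))
	for i, name := range files {
		f, err := os.Open(name)
		if err != nil {
			return err
		}
		defer f.Close() // chunk files stay open for the whole merge
		scanners[i] = bufio.NewScanner(f)
		if alive[i] = scanners[i].Scan(); alive[i] {
			current[i] = scanners[i].Text()
		}
	}
	for {
		min := -1
		for i := range files {
			if alive[i] && (min == -1 || current[i] < current[min]) {
				min = i
			}
		}
		if min == -1 {
			return nil // all chunk files exhausted
		}
		out <- current[min]
		if alive[min] = scanners[min].Scan(); alive[min] {
			current[min] = scanners[min].Text()
		}
	}
}
```

Since the input and output here are channels, this shape (or a library such as lanrat/extsort, which, as noted above, also works over channels) could feed the comparison logic without restructuring it much.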
Use more efficient data structures (Trie)
At that point we have the full keys and the full addresses (URL.Absolute). So we might[^4] use only the absolute path instead of a URL object in our sort code/algorithm. Additionally, we can use a Trie to further reduce the memory usage.
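To give a feel for why a Trie helps: keys that share path prefixes store each shared segment once instead of once per key, so deep, repetitive layouts pay much less per object. A minimal sketch over plain path strings (splitting on "/" and the node layout are my assumptions; a real implementation would hang the required object metadata off the leaf nodes):

```go
// Minimal path trie: each node stores one path segment, so a common prefix
// such as "logs/2023/01" is stored once instead of once per key.
package triesketch

import "strings"

type trieNode struct {
	children map[string]*trieNode
	isKey    bool // true if a full object key ends at this node
}

func newTrieNode() *trieNode {
	return &trieNode{children: make(map[string]*trieNode)}
}

// Insert adds an object key, splitting it into "/"-separated segments.
func (n *trieNode) Insert(key string) {
	node := n
	for _, seg := range strings.Split(key, "/") {
		child, ok := node.children[seg]
		if !ok {
			child = newTrieNode()
			node.children[seg] = child
		}
		node = child
	}
	node.isKey = true
}

// Contains reports whether the exact key was inserted.
func (n *trieNode) Contains(key string) bool {
	node := n
	for _, seg := range strings.Split(key, "/") {
		child, ok := node.children[seg]
		if !ok {
			return false
		}
		node = child
	}
	return node.isKey
}
```

How much this saves obviously depends on how much structure the keys share; a flat bucket full of long unique names gains little.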
Other tools
rclone
It uses in-memory sorting. Note that its implementation was the reference for s5cmd's sync: https://github.com/rclone/rclone/blob/HEAD/fs/march/march.go#L304
MC
MC does not sort the whole list. For remote, it relies on the ordering assumption. For local, it uses a custom algorithm that returns objects in sorted order. Its algorithm also handles the problem I mentioned earlier by appending a "/" to the end of directory names (in the Less method): https://github.com/minio/mc/blob/fe5b2f97e3d22324474e742edd15d9562467ef75/client-fs.go#L285-L297
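The trick is small enough to restate. Roughly (with illustrative types rather than MC's actual ones), the comparison appends "/" to directory names so a directory sorts exactly where the keys beneath it will:

```go
// Paraphrase of the ordering trick: compare directory names as if they
// ended with "/", so a directory sorts where its contents will.
type entry struct {
	name  string
	isDir bool
}

func sortKey(e entry) string {
	if e.isDir {
		return e.name + "/"
	}
	return e.name
}

// less yields a.b < a/... < ab, matching the lexicographic order of the
// full object keys instead of the order of the bare directory names.
func less(a, b entry) bool {
	return sortKey(a) < sortKey(b)
}
```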
Conclusion
We are inclined to use, or at least try, the third option (with the Trie). But, unfortunately, we will delay working on this issue at least until the completion of #475 and #478.
Extra commentary
There are also some places where objects are not released (references aren't set to nil) immediately after they're used. I guess @Oppen meant some of those places. But even if we fix them, they probably won't reduce the peak memory usage, so they won't solve the out-of-memory error or fix #447. Nevertheless, I think those improvements should still be made.
[^2]: Modify getSourceAndDestinationObjects (return channels) and remove the sorting code.
[^3]: Modify compareObjects (receive and return channels).
[^4]: TBH, I suspect that we might need the URL.relative and URL.base fields too, but I'm not sure; I haven't thought it through thoroughly yet.
[^5]: and the objects in that directory are not known yet
For reference, the GCS docs say that objects in the list are ordered lexicographically by name.
https://cloud.google.com/storage/docs/xml-api/get-bucket-list