Large memory utilisation when syncing many files
Problem
I have a problem when I try to sync a large number of files, approximately 400,000. When I do this, around 11 GB of memory is used. It works for now, but it will become problematic when I try to move this process to a weaker VM.
I did some debugging and concluded that the problem is that the metadata of the files (type Object) is fetched and stored in memory in the function getSourceAndDestinationObjects.
Proposed solution
The solution I propose is to implement a new flag that switches to external sorting instead of in-memory sorting. It should be optional, since performance will be lower in this mode. I do not have any experience with external sorting, but I would like to help by working on this after we have decided how it should be solved. I am, of course, open to other ideas for solving this problem.
Version: v2.0.0-beta.2
Since the order is only used to determine whether an object is only in the source, only in the destination, or in both, maybe partitioning it into files is easier and more performant than external sorting. Then you only need to read the contents of the right partition (say, apply some hash and use the result): if an object is present only in the right set, it will be present in the file that matches its hash for the right set and not in the file that matches its hash for the left set. You can build a set in memory per file and discard it after you're done with it. It would need to be a different hash than whatever Go uses for maps though, otherwise your set will be just a list 🤔
You can of course create a list with the contents of each file and sort that, and after that use the same old algorithm.
In any case, I see that many temporary allocations could come from the appends used willy-nilly in that part of the code, and from the fact that it gathers everything together (which is not completely avoidable).
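To make the partitioning idea a bit more concrete, here is a minimal sketch under a few assumptions I'm making up for illustration (FNV-1a as the partition hash, a fixed bucket count, a key-per-line spill format, and plain strings instead of the real Object type):

```go
// Sketch of hash-partitioning keys into per-bucket files for each side,
// then comparing one bucket at a time so only one bucket lives in memory.
// Error handling is elided for brevity.
package main

import (
	"bufio"
	"fmt"
	"hash/fnv"
	"os"
	"path/filepath"
)

const numBuckets = 8 // tune so that a single bucket fits comfortably in memory

func bucketOf(key string) int {
	h := fnv.New32a() // any stable hash shared by both sides works
	h.Write([]byte(key))
	return int(h.Sum32() % numBuckets)
}

// spill appends one key per line to the partition file for the given side.
func spill(dir, side, key string) {
	name := filepath.Join(dir, fmt.Sprintf("%s-%d.txt", side, bucketOf(key)))
	f, _ := os.OpenFile(name, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	defer f.Close()
	fmt.Fprintln(f, key)
}

// loadSet reads one partition file into a set; a missing file is an empty bucket.
func loadSet(path string) map[string]struct{} {
	set := make(map[string]struct{})
	f, err := os.Open(path)
	if err != nil {
		return set
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		set[sc.Text()] = struct{}{}
	}
	return set
}

func main() {
	dir, _ := os.MkdirTemp("", "sync-partition")
	defer os.RemoveAll(dir)

	for _, k := range []string{"a.b", "a/b", "ab"} { // pretend source listing
		spill(dir, "src", k)
	}
	for _, k := range []string{"a/b", "c"} { // pretend destination listing
		spill(dir, "dst", k)
	}

	// Compare bucket by bucket; each source set is discarded after use.
	for b := 0; b < numBuckets; b++ {
		src := loadSet(filepath.Join(dir, fmt.Sprintf("src-%d.txt", b)))
		for k := range loadSet(filepath.Join(dir, fmt.Sprintf("dst-%d.txt", b))) {
			if _, ok := src[k]; ok {
				fmt.Println("in both:", k)
				delete(src, k)
			} else {
				fmt.Println("only in destination:", k)
			}
		}
		for k := range src {
			fmt.Println("only in source:", k)
		}
	}
}
```

The hash only has to be stable and shared by both sides; peak memory is then bounded by the largest bucket rather than by the full listing.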
Thanks a lot to both of you!
Current logic of sync
To determine which files are to be copied, the sync command needs to identify which files are present only in the source, only in the destination, and in both. The straightforward (and good) solution is:
- obtain the lists of files in both the source and the destination:
https://github.com/peak/s5cmd/blob/123b1c7fc9c614aa214a6795468fa140a38ad05e/command/sync.go#L280
- sort both lists:
https://github.com/peak/s5cmd/blob/123b1c7fc9c614aa214a6795468fa140a38ad05e/command/sync.go#L223-L228
- then compare them, taking advantage of the order.
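For reference, the comparison step boils down to a single merge-style pass over the two sorted lists. A simplified sketch, with plain strings standing in for the real object type:

```go
// Simplified sketch of the merge-style comparison over two sorted key lists.
// The real sync code operates on storage objects, not plain strings.
func compareSorted(src, dst []string) (onlySrc, onlyDst, both []string) {
	i, j := 0, 0
	for i < len(src) && j < len(dst) {
		switch {
		case src[i] == dst[j]: // present on both sides
			both = append(both, src[i])
			i++
			j++
		case src[i] < dst[j]: // missing from destination
			onlySrc = append(onlySrc, src[i])
			i++
		default: // missing from source
			onlyDst = append(onlyDst, dst[j])
			j++
		}
	}
	// Whatever remains on either side has no counterpart on the other.
	onlySrc = append(onlySrc, src[i:]...)
	onlyDst = append(onlyDst, dst[j:]...)
	return onlySrc, onlyDst, both
}
```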
Problem
To sort, we need the full list of objects, so we need to bring all of them into memory. This might be tolerable if there is enough memory to keep them (in my tests, 1 million objects needed around 7 GB of RAM at peak). Once the number of files becomes huge, the process is killed with `fatal error: runtime: out of memory`.
Solution Candidates
Obtain the keys in a sorted order rather than sorting after obtaining
The goal is to ensure that storage.List sends objects in order, so that we don't need to get the full list and sort it[^2]. Then we can just read from the two channels until both are closed, and in the process send the objects to the appropriate channels[^3].
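Put differently, the same merge pass can consume the two listings as streams, so only a handful of objects are resident at any time. A hedged sketch (plain string channels stand in for the real Object channels; the three output channels are illustrative):

```go
// Sketch of comparing two already-sorted streams of keys without buffering
// the full listings. srcOnly, dstOnly and both are the channels the rest of
// the sync logic would consume.
func compareStreams(src, dst <-chan string, srcOnly, dstOnly, both chan<- string) {
	defer close(srcOnly)
	defer close(dstOnly)
	defer close(both)

	s, sOK := <-src
	d, dOK := <-dst
	for sOK && dOK {
		switch {
		case s == d:
			both <- s
			s, sOK = <-src
			d, dOK = <-dst
		case s < d:
			srcOnly <- s
			s, sOK = <-src
		default:
			dstOnly <- d
			d, dOK = <-dst
		}
	}
	// One side is exhausted; drain the other.
	for ; sOK; s, sOK = <-src {
		srcOnly <- s
	}
	for ; dOK; d, dOK = <-dst {
		dstOnly <- d
	}
}
```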
S3 Server
Objects are returned sorted in an ascending order of the respective key names in the list[^1].
Local
The List method (with wildcards) returns results that are almost sorted, since filepath.Glob eventually sorts them. So does (?) karrick/godirwalk.
Drawbacks
- S3: Not all S3-compatible servers return objects in sorted order.
- Local: It is almost sorted. Consider three files (where `/` is the directory separator):
  - `a.b`
  - `a/b`
  - `ab`

  Normally `.` < `/` < `b`, so we expect them to come out in the order I wrote them. But when the aforementioned sort operation is called, the strings being sorted are:
  - `a.b`
  - `a`
  - `ab`

  since the name of the directory entry is `a`[^5]. So the output order becomes `a/b` < `a.b` < `ab`. To fix it we might need to write our own walker.
External Sort
Use external sorting as @misuto proposed. There are some libraries on GitHub that might be worth trying; lanrat/extsort in particular seems promising, since it uses channels to receive input and return output.
Drawback (?)
I've personally never implemented an external sorting algorithm, and from my point of view there are a lot of unclear points. But it still might be worth trying.
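For reference, the mechanics are fairly standard: sort fixed-size chunks in memory, spill each sorted chunk to a temporary file, then merge the chunk files. A rough sketch that is not tied to any particular library (the chunk size, the key-per-line format, and the helper names are all illustrative, and keys stand in for the real object type):

```go
// Rough sketch of an external sort for object keys: sort chunks in memory,
// spill them to temporary files, then merge the files into one sorted stream.
package extsortsketch

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

const chunkSize = 100_000 // number of keys kept in memory at once

// spillChunks consumes keys from in and writes each sorted chunk to its own file.
func spillChunks(in <-chan string, dir string) ([]string, error) {
	var files []string
	chunk := make([]string, 0, chunkSize)
	flush := func() error {
		if len(chunk) == 0 {
			return nil
		}
		sort.Strings(chunk)
		name := filepath.Join(dir, fmt.Sprintf("chunk-%d.txt", len(files)))
		f, err := os.Create(name)
		if err != nil {
			return err
		}
		w := bufio.NewWriter(f)
		for _, k := range chunk {
			fmt.Fprintln(w, k)
		}
		if err := w.Flush(); err != nil {
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
		files = append(files, name)
		chunk = chunk[:0]
		return nil
	}
	for k := range in {
		chunk = append(chunk, k)
		if len(chunk) == chunkSize {
			if err := flush(); err != nil {
				return nil, err
			}
		}
	}
	if err := flush(); err != nil {
		return nil, err
	}
	return files, nil
}

// mergeChunks emits the globally sorted keys by repeatedly picking the
// smallest head among the chunk files (a heap would replace the linear
// scan when there are many chunks).
func mergeChunks(files []string, out chan<- string) error {
	defer close(out)
	scanners := make([]*bufio.Scanner, len(files))
	current := make([]string, len(files))
	alive := make([]bool, len(files))
	for i, name := range files {
		f, err := os.Open(name)
		if err != nil {
			return err
		}
		defer f.Close() // chunk files stay open for the whole merge
		scanners[i] = bufio.NewScanner(f)
		if alive[i] = scanners[i].Scan(); alive[i] {
			current[i] = scanners[i].Text()
		}
	}
	for {
		min := -1
		for i := range files {
			if alive[i] && (min == -1 || current[i] < current[min]) {
				min = i
			}
		}
		if min == -1 {
			return nil // all chunk files exhausted
		}
		out <- current[min]
		if alive[min] = scanners[min].Scan(); alive[min] {
			current[min] = scanners[min].Text()
		}
	}
}
```

Since the input and output here are channels, this shape (or a library such as lanrat/extsort, which, as noted above, also works over channels) could feed the comparison logic without restructuring it much.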
Use more efficient data structures (Trie)
At that point we have the full keys and the full addresses (URL.Absolute). So we might[^4] use only the absolute path instead of a URL object in our sort code/algorithm. Additionally, we can use a Trie to further reduce the memory usage.
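To give a feel for why a Trie helps: keys that share path prefixes store each shared segment once instead of once per key, so deep, repetitive layouts pay much less per object. A minimal sketch over plain path strings (splitting on "/" and the node layout are my assumptions; a real implementation would hang the required object metadata off the leaf nodes):

```go
// Minimal path trie: each node stores one path segment, so a common prefix
// such as "logs/2023/01" is stored once instead of once per key.
package triesketch

import "strings"

type trieNode struct {
	children map[string]*trieNode
	isKey    bool // true if a full object key ends at this node
}

func newTrieNode() *trieNode {
	return &trieNode{children: make(map[string]*trieNode)}
}

// Insert adds an object key, splitting it into "/"-separated segments.
func (n *trieNode) Insert(key string) {
	node := n
	for _, seg := range strings.Split(key, "/") {
		child, ok := node.children[seg]
		if !ok {
			child = newTrieNode()
			node.children[seg] = child
		}
		node = child
	}
	node.isKey = true
}

// Contains reports whether the exact key was inserted.
func (n *trieNode) Contains(key string) bool {
	node := n
	for _, seg := range strings.Split(key, "/") {
		child, ok := node.children[seg]
		if !ok {
			return false
		}
		node = child
	}
	return node.isKey
}
```

How much this saves obviously depends on how much structure the keys share; a flat bucket full of long unique names gains little.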
Other tools
rclone
It uses in-memory sorting. Note that its implementation was the reference for s5cmd's sync: https://github.com/rclone/rclone/blob/HEAD/fs/march/march.go#L304
MC
MC does not sort the whole list. For remote, it relies on the ordering assumption. For local, it uses a custom algorithm that returns objects in sorted order. Its algorithm also handles the problem I mentioned earlier by appending a "/" to the end of directory names (in the Less method): https://github.com/minio/mc/blob/fe5b2f97e3d22324474e742edd15d9562467ef75/client-fs.go#L285-L297
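The trick is small enough to restate. Roughly (with illustrative types rather than MC's actual ones), the comparison appends "/" to directory names so a directory sorts exactly where the keys beneath it will:

```go
// Paraphrase of the ordering trick: compare directory names as if they
// ended with "/", so a directory sorts where its contents will.
type entry struct {
	name  string
	isDir bool
}

func sortKey(e entry) string {
	if e.isDir {
		return e.name + "/"
	}
	return e.name
}

// less yields a.b < a/... < ab, matching the lexicographic order of the
// full object keys instead of the order of the bare directory names.
func less(a, b entry) bool {
	return sortKey(a) < sortKey(b)
}
```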
Conclusion
We are inclined to use, or at least try, the third option (with the Trie). But, unfortunately, we will delay working on this issue at least until the completion of #475 and #478.
Extra commentary
There are also some places where objects are not released (references aren't set to nil) immediately after they're used. I guess @Oppen meant some of those places. But even if we fix them, they probably won't reduce the peak memory usage, so they won't solve the out-of-memory error or fix #447. Nevertheless, I think those improvements should still be made.
[^2]: Modify getSourceAndDestinationObjects (return channels) and remove the sorting code.
[^3]: Modify compareObjects (receive and return channels).
[^4]: TBH, I suspect that we might need the URL.relative and URL.base fields too, but I'm not sure; I haven't thought it through thoroughly yet.
[^5]: and the objects in that directory are not known yet
For reference, the GCS docs say that objects in the list are ordered lexicographically by name.
https://cloud.google.com/storage/docs/xml-api/get-bucket-list