jackrabbit-oak
OAK-9960: (oak-run) introduced datastore-copy command
Introduced datastore-copy command in oak-run to download blobs/files from an Azure repository using multiple threads.
Just a few general comments about error handling.
- Imagine that there is some network problem that makes all download attempts fail. The error could be raised immediately when opening the connection (like hostname not found), or it could be a timeout that is raised only after each connection attempt has waited for 1 minute. My understanding of the current implementation is that it would not abort early and would try to download every file. And it seems the only error reporting is at the end, once every file has been processed by the downloader. So in the case of a timeout connecting to or reading from the blob store, the tool could spend a long time (hours) trying to download blobs without any success, while showing no indication that something is wrong. At a bare minimum, the tool should log any errors as soon as they happen, so the operator can abort the transfer. Maybe even abort the execution if a file fails to download.
- I don't see any logic to deal with transient errors. If downloading a dataset may take many hours, the chance of some transient error is very high. What will happen in this case? Will the whole transfer have to be restarted from scratch? This could easily lead to situations where it becomes close to impossible to download a large dataset because of random transient failures.
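To make the first point concrete, here is a minimal sketch of what I mean by logging errors as they happen and optionally aborting early. It is not based on the code in this PR: `FailFastDownloader`, `BlobFetcher`, and the `failOnError` parameter are hypothetical names used only for illustration, and `System.err` stands in for whatever logger the command uses.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class FailFastDownloader {

    /** Hypothetical stand-in for the real per-blob download call. */
    public interface BlobFetcher {
        void fetch(String blobName) throws Exception;
    }

    /**
     * Downloads all blobs on a fixed-size thread pool and returns the number of failures.
     * Every failure is reported immediately. If failOnError is true, tasks that have not
     * started yet are skipped once the first failure has been seen.
     */
    public static int downloadAll(List<String> blobs, BlobFetcher fetcher,
                                  int threads, boolean failOnError) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger failures = new AtomicInteger();
        AtomicBoolean aborted = new AtomicBoolean(false);

        for (String blob : blobs) {
            pool.submit(() -> {
                if (aborted.get()) {
                    return; // an earlier download failed and fail-fast is enabled
                }
                try {
                    fetcher.fetch(blob);
                } catch (Exception e) {
                    failures.incrementAndGet();
                    // Report right away instead of only in the final summary.
                    System.err.println("Failed to download " + blob + ": " + e);
                    if (failOnError) {
                        aborted.set(true);
                    }
                }
            });
        }

        pool.shutdown();
        try {
            // Arbitrary upper bound for this sketch; a real tool would make it configurable.
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return failures.get();
    }
}
```

With more than one thread, downloads that are already in flight when the first failure occurs still run to completion, so the failure count can be slightly higher than one even with `failOnError` enabled.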
@nfsantos
- This implementation mimics pretty much what the `azcopy copy` command does. In case of errors, the blob is just skipped and reported at the end of the execution. To fail fast, we could introduce a flag (e.g. `fail-on-error`). When `true`, the command will fail when the first item fails, without waiting until the end. I have introduced an intermediate log error message as you proposed, so the operator does not have to wait until the end in case something goes wrong.
- Even on this, I have replicated what `azcopy copy` does. We should introduce a `retry<int>` flag to instruct the command to retry failed operations.
Both enhancements can be addressed with separate PRs.
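For the retry follow-up, a rough sketch of how a single failed operation could be retried with exponential backoff. Nothing here exists in oak-run today: `RetryingCall`, `withRetries`, and the `maxRetries` and `initialDelayMs` parameters are hypothetical, and the doubling backoff is just one possible policy.

```java
import java.util.concurrent.Callable;

public class RetryingCall {

    /**
     * Runs op and retries it up to maxRetries times if it throws.
     * The wait between attempts starts at initialDelayMs and doubles each time.
     * The last exception is rethrown once the retries are exhausted.
     */
    public static <T> T withRetries(Callable<T> op, int maxRetries, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e; // give up: initial attempt plus maxRetries retries all failed
                }
                System.err.println("Attempt " + (attempt + 1) + " failed, retrying in "
                        + delay + " ms: " + e);
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

A download task would then wrap its per-blob call, for example `withRetries(() -> downloadBlob(name), retries, 500)`, where `downloadBlob` is a placeholder for the existing download logic. A real implementation would probably retry only on errors known to be transient rather than on every exception.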