Overwrites existing file with identical file
I just re-ran an upload_file command that had previously completed successfully: same filename, same file, same bucket.
As I understand it, B2 uses hashes, so I expected the CLI to realise that the file in the bucket already had a matching hash and skip the upload. Instead it uploads the whole file again, which seems rather pointless.
The upload_file command is a rather simple command that does no checking of already uploaded files (though it does try to resume unfinished uploads). The sync command can be used to copy only new files. Currently the comparison uses the modification time by default, or alternatively the file size. Comparing the file hash may be a future addition.
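For illustration, here is a minimal Python sketch of the kind of comparison sync performs, assuming the remote file's modification time (in milliseconds, as B2 stores it) and size have already been fetched; this is not the actual CLI implementation:

```python
import os

def needs_upload(local_path, remote_mod_time_millis, remote_size, compare="modTime"):
    """Decide whether a local file should be re-uploaded (sketch only)."""
    st = os.stat(local_path)
    if compare == "size":
        return st.st_size != remote_size
    # B2 records modification times in milliseconds since the epoch
    return int(st.st_mtime * 1000) != remote_mod_time_millis
```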
Thanks. That makes some sense, though given that it already resumes unfinished uploads, I figured it would be smart enough not to re-upload the same thing. I'd suggest this as a feature enhancement then.
To be precise, you would like the upload command to fail if a file with the same name and hash is already in the bucket (unless the --ignoreDuplicate flag is passed). Is my understanding correct?
For this feature we need support from the API. Currently it's not possible to get the part hashes of large files after the upload has finished.
We can compare the hash of the whole file
> To be precise, you would like the upload command to fail if a file with the same name and hash is already in the bucket (unless the --ignoreDuplicate flag is passed). Is my understanding correct?
Yep. It'd save bandwidth on both the client and B2 side.
@bwbeach is this design OK with you? It will decrease single-thread performance when --ignoreDuplicate is not provided, due to the additional ls operation. I can rename the flag to --skipDuplicateCheck and document it properly.
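A rough sketch of that check: hash the local file, then do the extra ls (list) call. Here `list_file_names` is a hypothetical wrapper around the b2_list_file_names API, which returns contentSha1 for each file:

```python
import hashlib

def is_duplicate(list_file_names, bucket_id, file_name, local_path):
    """Return True if the bucket already holds file_name with the same SHA1 (sketch)."""
    sha1 = hashlib.sha1()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha1.update(chunk)
    response = list_file_names(bucket_id, start_file_name=file_name, max_file_count=1)
    return any(
        f["fileName"] == file_name and f["contentSha1"] == sha1.hexdigest()
        for f in response["files"]
    )
```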
Yes. I'm OK with this design.
> We can compare the hash of the whole file
I believe there is no hash of the whole file if it's a large file and the user didn't specify it during the upload (which the B2 CLI doesn't do).
Then we should fix that.
It's probably not calculated because it would put a huge load on the servers: B2 would have to hash the whole file again once b2_finish_large_file is called. Adding it now also seems infeasible, as it would mean hashing every large file uploaded to B2 so far (which probably amounts to petabytes of data).
I'm missing something. I thought the proposal was to compare the hash of the file being uploaded with the hash of the latest version of the file with the same name in B2. (For large files, this would work only if large_file_sha1 is set in the file info.)
Yes, I'm talking about the large file problem. That's probably the case most users care about.
We should update the CLI to set large_file_sha1.
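That fix could look roughly like the sketch below: compute the whole-file SHA1 before starting the upload and record it under the large_file_sha1 key in fileInfo. `start_large_file` is an assumed wrapper around the b2_start_large_file API call, not the real CLI code:

```python
import hashlib

def start_with_sha1(start_large_file, bucket_id, file_name, local_path):
    """Start a large-file upload with large_file_sha1 recorded in fileInfo (sketch)."""
    sha1 = hashlib.sha1()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha1.update(chunk)
    return start_large_file(
        bucket_id,
        file_name,
        content_type="b2/x-auto",
        file_info={"large_file_sha1": sha1.hexdigest()},
    )
```

Note that the up-front hashing pass here is exactly the extra full read of the file that the performance objection below is about.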
You folks will likely know better than me, but I assumed this was possible because, at least looking at the API, the SHA1 is sent to Backblaze at file upload time (https://www.backblaze.com/b2/docs/b2_upload_file.html), and is returned at download:
> X-Bz-Content-Sha1 (required)
> The SHA1 checksum of the content of the file. B2 will check this when the file is uploaded, to make sure that the file arrived correctly. It will be returned in the X-Bz-Content-Sha1 header when the file is downloaded.
b2_get_file_info also exposes it (https://www.backblaze.com/b2/docs/b2_get_file_info.html) - couldn't those be used?
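A minimal sketch of what using b2_get_file_info for this might look like (account authorization and the exact API version path are assumptions here); as a later reply explains, contentSha1 is only populated for small files:

```python
import requests

def remote_sha1(api_url, auth_token, file_id):
    """Fetch the stored SHA1 for a file, if B2 has one (sketch)."""
    resp = requests.post(
        api_url + "/b2api/v1/b2_get_file_info",
        headers={"Authorization": auth_token},
        json={"fileId": file_id},
    )
    info = resp.json()
    if info["contentSha1"] == "none":
        # Large files report "none"; the optional large_file_sha1 entry
        # in fileInfo is then the only whole-file hash available.
        return info["fileInfo"].get("large_file_sha1")
    return info["contentSha1"]
```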
That's quite a performance decrease. I've already complained about having to specify the hash at the beginning of each part. Calculating the hash of the whole file up front is even worse: it would mean each file is read from disk three times! And also encrypted three times, with the encryption feature I'm implementing!
> [...] I assumed this was possible because, at least looking at the API, the SHA1 is sent to Backblaze at file upload time [...]
That's only the case for "small" files. If the b2_start_large_file/b2_upload_part/b2_finish_large_file functions are used, the hash of the whole file is optional.
Then let's compare the hashes of all the parts instead of the single hash of the whole file. Can we retrieve them from B2 somehow for that purpose?
At the moment only while the file is still unfinished. But @bwbeach said a while ago that the hashes are kept, and an additional API function could allow access to them.
I'm fine with adding the API.
We already have b2_list_parts, but it returns an error if you call it on a file that has been finished. It may be as simple as removing that check, and letting it run.
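If that extension lands, the large-file duplicate check could be sketched like this: compare locally computed part hashes against the parts the (extended) b2_list_parts call would return. Here remote_parts is assumed to be sorted by partNumber and to carry contentLength and contentSha1, as parts of unfinished files do today:

```python
import hashlib

def parts_match(local_path, remote_parts):
    """Return True if the local file's parts hash to the same SHA1s (sketch)."""
    with open(local_path, "rb") as f:
        for part in remote_parts:  # assumed sorted by partNumber
            data = f.read(part["contentLength"])
            if hashlib.sha1(data).hexdigest() != part["contentSha1"]:
                return False
        return f.read(1) == b""  # local file must not be longer than the remote one
```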
OK, then let's wait for the API extension.