Overwrites existing file with identical file
I just re-ran an upload_file command that had previously completed successfully: same filename, same file, same bucket.
As I understand it, B2 uses hashes, so I expected the CLI to realise that the file in the bucket already had a matching hash and skip the upload. Instead it uploads the whole file again, which seems rather pointless.
The upload_file command is a rather simple command that does no checking of already uploaded files (though it does try to resume unfinished uploads). The sync command can be used to copy only new files. Currently the comparison uses the modification time by default, or alternatively the file size. Comparing the file hash may be a future addition.
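For illustration, here is a minimal Python sketch of the kind of comparison sync performs, assuming the remote file's modification time (in milliseconds, as B2 stores it) and size have already been fetched; this is not the actual CLI implementation:

```python
import os

def needs_upload(local_path, remote_mod_time_millis, remote_size, compare="modTime"):
    """Decide whether a local file should be re-uploaded (sketch only)."""
    st = os.stat(local_path)
    if compare == "size":
        return st.st_size != remote_size
    # B2 records modification times in milliseconds since the epoch
    return int(st.st_mtime * 1000) != remote_mod_time_millis
```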
Thanks. That makes some sense, though given that it already resumes unfinished uploads, I figured it would be smart enough not to re-upload the same thing. I'd suggest this as a feature enhancement then.
To be precise, you would like the upload command to fail if a file with the same name and hash is already in the bucket (unless the --ignoreDuplicate flag is passed). Is my understanding correct?
For this feature we need support from the API. Currently it's not possible to get the part hashes of large files after the upload has finished.
We can compare the hash of the whole file
> To be precise, you would like the upload command to fail if a file with the same name and hash is already in the bucket (unless the --ignoreDuplicate flag is passed). Is my understanding correct?
Yep. It'd save bandwidth on both the client and B2 side.
@bwbeach is this design OK with you? It will decrease single-thread performance when --ignoreDuplicate is not provided, due to the additional ls operation. I can rename the flag to --skipDuplicateCheck and document it properly.
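A rough sketch of that check: hash the local file, then do the extra ls (list) call. Here `list_file_names` is a hypothetical wrapper around the b2_list_file_names API, which returns contentSha1 for each file:

```python
import hashlib

def is_duplicate(list_file_names, bucket_id, file_name, local_path):
    """Return True if the bucket already holds file_name with the same SHA1 (sketch)."""
    sha1 = hashlib.sha1()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha1.update(chunk)
    response = list_file_names(bucket_id, start_file_name=file_name, max_file_count=1)
    return any(
        f["fileName"] == file_name and f["contentSha1"] == sha1.hexdigest()
        for f in response["files"]
    )
```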
Yes. I'm OK with this design.
> We can compare the hash of the whole file
I believe there is no hash of the whole file if it's a large file and the user didn't specify it during the upload (which the B2 CLI doesn't do).
Then we should fix that.
It's probably not calculated because it would put a huge load on the servers: B2 would have to hash the whole file again once b2_finish_large_file is called. Adding it now also seems infeasible, as it would mean hashing every large file uploaded to B2 so far (which probably amounts to petabytes of data).
I'm missing something. I thought the proposal was to compare the hash of the file being uploaded with the hash of the latest version of the file with the same name in B2. (For large files, this would work only if large_file_sha1 is set in the file info.)
Yes, I'm talking about the large file problem. That's probably the case most users care about.
We should update the CLI to set large_file_sha1.
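That fix could look roughly like the sketch below: compute the whole-file SHA1 before starting the upload and record it under the large_file_sha1 key in fileInfo. `start_large_file` is an assumed wrapper around the b2_start_large_file API call, not the real CLI code:

```python
import hashlib

def start_with_sha1(start_large_file, bucket_id, file_name, local_path):
    """Start a large-file upload with large_file_sha1 recorded in fileInfo (sketch)."""
    sha1 = hashlib.sha1()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha1.update(chunk)
    return start_large_file(
        bucket_id,
        file_name,
        content_type="b2/x-auto",
        file_info={"large_file_sha1": sha1.hexdigest()},
    )
```

Note that the up-front hashing pass here is exactly the extra full read of the file that the performance objection below is about.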
You folks will likely know better than me, but I assumed this was possible because, at least looking at the API, the SHA1 is sent to Backblaze at file upload time (https://www.backblaze.com/b2/docs/b2_upload_file.html), and is returned at download:
> X-Bz-Content-Sha1 (required)
> The SHA1 checksum of the content of the file. B2 will check this when the file is uploaded, to make sure that the file arrived correctly. It will be returned in the X-Bz-Content-Sha1 header when the file is downloaded.
b2_get_file_info also exposes it (https://www.backblaze.com/b2/docs/b2_get_file_info.html) - couldn't those be used?
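A minimal sketch of what using b2_get_file_info for this might look like (account authorization and the exact API version path are assumptions here); as a later reply explains, contentSha1 is only populated for small files:

```python
import requests

def remote_sha1(api_url, auth_token, file_id):
    """Fetch the stored SHA1 for a file, if B2 has one (sketch)."""
    resp = requests.post(
        api_url + "/b2api/v1/b2_get_file_info",
        headers={"Authorization": auth_token},
        json={"fileId": file_id},
    )
    info = resp.json()
    if info["contentSha1"] == "none":
        # Large files report "none"; the optional large_file_sha1 entry
        # in fileInfo is then the only whole-file hash available.
        return info["fileInfo"].get("large_file_sha1")
    return info["contentSha1"]
```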
That's quite a performance decrease. I've already complained about having to specify the hash at the beginning of each part. Calculating the hash of the whole file up front is even worse: it would mean each file is read from disk three times! And also encrypted three times, with the encryption feature I'm implementing!
> [...] I assumed this was possible because, at least looking at the API, the SHA1 is sent to Backblaze at file upload time [...]
That's only the case for "small" files. If the b2_start_large_file/b2_upload_part/b2_finish_large_file functions are used, the hash of the whole file is optional.
Then let's compare the hashes of all the parts instead of the single hash of the whole file. Can we retrieve them from B2 somehow for that purpose?
At the moment only while the file is still unfinished. But @bwbeach said a while ago that the hashes are kept, and an additional API function could allow access to them.
I'm fine with adding the API.
We already have b2_list_parts, but it returns an error if you call it on a file that has been finished. It may be as simple as removing that check, and letting it run.
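If that extension lands, the large-file duplicate check could be sketched like this: compare locally computed part hashes against the parts the (extended) b2_list_parts call would return. Here remote_parts is assumed to be sorted by partNumber and to carry contentLength and contentSha1, as parts of unfinished files do today:

```python
import hashlib

def parts_match(local_path, remote_parts):
    """Return True if the local file's parts hash to the same SHA1s (sketch)."""
    with open(local_path, "rb") as f:
        for part in remote_parts:  # assumed sorted by partNumber
            data = f.read(part["contentLength"])
            if hashlib.sha1(data).hexdigest() != part["contentSha1"]:
                return False
        return f.read(1) == b""  # local file must not be longer than the remote one
```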
OK, then let's wait for the API extension.