
Clean up orphaned blobs from claims

Open belikor opened this issue 4 years ago • 1 comments

In LBRY, when a claim is downloaded, its blob files are stored in the blobfiles directory. On Linux this is normally:

/home/user/.local/share/lbrynet/blobfiles

However, if the claim is re-uploaded, for example because the file was re-encoded, the blobs will be different. A new set of blobs will have to be downloaded, but the old blobs will remain on the system, taking up hard drive space.

A function needs to be created to examine the blobfiles directory so that only blobs belonging to currently managed claims are kept. All other blobs, which are not tied to a specific claim, should be deleted so that they don't take up unnecessary space on the system.


Each claim with a URI or 'claim_id' will have a "manifest" blob file. This blob file is named after the 'sd_hash' of the claim. This information is found under a specific key in the dictionary representing the claim, item["value"]["source"]["sd_hash"].
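As a minimal sketch of the lookup described above, the helper below reads the 'sd_hash' from a claim dictionary. The function name get_sd_hash is hypothetical and not part of lbrytools; only the key path item["value"]["source"]["sd_hash"] comes from the description above.

```python
from typing import Optional


def get_sd_hash(item: dict) -> Optional[str]:
    """Return the 'sd_hash' of a claim dictionary, or None if it is absent.

    Hypothetical helper: it only assumes the key path
    item["value"]["source"]["sd_hash"] described in this issue.
    """
    value = item.get("value") or {}
    source = value.get("source") or {}
    return source.get("sd_hash")
```

Returning None instead of raising KeyError makes it easier to skip claims that have no source, such as channels or reposts, if those appear in the list.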

Inside this manifest blob file there is JSON data listing all the blobs that make up the claim. Therefore, by examining this manifest blob file, we can know whether all of its blobs are present in the blobfiles directory or not.
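A rough sketch of reading a manifest blob could look like the following. It assumes the manifest JSON has a "blobs" list whose entries carry a "blob_hash" key, and that the final terminator entry has no "blob_hash"; these field names are an assumption based on the LBRY data specification and should be checked against real manifest files. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def manifest_blob_hashes(manifest_text: str) -> List[str]:
    """Return the data blob hashes listed in a manifest's JSON text.

    Assumption: the manifest has a "blobs" list of dictionaries, and
    entries without a "blob_hash" (such as a terminator) are skipped.
    """
    data = json.loads(manifest_text)
    return [b["blob_hash"] for b in data.get("blobs", []) if "blob_hash" in b]


def read_manifest(blobfiles_dir: str, sd_hash: str) -> List[str]:
    """Read the manifest blob named after sd_hash from the blobfiles directory."""
    path = Path(blobfiles_dir) / sd_hash
    return manifest_blob_hashes(path.read_text())
```

read_manifest raises FileNotFoundError when the manifest blob is missing, which the caller can treat as the "no 'sd_hash' blob" situation.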

We can get all claims with search.sort_files (lbrynet file list), and examine the 'sd_hash' of each of them, to find all blobs in blobfiles.

All additional blobs that don't seem to belong to any claim, that is, that are not contained in any manifest blob file, should be considered orphaned, and thus can be deleted from the system.
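The orphan check itself reduces to a set difference. The sketch below is not the lbrytools implementation; find_orphans and list_blobfiles are hypothetical names, and the manifests argument is assumed to map each claim's 'sd_hash' to the list of blob hashes read from its manifest.

```python
import os
from typing import Dict, Iterable, List, Set


def find_orphans(present: Iterable[str], manifests: Dict[str, List[str]]) -> Set[str]:
    """Return blob names that no known manifest refers to.

    present:   names of the files found in the blobfiles directory.
    manifests: maps each 'sd_hash' to the blob hashes in that manifest.
    """
    referenced = set(manifests)  # the manifest blobs themselves are kept
    for hashes in manifests.values():
        referenced.update(hashes)
    return set(present) - referenced


def list_blobfiles(blobfiles_dir: str) -> Set[str]:
    """Return the names of all regular files in the blobfiles directory."""
    return {entry.name for entry in os.scandir(blobfiles_dir) if entry.is_file()}
```

Keeping the comparison separate from the directory listing means the result can be reviewed, or printed as a dry run, before any file is actually deleted.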

Reference documentation on how content is encoded in LBRY using blobs: https://lbry.tech/spec#data

belikor, May 26 '21 23:05

With the new functions count_blobs and count_blobs_all (2ce6b29d7, 0490a4a578, b46e04122, f0e80ba70), we can now count the blobs of each claim and test whether the blob files corresponding to that claim are in the blobfiles directory.

We can identify 5 cases for each claim: all blobs present, some blobs missing, no 'sd_hash' blob (maybe manually deleted), claim not found (maybe deleted in the network), other errors.

To find orphaned blobs we have to count the blobs in the first two cases, and maybe redownload the claim in the third case to make sure its 'sd_hash' blob is accounted for. The fourth case is a claim that no longer exists online, so we probably don't want to keep its blobs either. The fifth case is particularly rare, so we don't expect to see it under normal circumstances.
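The five cases could be expressed as a small classification step. This is only an illustrative sketch and does not reproduce the signature or return values of count_blobs; BlobStatus and classify_claim are hypothetical names, and passing data_blobs=None is an assumed way to signal that the manifest could not be read.

```python
from enum import Enum
from typing import Iterable, Optional, Set


class BlobStatus(Enum):
    COMPLETE = "all blobs present"
    INCOMPLETE = "some blobs missing"
    NO_SD_BLOB = "no 'sd_hash' blob"
    NOT_FOUND = "claim not found"
    ERROR = "other error"


def classify_claim(claim_found: bool,
                   sd_hash: Optional[str],
                   data_blobs: Optional[Iterable[str]],
                   present: Set[str]) -> BlobStatus:
    """Classify one claim against the names present in the blobfiles directory."""
    if not claim_found:
        return BlobStatus.NOT_FOUND
    if not sd_hash:
        return BlobStatus.ERROR  # claim resolved but has no 'sd_hash'
    if sd_hash not in present:
        return BlobStatus.NO_SD_BLOB
    if data_blobs is None:
        return BlobStatus.ERROR  # manifest exists but could not be read
    if all(blob in present for blob in data_blobs):
        return BlobStatus.COMPLETE
    return BlobStatus.INCOMPLETE
```

Under this sketch, only claims classified as COMPLETE or INCOMPLETE contribute blobs to the set that must be kept.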

belikor, Jun 15 '21 04:06