Question about UniProt50 database
Dear foldseek developers,
My questions are about the UniProt50 DB. I read your preprint, and it says UniProt50 is
Uniprot/AlphaFold database clustered to 50% sequence identity
The first question is about the clustering method. I guess you used MMseqs2; is that correct?
Second question is about the availability of the clustering result. Is the clustering result available anywhere? I would like to access the mapping between cluster representatives and cluster members.
P.S.
I found a typo in your preprint. It says UniProt50 contains "52,327,413 million models", but "million" seems unnecessary here.
Best regards.
(1) We did use MMseqs2 for clustering, with mmseqs cluster ... -c 0.9 --min-seq-id 0.5 --cluster-reassign 1. For each cluster, we pick the member with the highest pLDDT as the representative.
(2) I think @milot-mirdita has them on the cluster. Could we upload them to r2?
Thank you for finding the "52,327,413 million models" mistake.
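The representative selection described in (1) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the input format (pairs of cluster id and member id), the per-member pLDDT table, and the function name pick_representatives are all assumptions made for illustration.

```python
from collections import defaultdict


def pick_representatives(assignments, plddt):
    """Pick, for each cluster, the member with the highest pLDDT.

    assignments: iterable of (cluster_id, member_id) pairs (hypothetical format)
    plddt: dict mapping member_id -> mean pLDDT (hypothetical input)
    Returns a dict mapping representative_id -> list of all cluster members.
    """
    clusters = defaultdict(list)
    for cluster_id, member in assignments:
        clusters[cluster_id].append(member)

    result = {}
    for members in clusters.values():
        # On ties, max() keeps the first member encountered.
        rep = max(members, key=lambda m: plddt[m])
        result[rep] = members
    return result


# Toy data, not real AFDB entries.
example_assignments = [("c1", "A"), ("c1", "B"), ("c2", "C")]
example_plddt = {"A": 71.2, "B": 88.5, "C": 60.0}
reps = pick_representatives(example_assignments, example_plddt)
```

In this toy input, B replaces A as the representative of the first cluster because it has the higher pLDDT.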
Thank you for your clarification of the clustering protocols.
This issue is about UniProt50, but how about ESM-atlas high-quality 30%? The About page of ESM-atlas says the Foldseek search runs against the clustered ESM database, but they do not provide access to the clustering results.
It will help a lot if the results are made available.
Best regards.
It would be great if the cluster assignments for UniProt50 and ESMAtlas30 could be made available!
@wresch I agree. We will update the database soon.
@milot-mirdita where can I find the mapping between cluster representatives and cluster members of UniProt50? Thanks
@gioodm The latest foldseek git version (precompiled binaries at: https://mmseqs.com/foldseek/) now includes a download script that downloads the AFDB50 with clustering information included. You can also use the --cluster-search 1 parameter to run a faster search against the representatives while still getting results for all AFDB structures.
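For anyone who needs the representative-to-member mapping as a lookup table, here is a minimal Python sketch. It assumes the clustering has been exported as a two-column, tab-separated file (representative, then member), which is the layout mmseqs createtsv writes for cluster results. Whether the AFDB50 download provides the mapping in exactly this form is an assumption, and the function names and identifiers below are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_cluster_tsv(handle):
    """Read a two-column TSV (representative, member) into rep -> [members]."""
    members_of = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        rep, member = row[0], row[1]
        members_of[rep].append(member)
    return dict(members_of)


def representative_of(members_of):
    """Invert the mapping: member -> representative."""
    return {m: rep for rep, members in members_of.items() for m in members}


# Toy input standing in for a real file; the identifiers are made up.
demo = io.StringIO("repA\trepA\nrepA\tmem1\nrepB\trepB\n")
mapping = read_cluster_tsv(demo)
lookup = representative_of(mapping)
```

With a real file, open it with open(path, newline="") and pass the handle to read_cluster_tsv instead of the in-memory demo.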
Great, thank you!