Question about UniProt50 database

Open yakomaxa opened this issue 3 years ago • 7 comments

Dear foldseek developers,

My questions are about the UniProt50 database. Your preprint describes UniProt50 as:

Uniprot/AlphaFold database clustered to 50% sequence identity

My first question is about the clustering method. I assume you used MMseqs2; is that correct?

Second question is about the availability of the clustering result. Is the clustering result available anywhere? I would like to access the mapping between cluster representatives and cluster members.

P.S.
I found a typo in your preprint. It says UniProt50 contains 52,327,413 million models but "million" seems not to be needed here.

Best regards.

yakomaxa avatar Feb 04 '23 11:02 yakomaxa

(1) We did use MMseqs2 for clustering, with mmseqs cluster ... -c 0.9 --min-seq-id 0.5 --cluster-reassign 1. We pick the member with the highest pLDDT as the representative of each cluster. (2) I think @milot-mirdita has the clustering results on the cluster. Could we upload them to r2?
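For readers who want to reproduce the representative choice described in (1) on their own data, here is a minimal Python sketch. It is not the pipeline used to build the database; it assumes that cluster assignments and a mean pLDDT per model are already available as plain dictionaries. The function name, accessions, cluster IDs, and scores are all hypothetical.

```python
from collections import defaultdict


def pick_representatives(cluster_of, plddt):
    """Return {cluster_id: member with the highest mean pLDDT}.

    cluster_of: {member accession: cluster id}   (assumed input layout)
    plddt:      {member accession: mean pLDDT}   (assumed input layout)
    """
    members = defaultdict(list)
    for member, cluster_id in cluster_of.items():
        members[cluster_id].append(member)
    # Sorting first makes ties deterministic: max() keeps the first maximal
    # element, so the alphabetically smallest accession wins a tie.
    return {
        cluster_id: max(sorted(ids), key=lambda m: plddt[m])
        for cluster_id, ids in members.items()
    }


# Hypothetical toy data, for illustration only.
cluster_of = {"A0A001": "c1", "A0A002": "c1", "A0A003": "c2"}
plddt = {"A0A001": 71.3, "A0A002": 88.9, "A0A003": 64.0}

representatives = pick_representatives(cluster_of, plddt)
print(representatives)  # {'c1': 'A0A002', 'c2': 'A0A003'}
```

The tie-breaking rule is an arbitrary choice made for this sketch; the thread does not say how ties are handled.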

Thank you for finding the "52,327,413 million models" mistake.

martin-steinegger avatar Feb 10 '23 05:02 martin-steinegger

Thank you for your clarification of the clustering protocols.

This issue is about UniProt50, but what about the ESM-atlas high-quality 30% clustering? The About page of ESM-atlas says the Foldseek search runs against the clustered ESM database, but they do not provide access to the clustering results.

It would help a lot if the results were made available.

Best regards.

yakomaxa avatar Mar 02 '23 15:03 yakomaxa

It would be great if the cluster assignments for Uniprot50 and ESMAtlas30 could be made available!

wresch avatar Apr 20 '23 13:04 wresch

@wresch I agree. We will update the database soon.

martin-steinegger avatar Apr 26 '23 16:04 martin-steinegger

@milot-mirdita where can I find the mapping between cluster representatives and cluster members of UniProt50? Thanks

gioodm avatar Aug 14 '23 13:08 gioodm

@gioodm The latest Foldseek git version (precompiled binaries at https://mmseqs.com/foldseek/) now includes a download script that fetches AFDB50 with the clustering information included. You can also use the --cluster-search 1 parameter to run a faster search against the representatives while still getting results for all AFDB structures.
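Once the clustering information is available locally, the representative-to-member mapping asked about above can be built with a few lines of Python. The sketch below assumes the assignments have been exported to a two-column, tab-separated file (representative, then member, one pair per line); that layout, the file name in the comment, and the accessions are assumptions for illustration. The `expand_hits` helper only illustrates the idea of searching representatives and then reporting all cluster members; it is not how Foldseek implements --cluster-search.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Parse 'representative<TAB>member' lines into {representative: [members]}."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


def expand_hits(representative_hits, clusters):
    """Expand hits on representatives to every member of their clusters."""
    expanded = []
    for rep in representative_hits:
        # Accessions missing from the mapping are kept as they are.
        expanded.extend(clusters.get(rep, [rep]))
    return expanded


# Hypothetical toy export; with a real file, pass
# open("clusters.tsv", newline="") instead (the file name is an assumption).
example = io.StringIO("P1\tP1\nP1\tP2\nP3\tP3\n")
clusters = read_clusters(example)
members_for_hit = expand_hits(["P1"], clusters)
print(members_for_hit)  # ['P1', 'P2']
```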

milot-mirdita avatar Aug 22 '23 04:08 milot-mirdita

Great, thank you!

gioodm avatar Aug 22 '23 09:08 gioodm