
Missing a3m files from the filtered uniclust30 database in OpenProteinSet

Open damiano-sg opened this issue 2 years ago • 2 comments

Hello, I downloaded the entire filtered uniclust30 database from AWS and noticed that some clusters contain only the pdb folder and are missing the a3m folder with the MSA. Is there a reason for that? I counted 677 clusters with this problem. Here are some of them: A0A023B4W7, A0A023SCZ3, A0A044VF87, A0A059F1C3. The full list of clusters missing the MSA is in this file: missing_msas.txt
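For anyone who wants to reproduce the count, here is a minimal sketch of the check. It assumes the download is laid out as one directory per cluster, each holding `a3m` and/or `pdb` subfolders, as described above; the function name and the root path in the usage note are hypothetical.

```python
from pathlib import Path


def clusters_missing_a3m(root):
    """Return the sorted names of cluster directories without a usable a3m folder.

    Assumed layout (not confirmed by the OpenFold team):
        root/<cluster_id>/a3m/...
        root/<cluster_id>/pdb/...
    """
    root = Path(root)
    missing = []
    for cluster_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        a3m_dir = cluster_dir / "a3m"
        # Treat an absent or empty a3m folder as a missing MSA.
        if not a3m_dir.is_dir() or not any(a3m_dir.iterdir()):
            missing.append(cluster_dir.name)
    return missing
```

Calling `clusters_missing_a3m("path/to/uniclust30_filtered")` on the local copy returns the cluster names, so `len(...)` of the result gives the count.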

Another question: how do I get the representative sequence for each cluster? Is it the first sequence in the a3m file? In some cases the first sequence is labelled as a consensus, for instance in A0A009FAV8. Does that mean the first sequence is not always the representative? I also tried looking at the list of clusters on the Uniclust30-2018_08 website to find the representative sequences, but the cluster names there do not appear to match the ones in OpenProteinSet.
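To check which files start with a consensus entry, something like the following sketch can be used. It only reads the first record of an a3m file and tests whether its header contains the word "consensus"; that substring test is my assumption about how such entries are labelled, the function names are hypothetical, and the sketch does not settle whether the first non-consensus record is the cluster representative.

```python
def first_a3m_record(path):
    """Return (header, sequence) for the first record in an a3m/FASTA-style file."""
    header = None
    seq_lines = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and leading '#' comment lines.
            if not line or line.startswith("#"):
                continue
            if line.startswith(">"):
                if header is not None:
                    break  # reached the second record, stop reading
                header = line[1:]
            elif header is not None:
                seq_lines.append(line)
    if header is None:
        raise ValueError(f"no sequence record found in {path}")
    return header, "".join(seq_lines)


def looks_like_consensus(header):
    """Heuristic (an assumption): consensus entries mention 'consensus' in the header."""
    return "consensus" in header.lower()
```

Running `looks_like_consensus(first_a3m_record(path)[0])` over every a3m file gives a list of the clusters where the first entry is a consensus rather than a real sequence.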

damiano-sg avatar Jan 17 '24 12:01 damiano-sg

Hi, were you able to find out how to get the representative sequence for each cluster? I raised an issue, https://github.com/aqlaboratory/openfold/issues/556, because I couldn't figure out (1) which Uniclust version OpenProteinSet used, and (2) if they used 2021_06 (the most recent release when they curated this database in December 2021), why the number of a3m MSAs they created (~16M) does not match the number of clusters listed in uniref_mapping.tsv in the Uniclust database.

I would appreciate it if you could share any information. Thank you!

slee-ai avatar Nov 03 '25 21:11 slee-ai

Hi @slee-ai, no, I'm sorry, I did not hear back from the OpenFold team.

damiano-sg avatar Nov 04 '25 17:11 damiano-sg