MMseqs2
Wrong accession numbers in colabfold_envdb_202108
Expected Behavior
Each accession number should map to the same sequence in metaclust and in colabfold_envdb.
Current Behavior
The metaclust ID-to-sequence association seems to be correct.
colabfold_envdb seems to have scrambled the name-sequence associations, particularly for the JGI sequence IDs (the UniProt IDs that I checked seemed to be OK).
Steps to Reproduce (for bugs)
input:
wget https://metaclust.mmseqs.com/2018_06/metaclust_nr.fasta.gz
gunzip metaclust_nr.fasta.gz
grep -A1 GraSoiStandDraft_25_1057303.scaffolds.fasta_scaffold3271246_1 metaclust_nr.fasta
output:
>GraSoiStandDraft_25_1057303.scaffolds.fasta_scaffold3271246_1 # 1 # 249 # 1 # ID=3271246_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.582
HIGGTHYQNHHDFDPYLARVQQGELPVYRALTPSADERLIREFILQLKLGQVSRAYFQKKFGIELCERFRAPFQTLADWGLLA
input:
wget http://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz
tar xvf colabfold_envdb_202108.tar.gz
grep GraSoiStandDraft_25_1057303.scaffolds.fasta_scaffold3271246_1 colabfold_envdb_202108_h.tsv
output:
117269648 GraSoiStandDraft_25_1057303.scaffolds.fasta_scaffold3271246_1
input:
grep 117269648 colabfold_envdb_202108_seq.tsv
output:
117269648 MAYTLPELSYDYAALEPHVDAETMRIHHDLHHAGYMNKLNAALEKYPEFFEKGIEDLMRNLDKIPEDVRGGVKNNGGGYFNHNLFWESMSPDGGAPEGELKDAIEKSFGSFDEMKEKFSNAAATQFGSGWAWLYKESDGSLGITNTSNQDIPFAEGRTLLMNLDVWEHSYYLKYQNKRPDYIENWWNVLNWKGVAEKFKS
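The two grep lookups above can be combined into one step, which makes it easier to check many accessions. This is only a minimal sketch: `lookup_seq` is a hypothetical helper name, and it assumes that both `.tsv` files are tab-separated with the numeric key in the first column, and that the accession is the first whitespace-delimited token of the header column.

```shell
# lookup_seq ACCESSION HEADER_TSV SEQ_TSV
# Hypothetical helper: prints the sequence stored for ACCESSION.
# Assumes tab-separated files with the numeric key in column 1.
lookup_seq() {
    acc=$1
    header_tsv=$2
    seq_tsv=$3

    # Find the numeric key whose header starts with exactly this accession.
    key=$(awk -F'\t' -v acc="$acc" \
        '{ split($2, t, " "); if (t[1] == acc) { print $1; exit } }' \
        "$header_tsv")

    # Fail if the accession is not present in the header file.
    [ -n "$key" ] || return 1

    # Print the sequence stored under that key.
    awk -F'\t' -v key="$key" '$1 == key { print $2; exit }' "$seq_tsv"
}
```

For example, `lookup_seq GraSoiStandDraft_25_1057303.scaffolds.fasta_scaffold3271246_1 colabfold_envdb_202108_h.tsv colabfold_envdb_202108_seq.tsv` should print the same sequence as the second grep above, if the assumptions about the file layout hold.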
I then go to
https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf?organism=GraSoiStandDraft_25
and download the scaffolds for that sample.
I then make a BLAST database and search the two sequences
>metaclust
HIGGTHYQNHHDFDPYLARVQQGELPVYRALTPSADERLIREFILQLKLGQVSRAYFQKKFGIELCERFRAPFQTLADWGLLA
>colabfold_envdb_202108
MAYTLPELSYDYAALEPHVDAETMRIHHDLHHAGYMNKLNAALEKYPEFFEKGIEDLMRNLDKIPEDVRGGVKNNGGGYFNHNLFWESMSPDGGAPEGELKDAIEKSFGSFDEMKEKFSNAAATQFGSGWAWLYKESDGSLGITNTSNQDIPFAEGRTLLMNLDVWEHSYYLKYQNKRPDYIENWWNVLNWKGVAEKFKS
against the scaffolds using tblastn. The metaclust sequence gives a perfect match on scaffold3271246, but the best match for the colabfold_envdb_202108 sequence has only about 50% identity.