Wrong accession numbers in colabfold_envdb_202108

Open seanrjohnson opened this issue 3 years ago • 0 comments

Expected Behavior

Accession-number-sequence associations should be the same between metaclust and colabfold_envdb

Current Behavior

The metaclust id seems to be correct.

colabfold_envdb seems to have scrambled the name-sequence associations, particularly for the JGI sequence IDs (the UniProt IDs that I checked seemed to be OK).

Steps to Reproduce (for bugs)

input:

wget https://metaclust.mmseqs.com/2018_06/metaclust_nr.fasta.gz
gunzip metaclust_nr.fasta.gz
grep -A1 GraSoiStandDraft_25_1057303.scaffolds.fasta_scaffold3271246_1 metaclust_nr.fasta

output:

>GraSoiStandDraft_25_1057303.scaffolds.fasta_scaffold3271246_1 # 1 # 249 # 1 # ID=3271246_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.582
HIGGTHYQNHHDFDPYLARVQQGELPVYRALTPSADERLIREFILQLKLGQVSRAYFQKKFGIELCERFRAPFQTLADWGLLA

input:

wget http://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz
tar xvf colabfold_envdb_202108.tar.gz
grep GraSoiStandDraft_25_1057303.scaffolds.fasta_scaffold3271246_1 colabfold_envdb_202108_h.tsv

output:

117269648       GraSoiStandDraft_25_1057303.scaffolds.fasta_scaffold3271246_1

input:

grep 117269648 colabfold_envdb_202108_seq.tsv

output:

117269648       MAYTLPELSYDYAALEPHVDAETMRIHHDLHHAGYMNKLNAALEKYPEFFEKGIEDLMRNLDKIPEDVRGGVKNNGGGYFNHNLFWESMSPDGGAPEGELKDAIEKSFGSFDEMKEKFSNAAATQFGSGWAWLYKESDGSLGITNTSNQDIPFAEGRTLLMNLDVWEHSYYLKYQNKRPDYIENWWNVLNWKGVAEKFKS
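
The two lookups above can be combined into one step. This is only a rough sketch: `lookup_seq` is a hypothetical helper name, and it assumes both TSV files are tab-separated with the numeric key in the first column, which is what the output above suggests.

```shell
# Hypothetical helper (not part of MMseqs2 or ColabFold): print the sequence
# that a database stores under a given header name.
# Assumption: both files are tab-separated, numeric key in column 1.
lookup_seq() {
    name="$1"      # accession to look up
    headers="$2"   # e.g. colabfold_envdb_202108_h.tsv
    seqs="$3"      # e.g. colabfold_envdb_202108_seq.tsv

    # Find the key whose header starts with exactly this accession
    # (the first whitespace-delimited token of column 2).
    key=$(awk -F'\t' -v n="$name" \
        '{ split($2, a, " "); if (a[1] == n) { print $1; exit } }' "$headers")
    [ -n "$key" ] || return 1

    # Print the sequence stored under that key.
    awk -F'\t' -v k="$key" '$1 == k { print $2; exit }' "$seqs"
}
```

For example: `lookup_seq GraSoiStandDraft_25_1057303.scaffolds.fasta_scaffold3271246_1 colabfold_envdb_202108_h.tsv colabfold_envdb_202108_seq.tsv`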

I then go to

https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf?organism=GraSoiStandDraft_25

and download the scaffolds for that sample.

I then make a blast database and blast the two sequences

>metaclust
HIGGTHYQNHHDFDPYLARVQQGELPVYRALTPSADERLIREFILQLKLGQVSRAYFQKKFGIELCERFRAPFQTLADWGLLA

>colabfold_envdb_202108
MAYTLPELSYDYAALEPHVDAETMRIHHDLHHAGYMNKLNAALEKYPEFFEKGIEDLMRNLDKIPEDVRGGVKNNGGGYFNHNLFWESMSPDGGAPEGELKDAIEKSFGSFDEMKEKFSNAAATQFGSGWAWLYKESDGSLGITNTSNQDIPFAEGRTLLMNLDVWEHSYYLKYQNKRPDYIENWWNVLNWKGVAEKFKS

against the scaffolds using tblastn. I get a perfect match on scaffold3271246 for the metaclust sequence, but the best match from the colabfold_envdb_202108 sequence has about 50% identity.
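
For anyone who wants to repeat the comparison, the BLAST step could look roughly like the sketch below. The wrapper name and the file names are placeholders rather than the exact commands I ran; it assumes BLAST+ (`makeblastdb`, `tblastn`) is installed and that the two sequences above are saved together in one protein FASTA file.

```shell
# Hypothetical wrapper; the function name and file names are placeholders.
blast_against_scaffolds() {
    scaffolds="$1"   # nucleotide FASTA of the scaffolds downloaded from JGI
    queries="$2"     # protein FASTA holding the two sequences above
    db="${scaffolds%.*}_db"

    # Build a nucleotide BLAST database from the scaffolds.
    makeblastdb -in "$scaffolds" -dbtype nucl -out "$db" || return 1

    # Search the protein queries against the translated scaffolds.
    # -outfmt 6 prints tabular hits; column 3 is percent identity.
    tblastn -query "$queries" -db "$db" -outfmt 6
}
```

For example: `blast_against_scaffolds scaffolds.fasta queries.faa`, where both file names are assumed.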

seanrjohnson, May 06 '22 20:05