
refseq_virus/NCBI taxonomy is more than a year behind the current VMR.39

Open igortru opened this issue 1 year ago • 17 comments

I would suggest rebuilding the Metabuli viral reference database using the current official viral taxonomy provided at https://ictv.global/vmr

Technically, it is not difficult. Your tool processed it very smoothly. If you are interested, I can provide a URL where the new database can be downloaded. You will be surprised how many virus names/lineages cannot be synchronized between these two major resources: thousands...

igortru avatar Sep 10 '24 13:09 igortru

Thank you for pointing this out and great to hear that Metabuli was working well. I should check the resource right away. Could you please provide the url to download the DB? I want to test and reproduce it. Does the resource provide NCBI-style .dmp files like nodes.dmp and names.dmp?

jaebeom-kim avatar Sep 10 '24 14:09 jaebeom-kim

Do you have access to AWS S3? s3://serratus-public/igortu/metabuli.VMR.39/ or https://serratus-public.s3.amazonaws.com/igortu/metabuli.VMR.39/

This directory contains all required files, including the dmp files and the nucleotide FASTA (>2 GB). The whole VMR taxonomy now has new taxids that are incompatible with the NCBI taxonomy. Full synchronization between VMR and NCBI taxonomy IDs is impossible right now.

igortru avatar Sep 10 '24 14:09 igortru

Great! I'm downloading the directory. I'd like to know how you generated the dmp files. Does VMR provide them?

jaebeom-kim avatar Sep 11 '24 06:09 jaebeom-kim

Thank you for the detailed explanation! I want to provide users with both: a viral DB with NCBI taxonomy and a viral DB with VMR taxonomy. There are two reasons why I wanted to know how you generated the dmp files.

  1. I want to be able to build a DB with VMR myself in order to provide an updated DB when a new VMR is released.
  2. I want to build a VMR-based DB using human and viral genomes. Could you share your knowledge or scripts?

jaebeom-kim avatar Sep 11 '24 12:09 jaebeom-kim

As a first step, take the "species" column from VMR and try to map it onto the NCBI names.dmp file.

It is interesting how many species from VMR cannot be mapped to the current NCBI taxonomy.

I'll do it myself later today as well. I want to recreate the Metabuli reference database with accession2speciestaxid.
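The mapping step described above can be sketched roughly as follows. This is only an illustration, not the script used in this thread: the function names and the `Species` column header are assumptions, and the VMR sheet is assumed to have been exported from xlsx to a tab-separated file first. It matches VMR species names exactly against the "scientific name" rows of names.dmp and reports the ones that have no match.

```python
import csv


def load_scientific_names(names_dmp_lines):
    """Return {name: taxid} for the 'scientific name' rows of an NCBI-style names.dmp."""
    name_to_taxid = {}
    for line in names_dmp_lines:
        # names.dmp fields are separated by "\t|\t" and each row ends with "\t|"
        fields = [f.strip() for f in line.rstrip("\n").split("|")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            name_to_taxid[fields[1]] = int(fields[0])
    return name_to_taxid


def map_species(vmr_tsv_lines, name_to_taxid, column="Species"):
    """Split the unique VMR species names into mapped ({name: taxid}) and unmapped (sorted list).

    `column` is an assumed header name; adjust it to the header in your VMR export.
    """
    mapped, unmapped = {}, set()
    for row in csv.DictReader(vmr_tsv_lines, delimiter="\t"):
        species = (row.get(column) or "").strip()
        if not species:
            continue
        if species in name_to_taxid:
            mapped[species] = name_to_taxid[species]
        else:
            unmapped.add(species)
    return mapped, sorted(unmapped)
```

Both functions accept any iterable of lines, so they can be called with open file handles, for example `map_species(open("vmr.39.tsv"), load_scientific_names(open("names.dmp")))`. Exact string matching is the simplest choice here; it will not catch renamed species or synonyms.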

igortru avatar Sep 11 '24 12:09 igortru

The NCBI taxonomy is outdated at the species level as well. I see many ICTV species that are simply not present in the NCBI taxonomy (species without a names.dmp entry after them have no match):

Alphacrustrhavirus wenling 2846657 | Alphacrustrhavirus wenling | | scientific name |
Alphacrustrhavirus zhejiang 2846696 | Alphacrustrhavirus zhejiang | | scientific name |
Alphadintovirus mayetiola 2843674 | Alphadintovirus mayetiola | | scientific name |
Alphadrosrhavirus hubei 2844861 | Alphadrosrhavirus hubei | | scientific name |
Alphadrosrhavirus shayang 2846193 | Alphadrosrhavirus shayang | | scientific name |
Alphaendornavirus agarici 2734345 | Alphaendornavirus agarici | | scientific name |
Alphaendornavirus basellae
Alphaendornavirus capsici
Alphaendornavirus cucumis
Alphaendornavirus cyamopsis
Alphaendornavirus erysiphes
Alphaendornavirus fucapsici
Alphaendornavirus fuphaseoli
Alphaendornavirus helianthi

igortru avatar Sep 11 '24 13:09 igortru

s3://serratus-public/igortu/metabuli.VMR.39/vmr.gc - genetic codes for GenBank accessions

s3://serratus-public/igortu/metabuli.VMR.39/vmr.39.tsv - VMR.39 with typos in GenBank accessions fixed. To see my fixes, export the original xlsx to tsv and compare it with this version.

s3://serratus-public/igortu/metabuli.VMR.39/ReadTSV.py

You can modify the script as you need. Running python ReadTSV.py vmr.39.tsv vmr.gc produces all the dmp files.
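For readers without access to ReadTSV.py, the general idea of turning a lineage table into NCBI-style dmp files might look like the sketch below. It is not a reconstruction of that script: the rank list, the function names and the taxid numbering scheme are all assumptions made for illustration. Only the first three columns of nodes.dmp (taxid, parent taxid, rank) are written, so check what your downstream tool expects before relying on this layout.

```python
# Ranks considered, from highest to lowest (an assumed subset of the VMR columns).
RANKS = ["realm", "kingdom", "phylum", "class", "order", "family", "genus", "species"]


def build_taxonomy(lineages):
    """Assign fresh taxids to every (rank, name) pair seen in the lineages.

    `lineages` is an iterable of dicts mapping rank -> name; empty ranks are skipped.
    Returns (nodes, names): nodes maps taxid -> (parent_taxid, rank),
    names maps taxid -> scientific name. Taxid 1 is the root.
    """
    nodes = {1: (1, "no rank")}
    names = {1: "root"}
    taxid_of = {}
    next_id = 2
    for lineage in lineages:
        parent = 1
        for rank in RANKS:
            name = (lineage.get(rank) or "").strip()
            if not name:
                continue
            key = (rank, name)
            if key not in taxid_of:
                taxid_of[key] = next_id
                nodes[next_id] = (parent, rank)
                names[next_id] = name
                next_id += 1
            parent = taxid_of[key]
    return nodes, names


def format_nodes(nodes):
    """Render nodes.dmp rows (first three NCBI columns only)."""
    return [f"{t}\t|\t{p}\t|\t{r}\t|\n" for t, (p, r) in sorted(nodes.items())]


def format_names(names):
    """Render names.dmp rows, all marked as scientific names."""
    return [f"{t}\t|\t{n}\t|\t\t|\tscientific name\t|\n" for t, n in sorted(names.items())]
```

Because the taxids are generated sequentially, they are not compatible with NCBI taxids, which matches the incompatibility mentioned earlier in this thread.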

igortru avatar Sep 11 '24 13:09 igortru

Viral RefSeq was last synchronized with VMR in May 2023. VMR.39 was released in May 2024.

igortru avatar Sep 11 '24 13:09 igortru

Thank you so much for sharing your DB. I downloaded it and used it to classify SARS-CoV-2 reads, but I found that SARS-CoV-2 is not included in the DB. Is it missing due to the discrepancy between RefSeq and VMR?

  • I'll try to build a VMR-based DB after Korean Thanksgiving.

jaebeom-kim avatar Sep 15 '24 08:09 jaebeom-kim

SARS-CoV-2: severe acute respiratory syndrome coronavirus 2

It is present in VMR.39 as MN908947:NC_045512 and has the new species name "Betacoronavirus pandemicum".

NCBI Taxonomy still reports it (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2697049) under the species "Severe acute respiratory syndrome-related coronavirus".

After you rebuild the Metabuli viral DB, I would suggest trying it on the ICTV challenge: https://ictv-vbeg.github.io/ICTV-TaxonomyChallenge/

igortru avatar Sep 15 '24 12:09 igortru

VMR.39.2 was just released with a new "taxid" column. Check how many isolates are not matched.

igortru avatar Sep 26 '24 22:09 igortru

Hi Igor !! I finally built a virus DB using VMR.39.2 Could you try this database and give feedback? It's available here. https://hulk.mmseqs.com/jaebeom/vmr39.2/ (Some viruses without genbank accession were missed)

Please use this version: https://github.com/jaebeom-kim/Metabuli The DB is not compatible with the latest release.

jaebeom-kim avatar Oct 15 '24 08:10 jaebeom-kim

Very good! I think a good first test would be the ICTV Computational Challenge (ictv.global): just compare the Metabuli output with NCBI taxonomy against the output with ICTV VMR.39.2 and put the differences somewhere on GitHub.

The only question I have for now: how do you deal with the bacterial genomes (accessions for prophages inside complete chromosomes) that are present in VMR?

  • included the full chromosome
  • excluded completely
  • only the prophage region included

igortru avatar Oct 15 '24 12:10 igortru

Betacoronavirus pandemicum

This is a good example of how ridiculous nomenclature can be. Now all sarbecoviruses are called this, and demolishing the previous species level will definitely cause more confusion than understanding.

Lelouchzhu avatar Oct 15 '24 15:10 Lelouchzhu

"Virus Name" is not really a taxonomic rank.

It is an assembly name or sequence name. It can be non-unique, and it is linked to an assembly accession (NCBI), proteome_id (UniRef), or isolate_id (ICTV).

After the introduction of VMR, the taxonomy tree should contain only officially recognized ICTV names; everything else needs to be organized in a different way, outside of the taxonomy. That requires an operational level on top of the archival INSDC/SRA, but it looks like nobody is ready for this step yet.

As one possible solution, see the Serratus.io project, which processes the whole SRA in real time (sponsored by Amazon cloud).

igortru avatar Oct 15 '24 15:10 igortru

@igortru

The only question I have for now: how do you deal with the bacterial genomes (accessions for prophages inside complete chromosomes) that are present in VMR? Included the full chromosome, excluded completely, or only the prophage region included?

I used full sequences.

jaebeom-kim avatar Oct 22 '24 01:10 jaebeom-kim

You need to take the location provided in VMR or exclude them completely; otherwise you will get a lot of false positive hits to bacterial lineages.
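A minimal sketch of the first option, keeping only the prophage region, is shown below. It assumes the location is written after the accession as `ACCESSION (start.end)` with 1-based inclusive coordinates; that format, the function names and the handling of entries without a location are assumptions to verify against your VMR version. Segmented entries with labels and reverse-orientation ranges are not handled.

```python
import re

# Matches "ACCESSION" or "ACCESSION (start.end)"; the coordinate syntax is an assumption.
_ENTRY = re.compile(r"^\s*([A-Za-z_]+\d+)\s*(?:\(\s*(\d+)\s*\.\s*(\d+)\s*\))?\s*$")


def parse_accession(entry):
    """Return (accession, start, end); start and end are None when no location is given."""
    match = _ENTRY.match(entry)
    if match is None:
        raise ValueError(f"unrecognized accession entry: {entry!r}")
    accession, start, end = match.groups()
    if start is None:
        return accession, None, None
    return accession, int(start), int(end)


def extract_region(sequence, start, end):
    """Return the 1-based inclusive slice [start, end], or the whole sequence if no location."""
    if start is None or end is None:
        return sequence
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"region {start}..{end} is outside a sequence of length {len(sequence)}")
    return sequence[start - 1:end]
```

For example, `parse_accession("NC_045512")` returns the accession with no coordinates, so `extract_region` keeps the full sequence, while an entry with a location yields only that slice of the bacterial chromosome.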

igortru avatar Oct 22 '24 04:10 igortru