refseq_virus/NCBI taxonomy is more than a year behind the current VMR.39
I would suggest rebuilding the Metabuli viral reference database using the current official viral taxonomy provided at https://ictv.global/vmr
Technically, it is not difficult. Your tool processed it very smoothly. If you are interested, I can provide a URL where the new database can be downloaded. You will be surprised how many virus names/lineages cannot be synchronized between these two major resources: thousands...
Thank you for pointing this out and great to hear that Metabuli was working well. I should check the resource right away. Could you please provide the url to download the DB? I want to test and reproduce it. Does the resource provide NCBI-style .dmp files like nodes.dmp and names.dmp?
Do you have access to AWS S3? s3://serratus-public/igortu/metabuli.VMR.39/ or https://serratus-public.s3.amazonaws.com/igortu/metabuli.VMR.39/
This directory contains all required files, including the dmp files and a nucleotide FASTA (>2 GB). The whole VMR taxonomy now has new taxids that are incompatible with NCBI taxonomy. Full synchronization between VMR and NCBI taxonomy IDs is impossible right now.
Great! I'm downloading the directory. I'd like to know how you generated the dmp files. Does VMR provide them?
Thank you for the detailed explanation! I want to provide users with both: a viral DB with NCBI taxonomy and a viral DB with VMR taxonomy. There are some reasons why I wanted to know how you generated the dmp files.
- I want to be able to build a VMR-based DB myself so that I can provide an updated DB when a new VMR is released.
- I want to build a VMR-based DB using human and viral genomes.
Could you share your knowledge or scripts?
As a first step, take the "species" column from VMR and try to map its values to the NCBI names.dmp file.
It is interesting to see how many species from VMR cannot be mapped to the current NCBI taxonomy.
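The mapping step described above can be sketched roughly as follows. This is only a minimal illustration, not the script used in this thread: it assumes the VMR export is a tab-separated file whose species column is named `Species`, and it matches names exactly against the "scientific name" rows of names.dmp. The function names are made up for this example.

```python
import csv


def load_scientific_names(names_dmp_lines):
    """Map each NCBI scientific name to its taxid."""
    name_to_taxid = {}
    for line in names_dmp_lines:
        # Rows look like: taxid \t|\t name \t|\t unique name \t|\t name class \t|
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            name_to_taxid[fields[1]] = int(fields[0])
    return name_to_taxid


def read_vmr_species(tsv_lines, column="Species"):
    """Return the unique, non-empty species names in file order.

    The column name is an assumption; adjust it to the actual VMR header.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    seen = {}
    for row in reader:
        name = (row.get(column) or "").strip()
        if name:
            seen.setdefault(name, None)
    return list(seen)


def split_by_ncbi_match(species, name_to_taxid):
    """Separate species with an exact NCBI name match from those without."""
    mapped = {s: name_to_taxid[s] for s in species if s in name_to_taxid}
    unmapped = [s for s in species if s not in name_to_taxid]
    return mapped, unmapped
```

Both loaders accept any iterable of lines, so open file handles for names.dmp and the VMR tsv can be passed directly; the length of `unmapped` is then the number of VMR species with no exact NCBI counterpart.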
I'll do it myself later today as well. I want to recreate the Metabuli reference database with accession2speciestaxid.
NCBI taxonomy is outdated at the species level as well. I see many ICTV species that are simply not present in NCBI taxonomy:
Alphacrustrhavirus wenling
2846657 | Alphacrustrhavirus wenling | | scientific name |
Alphacrustrhavirus zhejiang
2846696 | Alphacrustrhavirus zhejiang | | scientific name |
Alphadintovirus mayetiola
2843674 | Alphadintovirus mayetiola | | scientific name |
Alphadrosrhavirus hubei
2844861 | Alphadrosrhavirus hubei | | scientific name |
Alphadrosrhavirus shayang
2846193 | Alphadrosrhavirus shayang | | scientific name |
Alphaendornavirus agarici
2734345 | Alphaendornavirus agarici | | scientific name |
Alphaendornavirus basellae
Alphaendornavirus capsici
Alphaendornavirus cucumis
Alphaendornavirus cyamopsis
Alphaendornavirus erysiphes
Alphaendornavirus fucapsici
Alphaendornavirus fuphaseoli
Alphaendornavirus helianthi
s3://serratus-public/igortu/metabuli.VMR.39/vmr.gc - genetic codes for GenBank accessions
s3://serratus-public/igortu/metabuli.VMR.39/vmr.39.tsv - VMR.39 with typos in GenBank accessions fixed. You can compare it with the original version to see my fixes; just export the xlsx to tsv.
s3://serratus-public/igortu/metabuli.VMR.39/ReadTSV.py
You can modify the script as you need. Running python ReadTSV.py vmr.39.tsv vmr.gc produces all dmp files.
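For readers who want a feel for what generating the dmp files involves, here is a rough sketch. It is not ReadTSV.py and makes no claim about how that script works: it assumes each VMR row has already been turned into a list of (rank, name) pairs ordered from the highest rank down to species, assigns fresh sequential taxids (which, as noted above, are not NCBI taxids), and writes only the leading columns of the NCBI dmp layout. Real NCBI nodes.dmp rows carry additional columns, and whether a given tool needs them is not checked here.

```python
ROOT_TAXID = 1


def build_taxonomy(lineages, first_taxid=2):
    """Assign new taxids to every distinct (rank, name) in the lineages.

    Returns (nodes, names) where nodes maps taxid -> (parent_taxid, rank)
    and names maps taxid -> scientific name.
    """
    nodes = {ROOT_TAXID: (ROOT_TAXID, "no rank")}
    names = {ROOT_TAXID: "root"}
    ids = {}  # (rank, name) -> taxid
    next_id = first_taxid
    for lineage in lineages:
        parent = ROOT_TAXID
        for rank, name in lineage:
            if not name:
                continue  # ranks left empty in a row are skipped
            key = (rank, name)
            if key not in ids:
                ids[key] = next_id
                nodes[next_id] = (parent, rank)
                names[next_id] = name
                next_id += 1
            parent = ids[key]
    return nodes, names


def nodes_dmp_lines(nodes):
    """First three nodes.dmp columns: taxid, parent taxid, rank."""
    return [f"{t}\t|\t{p}\t|\t{r}\t|\n" for t, (p, r) in sorted(nodes.items())]


def names_dmp_lines(names):
    """names.dmp columns: taxid, name, unique name (empty), name class."""
    return [
        f"{t}\t|\t{n}\t|\t\t|\tscientific name\t|\n"
        for t, n in sorted(names.items())
    ]
```

Writing the two returned line lists to nodes.dmp and names.dmp gives a small NCBI-style taxonomy. Keying taxids on (rank, name) means a shared parent such as a genus is created once and reused by every species below it.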
Viral RefSeq was last synchronized with VMR in May 2023. VMR.39 was released in May 2024.
Thank you so much for sharing your DB. I downloaded it and used it to classify SARS-CoV-2 reads, but I found that SARS-CoV-2 is not included in the DB. Is it missing because of the discrepancy between RefSeq and VMR?
- I'll try to build a VMR-based DB after Korean Thanksgiving.
SARS-CoV-2: severe acute respiratory syndrome coronavirus 2
It is present in VMR.39 as MN908947:NC_045512 and has the new species name "Betacoronavirus pandemicum".
NCBI Taxonomy still reports it (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2697049) under the species "Severe acute respiratory syndrome-related coronavirus".
After you rebuild the Metabuli viral DB, I would suggest trying it on the ICTV challenge: https://ictv-vbeg.github.io/ICTV-TaxonomyChallenge/
VMR.39.2 was just released with a new "taxid" column. Check how many isolates are not matched.
Hi Igor !! I finally built a virus DB using VMR.39.2 Could you try this database and give feedback? It's available here. https://hulk.mmseqs.com/jaebeom/vmr39.2/ (Some viruses without genbank accession were missed)
Please use this version: https://github.com/jaebeom-kim/Metabuli The DB is not compatible with the latest release.
Very good! I think a good first test would be the ICTV Computational Challenge (ictv.global): just compare the Metabuli output for NCBI taxonomy vs. ICTV VMR.39.2 and put the differences somewhere on GitHub.
Only question I have for now: how do you deal with bacterial genomes (accessions for prophages inside complete chromosomes) that are present in VMR?
- included the full chromosome
- excluded completely
- only the prophage region included?
(Reply to Jaebeom Kim's message of Oct 15, 2024.)
Betacoronavirus pandemicum
This is a good example of how ridiculous nomenclature can be. Now all sarbecoviruses are called this, and demolishing the previous species level will definitely cause more confusion than understanding.
"Virus Name" is not really a taxonomic rank.
It is an assembly name or sequence name; it can be non-unique and is linked to an assembly accession (NCBI), proteome_id (UniRef), or isolate_id (ICTV).
After the introduction of VMR, the taxonomy tree should contain only officially recognized ICTV names; everything else needs to be organized in a different way, outside of the taxonomy. That requires an operational level on top of the archival INSDC/SRA, but it looks like nobody is ready for this step yet.
As one possible solution, see the Serratus.io project, which processes the whole SRA in real time (sponsored by Amazon cloud).
@igortru
Only question I have for now: How you deal with bacterial genomes (accessions for prophages inside complete chromosomes) which presented in VMR?
- included full chromosome
- excluded completely
- only prophage region included?
I used full sequences.
You need to take the location provided in VMR or exclude them completely; otherwise you will get a lot of false positive hits to bacterial lineages.
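One way to apply the location is to parse it out of the accession field and cut the sequence before adding it to the DB. The sketch below assumes the field looks like `ACCESSION (start.end)`, optionally with a `label:` prefix and with several entries separated by `;`. That layout is an assumption to check against the actual VMR file, and the function names and example accessions are hypothetical. Coordinates are treated as 1-based and inclusive, and strand is ignored.

```python
import re

# Optional "label:" prefix, an accession, and an optional "(start.end)" range.
ENTRY = re.compile(
    r"(?:(?P<label>[^:]+):\s*)?"
    r"(?P<acc>[A-Za-z0-9_.]+)\s*"
    r"(?:\((?P<start>\d+)\.(?P<end>\d+)\))?"
)


def parse_accessions(field):
    """Return a list of (accession, start, end); start/end are None if absent."""
    entries = []
    for part in field.split(";"):
        part = part.strip()
        if not part:
            continue
        m = ENTRY.fullmatch(part)
        if m is None:
            raise ValueError(f"unrecognized accession entry: {part!r}")
        start = int(m["start"]) if m["start"] else None
        end = int(m["end"]) if m["end"] else None
        entries.append((m["acc"], start, end))
    return entries


def extract_region(sequence, start, end):
    """Keep only the 1-based inclusive region; return the whole sequence if no range."""
    if start is None or end is None:
        return sequence
    lo, hi = sorted((start, end))
    return sequence[lo - 1:hi]
```

For example, `parse_accessions("ACC000001 (100.200)")` yields one entry with the range 100 to 200, and `extract_region` then keeps only that slice of the chromosome. Entries without a range pass through unchanged, so ordinary viral genomes are not affected.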