Unable to download all Orthomarburgvirus marburgense sequences using datasets CLI
Describe the bug
I am using the most recent version:
datasets --version
datasets version: 16.33.0
When I download taxon ID 3052505 (Orthomarburgvirus marburgense) using the datasets CLI, I get only 328 sequences. However, NCBI Virus shows 376 Orthomarburgvirus marburgense sequences: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Orthomarburgvirus%20marburgense,%20taxid:3052505
To Reproduce
I am trying to download all Orthomarburgvirus marburgense sequences with the command:
datasets download virus genome taxon 3052505 --no-progressbar --filename results/ncbi_dataset.zip
However, the download contains only 328 sequences:
unzip -jp results/ncbi_dataset.zip ncbi_dataset/data/genomic.fna > results/sequences.fasta
grep -c "^>" results/sequences.fasta
returns 328
I confirmed that the metadata shows the same count:
dataformat tsv virus-genome --package results/ncbi_dataset.zip > results/metadata_post_extract.tsv
tail -n +2 results/metadata_post_extract.tsv | wc -l
returns 328 (ignoring the header).
Expected behavior
I expect to download the same number of sequences that I see on NCBI Virus.
Side note
I had a similar issue in https://github.com/ncbi/datasets/issues/411 for CCHF.
I also noticed that most of the missing sequences have accessions starting with JX, e.g. JX458851.1 and JX458853.1, and I think they are all linked to the same publication: https://pubmed.ncbi.nlm.nih.gov/23055920/. So this is potentially a taxonomy ID change again.
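For anyone wanting to reproduce the comparison, here is a minimal sketch of how the missing accessions could be listed. It assumes a plain-text file with one expected accession per line (for example, exported by hand from NCBI Virus); the function name `missing_accessions` and that file are hypothetical and not part of the datasets CLI.

```shell
# Print accessions that appear in an expected list but not in a FASTA file.
# Assumptions (hypothetical, for illustration only):
#   $1: text file with one expected accession per line
#   $2: FASTA file whose headers start with ">ACCESSION ", as in genomic.fna
missing_accessions() {
    expected_list="$1"
    fasta="$2"
    tmp_want="$(mktemp)"
    tmp_have="$(mktemp)"

    # Take the first whitespace-delimited token of each header, minus the ">".
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | LC_ALL=C sort -u > "$tmp_have"
    LC_ALL=C sort -u "$expected_list" > "$tmp_want"

    # comm -23 keeps lines found only in the first (expected) file.
    LC_ALL=C comm -23 "$tmp_want" "$tmp_have"

    rm -f "$tmp_want" "$tmp_have"
}
```

Calling `missing_accessions expected_accessions.txt results/sequences.fasta` should then print one missing accession per line, which makes it easier to check whether they share a prefix such as JX.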
Hi anna-parker,
Thanks for opening this issue. It seems that TaxID 11269 was recently merged into TaxID 3052505. This change is the cause of the discrepancy you're seeing between NCBI Datasets and NCBI Virus. We're looking into this issue and are working on getting the accessions with the merged TaxID updated.
I'll post a comment when the accessions are updated.
Nuala
My team is experiencing a similar issue. As best we can tell, something changed in the core or core-nt database sometime last week. Querying Rhinovirus/Enterovirus (taxid:12059) and Chlamydia pneumoniae (taxid:83558) does not return as many sequences as it did previously.
We use these databases for primer exclusivity/inclusivity testing. Where we previously got several primer matches for these organisms, we now get none.
Hi anna-parker and cphealy8,
The problem has been resolved. Please let us know if you encounter any other issues.
Moving forward, we’ll work on minimizing any delays in updates between NCBI Virus and Datasets to ensure smoother data consistency.
Nuala
Thank you @olearyna!