
Unable to download all Orthomarburg marburgense virus sequences using datasets CLI

Open anna-parker opened this issue 1 year ago • 1 comments

Describe the bug

I am using the most recent version:

datasets --version                                                                        
datasets version: 16.33.0

When I download taxon ID 3052505 (Orthomarburg marburgense virus) using the datasets CLI, I get only 328 sequences. However, NCBI Virus shows that there are 376 Orthomarburg virus sequences: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Orthomarburgvirus%20marburgense,%20taxid:3052505

To Reproduce

I am trying to download all Orthomarburg virus sequences. I use the command

datasets download virus genome taxon 3052505  --no-progressbar --filename results/ncbi_dataset.zip

However when I download the sequences there are only 328.

unzip -jp results/ncbi_dataset.zip ncbi_dataset/data/genomic.fna > results/sequences.fasta
grep -c "^>" results/sequences.fasta 

returns 328

I confirmed this is also the case for metadata

dataformat tsv virus-genome --package results/ncbi_dataset.zip > results/metadata_post_extract.tsv
tail -n +2 results/metadata_post_extract.tsv | wc -l

returns 328 (ignoring the header).

Expected behavior

I expect to download the same number of sequences that I see on NCBI Virus.

Side note: I had a similar issue in https://github.com/ncbi/datasets/issues/411 for CCHF.

I also noticed that most of the missing sequences have accessions starting with JX, e.g. JX458851.1 and JX458853.1, and I think they are all linked to the same publication: https://pubmed.ncbi.nlm.nih.gov/23055920/. So potentially this is again a change to the taxonomy ID.
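For anyone wanting to pin down exactly which accessions are missing, here is a minimal sketch of one way to do it. It assumes you have a plain-text list of the expected accessions, one per line (for example, exported by hand from NCBI Virus), and that the accession is the first whitespace-delimited token of each FASTA header. The function names are made up for illustration and are not part of the datasets CLI.

```shell
# Print the accession from every FASTA header in file $1, sorted and de-duplicated.
# Assumption: the accession is the first token after ">" on each header line.
fasta_accessions() {
  grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort -u
}

# Print accessions listed in $1 (one per line) that are absent from FASTA file $2.
missing_accessions() {
  expected_sorted=$(mktemp)
  found_sorted=$(mktemp)
  sort -u "$1" > "$expected_sorted"
  fasta_accessions "$2" > "$found_sorted"
  # comm -23 keeps only the lines unique to the first (expected) file.
  comm -23 "$expected_sorted" "$found_sorted"
  rm -f "$expected_sorted" "$found_sorted"
}
```

With a hypothetical list file named expected_accessions.txt, running `missing_accessions expected_accessions.txt results/sequences.fasta` would print one missing accession per line, which makes it easy to check whether they share a prefix or a taxonomy ID.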

anna-parker avatar Oct 25 '24 18:10 anna-parker

Hi anna-parker,

Thanks for opening this issue. It seems that TaxID 11269 was recently merged into TaxID 3052505. This change is the cause of the discrepancy you're seeing between NCBI Datasets and NCBI Virus. We're looking into this issue and are working on getting the accessions with the merged TaxID updated.

I'll post a comment when the accessions are updated.

Nuala

olearyna avatar Oct 25 '24 20:10 olearyna

My team is experiencing a similar issue. As best we can tell, something changed in the core or core-nt database sometime last week. Querying Rhinovirus/Enterovirus (taxid:12059) and Chlamydia pneumoniae (taxid:83558) no longer returns as many sequences as it did previously.

We use these databases for primer exclusivity/inclusivity testing and where we would previously get several primer matches for these organisms, we now get none.

cphealy8 avatar Oct 28 '24 14:10 cphealy8

Hi anna-parker and cphealy8,

The problem has been resolved. Please let us know if you encounter any other issues.

Moving forward, we’ll work on minimizing any delays in updates between NCBI Virus and Datasets to ensure smoother data consistency.

Nuala

olearyna avatar Oct 30 '24 16:10 olearyna

Thank you @olearyna!

anna-parker avatar Oct 30 '24 21:10 anna-parker