Save downloaded genomes from NCBI
Thanks so much for your continued work on this really helpful workflow, @AstrobioMike !
I have a feature request (so low priority) regarding GToTree. Currently, if a list of NCBI assembly accession numbers is provided as input to GToTree (via -a), GToTree automatically downloads the genome for each accession, predicts amino acids when amino acid files don't already exist, and then runs the SCG search/alignment workflow. Being able to download genomes from NCBI like this is extremely helpful. However, I sometimes find myself wanting to work with the amino acid sequence files for the analyzed genomes after GToTree is finished. It seems like GToTree deletes these amino acid files (and does not save them even in the tmp directory with -d, debug mode). Might it be possible to add a flag to keep these files or to preserve them when debug mode (-d) is set?
Again, this is not urgent, because I can just download the genomes again myself if needed. Thanks so much in advance, and again, I have so appreciated this useful tool!
Hi there, @jmtsuji!
Thanks for the kind words!
I will look into adding an option for this when I can, or at least there’s certainly no reason they shouldn’t be saved with the debug flag like you tried!
You mentioned you could download them yourself, but I’ll also note that I have the same NCBI download functionality packaged with my bit package for this very purpose; it takes input assembly accessions just like GToTree.
The conda install steps are here: https://github.com/AstrobioMike/bit?tab=readme-ov-file#conda-install
Then you’d want the program bit-dl-ncbi-assemblies; passing -f protein along with the wanted input accessions would download the amino acid files if they are available, if that’s helpful to you.
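For anyone scripting this, here is a minimal Python sketch of the download step described above. It only writes an accessions file and assembles the command line. The program name and `-f protein` come from this thread, but the `-w` flag used to pass the accessions file is an assumption, as are the helper names, so please check `bit-dl-ncbi-assemblies -h` before relying on it.

```python
import shutil
import subprocess
from pathlib import Path


def write_accessions_file(accessions, path):
    """Write one assembly accession per line and return the file path."""
    path = Path(path)
    path.write_text("\n".join(accessions) + "\n")
    return path


def build_download_command(accessions_file, fmt="protein"):
    """Assemble the bit-dl-ncbi-assemblies call as an argument list."""
    # "-f protein" is the option mentioned in this thread.
    # "-w" for the accessions file is an ASSUMPTION, not confirmed here.
    return ["bit-dl-ncbi-assemblies", "-w", str(accessions_file), "-f", fmt]


def run_if_available(command):
    """Run the command only if the program is on PATH; otherwise just report it."""
    if shutil.which(command[0]) is None:
        print("Program not found on PATH; would have run:", " ".join(command))
        return None
    return subprocess.run(command, check=True)
```

Calling `write_accessions_file(["GCF_000005845.2"], "wanted-accessions.txt")`, passing the result to `build_download_command`, and handing that to `run_if_available` would attempt the download only when bit is installed in the active environment.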
Thanks for the suggestion!
@AstrobioMike Thanks for the quick response! Also, good to know about bit; bit-dl-ncbi-assemblies could potentially be quite useful. All the best!
This was a looong time ago, @jmtsuji, but this is finally implemented now as of v1.8.10 :)
- with the debug flag set while running, GToTree will keep specific files in <output_dir>/<tmp_dir>/ncbi-downloads/:
  - if amino-acid seqs are used, it will keep the downloaded amino-acid seqs
  - if there were no amino-acid seqs, and the genome had to be downloaded, it will keep the downloaded genome and the prodigal-called amino-acid seqs
  - if using nucleotide mode (-z), it will keep the downloaded genome and the prodigal-called nt cds and amino-acid seqs
It's not ideal for all situations, as the amino-acid seqs aren't going to be named as the single-copy gene they are identified as, they will instead have the names NCBI has for them. But everything downloaded will at least be saved now with the debug flag if wanting to use it for something else or look for specific things after.
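If it helps when looking through what was kept, below is a small Python sketch that groups the files under the ncbi-downloads directory by extension. The function names are hypothetical, and no particular file names or extensions are assumed for GToTree's output: the directory path is passed in as-is and whatever is found there is reported.

```python
from collections import defaultdict
from pathlib import Path


def file_kind(path):
    """Return the last extension, keeping a trailing .gz paired with the one before it."""
    suffixes = Path(path).suffixes
    if not suffixes:
        return "(no extension)"
    if suffixes[-1] == ".gz" and len(suffixes) > 1:
        return suffixes[-2] + ".gz"
    return suffixes[-1]


def summarize_kept_downloads(ncbi_downloads_dir):
    """Group every file under the given directory by its extension."""
    root = Path(ncbi_downloads_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {root}")
    grouped = defaultdict(list)
    # rglob("*") walks subdirectories too, in case files are nested.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            grouped[file_kind(path)].append(path)
    return dict(grouped)
```

Pointing `summarize_kept_downloads` at the ncbi-downloads directory from a debug-mode run returns a dictionary mapping each extension to the list of matching paths, which can then be filtered for the amino-acid files.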
Thank you for the suggestion!
@AstrobioMike Wow, thanks for implementing this! I will have to give this a try the next time I am using GToTree with downloaded genomes. All the best!