
Save downloaded genomes from NCBI

Open jmtsuji opened this issue 1 year ago • 2 comments

Thanks so much for your continued work on this really helpful workflow, @AstrobioMike !

I have a feature request (so low priority) regarding GToTree. Currently, if a list of NCBI assembly accession numbers is provided as input to GToTree (via -a), GToTree automatically downloads the genome for each accession, predicts amino acid sequences when amino acid files don't already exist, and then runs the SCG search/alignment workflow. Being able to download genomes from NCBI like this is extremely helpful. However, I sometimes find myself wanting to work with the amino acid sequence files for the analyzed genomes after GToTree is finished. It seems like GToTree deletes these amino acid files (and does not save them even in the tmp directory with -d, debug mode). Might it be possible to add a flag to keep these files, or to preserve them when debug mode (-d) is set?

Again, this is not urgent, because I can just download the genomes again myself if needed. Thanks so much in advance, and again, I have so appreciated this useful tool!

jmtsuji avatar Jun 05 '24 13:06 jmtsuji

Hi there, @jmtsuji!

Thanks for the kind words!

I will look into adding an option for this when I can, or at least there’s certainly no reason they shouldn’t be saved with the debug flag like you tried!

You mentioned you could download them yourself, but I’ll also note that I have the same NCBI download functionality packaged in my bit package for this very purpose; it takes input assembly accessions just like GToTree.

The conda install steps are here: https://github.com/AstrobioMike/bit?tab=readme-ov-file#conda-install

and then you’d want the program bit-dl-ncbi-assemblies; passing -f protein along with the wanted input accessions will download the amino acid files if they are available, if that’s helpful to you.
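
For anyone scripting this, here is a minimal Python sketch of wrapping that call. The program name and `-f protein` come from the comment above; the `-w` flag for the accessions file and the helper names are assumptions for illustration, so check `bit-dl-ncbi-assemblies -h` before relying on them.

```python
import shlex
import shutil
import subprocess


def build_download_command(accessions_file, fmt="protein"):
    """Build the argument list for bit-dl-ncbi-assemblies.

    `-f protein` is taken from the thread; `-w` as the flag for the
    accessions file is an assumption and should be checked against the
    program's help output.
    """
    return ["bit-dl-ncbi-assemblies", "-w", str(accessions_file), "-f", fmt]


def download_proteins(accessions_file):
    """Run the download if bit is installed; otherwise only print the command."""
    cmd = build_download_command(accessions_file)
    if shutil.which(cmd[0]) is None:
        print("bit not found on PATH; would run:", shlex.join(cmd))
        return None
    # check=True raises CalledProcessError if the download program fails.
    return subprocess.run(cmd, check=True)
```

Calling `download_proteins("accessions.txt")` with a one-accession-per-line file (a hypothetical filename) would then fetch the protein files where NCBI provides them.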

Thanks for the suggestion!

AstrobioMike avatar Jun 05 '24 13:06 AstrobioMike

@AstrobioMike Thanks for the quick response! Also, good to know about bit; bit-dl-ncbi-assemblies could potentially be quite useful. All the best!

jmtsuji avatar Jun 20 '24 16:06 jmtsuji

This was a looong time ago, @jmtsuji, but this is finally implemented now as of v1.8.10 :)

  • with the debug flag set while running, GToTree will keep specific files in <output_dir>/<tmp_dir>/ncbi-downloads/:
    • if amino-acid seqs are used, it will keep the downloaded amino-acid seqs
    • if there were no amino-acid seqs, and the genome had to be downloaded, it will keep the downloaded genome and the prodigal-called amino-acid seqs
    • if using nucleotide mode (-z), it will keep the downloaded genome and the prodigal-called nt cds and amino-acid seqs

It's not ideal for all situations, as the amino-acid seqs won't be named after the single-copy genes they are identified as; they will instead keep the names NCBI has for them. But everything downloaded will at least be saved now with the debug flag, in case you want to use it for something else or look for specific things afterward.
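
To see what was retained after a run with the debug flag, something like the following sketch could be used. It only assumes the `<output_dir>/<tmp_dir>/ncbi-downloads/` layout described above; the function name is hypothetical, the temporary directory name has to be supplied by the user, and no particular file naming or extensions are assumed.

```python
from collections import defaultdict
from pathlib import Path


def list_kept_downloads(output_dir, tmp_dir_name):
    """Group the files kept under ncbi-downloads/ by file extension.

    tmp_dir_name is whatever the temporary directory is called in your
    run; it is a placeholder here, not a GToTree default.
    """
    downloads = Path(output_dir) / tmp_dir_name / "ncbi-downloads"
    if not downloads.is_dir():
        raise FileNotFoundError(f"{downloads} not found; was GToTree run with -d?")

    grouped = defaultdict(list)
    for path in sorted(downloads.rglob("*")):
        if not path.is_file():
            continue
        # Ignore a trailing .gz so compressed and plain files group together.
        name = path.name[:-3] if path.name.endswith(".gz") else path.name
        grouped[Path(name).suffix or "(no suffix)"].append(path)
    return dict(grouped)
```

Printing the returned dictionary gives a quick overview of which genomes, coding sequences, and amino-acid files are available for follow-up work.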

Thank you for the suggestion!

AstrobioMike avatar Feb 03 '25 15:02 AstrobioMike

@AstrobioMike Wow, thanks for implementing this! I will have to give this a try the next time I am using GToTree with downloaded genomes. All the best!

jmtsuji avatar Feb 03 '25 15:02 jmtsuji