RabbitTClust

missing tips in newick tree

Open Djeppschmidt opened this issue 2 years ago • 3 comments

Hello,

I really appreciate the Newick output format that you recently introduced!

I think this is a bug in building the tree. Working with the Newick file, it appears the tree is not built as expected: about half of the nodes are internal nodes labeled with names that should actually be tips on the tree. For example, I ran RabbitTClust to cluster all Salmonella in the NCBI pathogen database (~500k isolates) using the following command:

clust-mst -d 0.001 -l -i fasta_input.txt --newick-tree -o sal.mst.clust.0001

This generates a tree with ~270k tips and ~238k internal nodes (it should have ~500k tips).

I also ran a tiny version of this with 8 isolates, which produced 3 tips and 5 internal nodes:

(((/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863221_contigs_skesa.fasta:0.000794,(/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863395_contigs_skesa.fasta:0.016157)/isilon/NCBI/SRAassemblies/skesa_contigs/SRR900926_contigs_skesa.fasta:0.000969,(/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863393_contigs_skesa.fasta:0.001294)/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863392_contigs_skesa.fasta:0.013981)/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863223_contigs_skesa.fasta:0.000000)/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863224_contigs_skesa.fasta:0.020389)/isilon/NCBI/SRAassemblies/skesa_contigs/SRR863396_contigs_skesa.fasta;

This makes it impossible to filter the tree by tips, because half the isolates appear as node labels when I believe they should be tip labels.
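One way to check the tip/internal split described above is to count labels by the character that precedes them in the Newick string. This is a minimal sketch, not part of RabbitTClust: the function name `count_labels` is hypothetical, the file paths from the 8-isolate example are shortened to their SRR accessions for readability, and it assumes unquoted labels that contain no parentheses, commas, colons, or semicolons.

```python
import re

# Characters that cannot appear in an unquoted Newick label (assumption:
# labels are unquoted, as in the example output above).
LABEL = r"[^():,;]+"


def count_labels(newick: str) -> tuple[int, int]:
    """Return (number of tip labels, number of labeled internal nodes)."""
    # A tip label directly follows "(" or ",".
    tips = re.findall(rf"[(,]({LABEL})", newick)
    # An internal-node label directly follows a closing ")".
    internal = re.findall(rf"\)({LABEL})", newick)
    return len(tips), len(internal)


# The 8-isolate tree from above, with paths reduced to accessions.
tree = (
    "(((SRR863221:0.000794,(SRR863395:0.016157)SRR900926:0.000969,"
    "(SRR863393:0.001294)SRR863392:0.013981)SRR863223:0.000000)"
    "SRR863224:0.020389)SRR863396;"
)

print(count_labels(tree))  # (3, 5)
```

For the small tree this reports 3 tip labels and 5 labeled internal nodes, which matches the counts above.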

I'm curious if anyone else is experiencing this issue? Or maybe I'm missing something?

Thanks for your help, Dietrich

Djeppschmidt, Jul 13 '23 18:07

Hi Dietrich,

Thanks for your issue!

I am on a business trip this month. I will look into it and let you know when I have made progress.

Best, Xiaoming

XiaomingXu1995, Jul 14 '23 11:07

Hi Xiaoming,

I have the same issue as Dietrich: I was hoping the tree would output each cluster as a leaf. Instead, most clusters are present as named internal nodes.

Have you had time to look into it? I believe the cause could be zeros in the distance matrix, since some sequences could be considered subsets of others. Replacing those zeros with a very small distance value might produce what Dietrich and I expect.

If you could point me to where in your code the Newick tree is built, I could look into it further.

Best, Arnaud

avw-adifranco, Nov 24 '23 09:11

Apologies for the delayed response.

The Newick tree in RabbitTClust is an output format for the minimum spanning tree (MST) generated by clust-mst. Unfortunately, it is not possible to make every genome a leaf node: the edges of the MST connect genomes directly to one another, so any genome with more than one neighbor in the MST appears as an internal node.

Best, Xiaoming

XiaomingXu1995, Nov 29 '23 06:11
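For readers who still need every genome as a tip for downstream filtering, one possible post-processing workaround is to rewrite the Newick string so that each labeled internal node becomes a zero-length leaf hanging off an unlabeled internal node. This is only a sketch and not a RabbitTClust feature: the function name `promote_internal_labels` is hypothetical, it assumes unquoted labels without parentheses, commas, colons, or semicolons, and the result is still the MST re-expressed, not a phylogeny.

```python
import re

# Characters that cannot appear in an unquoted Newick label (assumption).
LABEL = r"[^():,;]+"


def promote_internal_labels(newick: str) -> str:
    """Turn every labeled internal node into a zero-length tip.

    "(children)NAME:len" becomes "(children,NAME:0):len", so NAME is now a
    leaf attached at distance 0 to the (now unlabeled) internal node.
    """
    return re.sub(rf"\)({LABEL})", r",\1:0)", newick)


# Tiny made-up example: C, D and E start out as labeled internal nodes.
mst_newick = "((A:1,(B:2)C:3)D:4)E;"
print(promote_internal_labels(mst_newick))
# ((A:1,(B:2,C:0):3,D:0):4,E:0);
```

Because the added branches have length 0, path lengths between genomes are the same as in the original MST output; only the labeling changes, so tools that filter or prune by tip name can see every genome.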