
Comparison with PhyLiNC, PhyloDAG, SNAQ, PhyloNET MPL, and PhyloNET ML on simulated data

Open lutteropp opened this issue 4 years ago • 34 comments

The complete PhyLiNC output for a simulated dataset (10 taxa, 1 reticulation, 2000 MSA sites), run on the PhD laptop. I set the maximum number of reticulations it should try to 2, and it turns out that PhyLiNC overshot and inferred a 2-reticulation network. As they use the original NEPAL likelihood model with unlinked sites, this is expected; we had the same problem with that model back then. PhyLiNC also ran into some more issues and errors down the line. phylinc_output.txt

lutteropp avatar Aug 08 '21 22:08 lutteropp

The simulated dataset, the RAxML-NG best ML tree, the PhyLiNC inferred network, and the networks inferred by several NetRAX variants: datasets_phylinc_exp_smaller.zip

lutteropp avatar Aug 14 '21 11:08 lutteropp

PhyLiNC result on the PhD laptop, with max_reticulations set to 2, starting from the RAxML-NG best ML tree: Total inference runtime: 38365.49 seconds. Inferred a network with 2 reticulations. Printed multiple error messages (ERROR found on PhyLiNC for run 5 seed 17293: RootMismatch: non-leaf node 22 had 0 children. Could be a hybrid whose parents' direction conflicts with the root. isChild1 and containRoot were updated for a subset of edges in the network only.)


NetRAX results on the PhD laptop for the simulated 10-taxon 1-reticulation dataset:

  • Start from RAxML-NG best ML tree, with LikelihoodModel.AVERAGE: Total inference runtime: 3.0 seconds. Best inferred network has 1 reticulations, logl = -15123.43304, bic = 30643.00558 0_0_single_average_result.txt

  • Start from all unique trees with 10 random and 10 parsimony, with LikelihoodModel.AVERAGE: 11 unique start tree topologies. Total inference runtime: 40.0 seconds. Best inferred network has 1 reticulations, logl = -15123.32165, bic = 30642.7828, logl = -727209.282, bic = 1499403.751 0_0_multi_average_result.txt

  • Start from RAxML-NG best ML tree, with LikelihoodModel.BEST: Total inference runtime: 2.0 seconds. Best inferred network has 1 reticulations, logl = -15123.43304, bic = 30643.00558 0_0_single_best_result.txt

  • Start from all unique trees with 10 random and 10 parsimony, with LikelihoodModel.BEST: 11 unique start tree topologies. Total inference runtime: 28.0 seconds. Best inferred network has 1 reticulations, logl = -15123.32165, bic = 30642.7828 0_0_multi_best_result.txt
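
As a sanity check on the numbers above: the reported bic values appear consistent with the standard definition BIC = k * ln(n) - 2 * logl. The sketch below is a reconstruction under assumptions, not NetRAX code. The values k = 40 free parameters and n = 10 taxa * 2000 sites = 20000 are not stated anywhere in this thread; they were back-calculated because they reproduce the reported bic for the reported logl.

```python
import math


def bic(logl, num_params, sample_size):
    """Standard Bayesian information criterion: k * ln(n) - 2 * logl."""
    return num_params * math.log(sample_size) - 2.0 * logl


# ASSUMPTIONS (back-calculated, not taken from NetRAX or from this thread):
K_ASSUMED = 40          # number of free parameters of the 1-reticulation network
N_ASSUMED = 10 * 2000   # sample size taken as taxa * MSA sites

# (logl, bic) pairs as reported in the results above
reported = [
    (-15123.43304, 30643.00558),  # single start tree
    (-15123.32165, 30642.7828),   # multiple start trees
]

for logl, reported_bic in reported:
    computed = bic(logl, K_ASSUMED, N_ASSUMED)
    print(f"logl={logl}  computed bic={computed:.5f}  reported bic={reported_bic}")
```

Since the penalty term k * ln(n) is identical for two networks with the same number of parameters, the BIC difference between the single-start and multi-start results comes purely from the small logl difference.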

lutteropp avatar Aug 14 '21 12:08 lutteropp

I am also including PhyloDAG in this comparison. Here is the data to run PhyloDAG on the dataset: data_for_phylodag.zip

lutteropp avatar Aug 14 '21 12:08 lutteropp

The PhyloDAG inference already finished. It took 3.308089 mins, ran only single-threaded, and inferred this network, with 1 reticulation and loglikelihood -17771.85: Screenshot from 2021-08-14 14-32-09

lutteropp avatar Aug 14 '21 12:08 lutteropp

We need to also compare NetRAX and PhyloDAG on a larger dataset. Let's say 30 taxa, 3 reticulations. I am using the dataset from experiment D (the scrambling one) for it.

lutteropp avatar Aug 14 '21 12:08 lutteropp

In this archive, we have:

  • 0_0.nex: The input file for PhyloDAG for the 10 taxa 1 reticulation dataset
  • D.nex: The input file for PhyloDAG for the 30 taxa 3 reticulations dataset
  • simulated_0_0.R: The R script calling PhyloDAG on the 10 taxa 1 reticulation dataset
  • D.R: The R script calling PhyloDAG on the 30 taxa 3 reticulations dataset

data_for_phylodag_2.zip

lutteropp avatar Aug 14 '21 13:08 lutteropp

I aborted the 30 taxa 3 reticulations run on PhyloDAG since it kept running for ages. Trying with a newly simulated 20 taxa 2 reticulations 4k MSA sites dataset now:

phylodag_data_20t2r.zip

lutteropp avatar Aug 14 '21 14:08 lutteropp

Very interesting! PhyloDAG on the 20 taxa 2 reticulations dataset finished, and its result is really bad: Total runtime: 20.42073 mins

Inferred network picture: 20t2r_phylodag_network

lutteropp avatar Aug 14 '21 14:08 lutteropp

NetRAX results on the PhD laptop for the simulated 20-taxon 2-reticulation dataset:

  • Start from RAxML-NG best ML tree, with LikelihoodModel.AVERAGE: Total inference runtime: 63.0 seconds. Best inferred network has 2 reticulations, logl = -47915.4701, bic = 96756.70232 20t_2r_single_average_result.txt

  • Start from all unique trees with 10 random and 10 parsimony, with LikelihoodModel.AVERAGE: 16 unique start tree topologies. Total inference runtime: 1475.0 seconds. Best inferred network has 2 reticulations, logl = -47914.83028, bic = 96755.42267 20t_2r_multi_average_result.txt

  • Start from RAxML-NG best ML tree, with LikelihoodModel.BEST: Total inference runtime: 51.0 seconds. Best inferred network has 2 reticulations, logl = -47915.54511, bic = 96756.85233 20t_2r_single_best_result.txt

  • Start from all unique trees with 10 random and 10 parsimony, with LikelihoodModel.BEST: 16 unique start tree topologies. Total inference runtime: 1031.0 seconds. Best inferred network has 2 reticulations, logl = -47914.7023, bic = 96755.16672 20t_2r_multi_best_result.txt

lutteropp avatar Aug 14 '21 15:08 lutteropp

I retried PhyloDAG with their default parameters (previously, I had used the parameters stated in their example file). This time, I got:

  • For the 10 taxa, 1 reticulation, 2k MSA sites simulated dataset: Total runtime: 2.415105 mins. Loglikelihood: -15794.61. phylodag_10t1r_default

  • For the 20 taxa, 2 reticulations, 4k MSA sites simulated dataset: Total runtime: 16.88632 mins. Loglikelihood: -60021.8. phylodag_20t2r_default

lutteropp avatar Aug 14 '21 16:08 lutteropp

RAxML-NG best tree ML inference runtime (starting from 10 random + 10 parsimony trees) on the PhD laptop was:

  • for 10 taxa, 1 reticulation: 5.688 seconds
  • for 20 taxa, 2 reticulations: 49.811 seconds

lutteropp avatar Aug 14 '21 16:08 lutteropp

I hand-wrote the Extended NEWICK for the PhyloDAG network, for the 10 taxa 1 reticulation dataset, using its default parameters: phylodag_10t1r_inferred_network.txt

lutteropp avatar Aug 14 '21 16:08 lutteropp

Re-running PhyloDAG with the same parameters gives me totally different networks every time.

lutteropp avatar Aug 14 '21 16:08 lutteropp

I also started yet another PhyLiNC inference run on the simulated 10 taxa 1 reticulation dataset, this time telling it that the maximum number of reticulations to try is 1. It is still running; I expect it to take multiple hours, but less than a day, on the PhD laptop.

lutteropp avatar Aug 14 '21 16:08 lutteropp

The PhyLiNC output with maximum number of reticulations set to 1, this time without any weird error messages: phylinc_output_maxret_1.txt

lutteropp avatar Aug 15 '21 06:08 lutteropp

I hate extra work, but it would be awesome if we also compared with SNAQ, PhyloNet ML, and PhyloNet PseudoML on our simulated data. Instead of an MSA, these tools require a set of gene trees. Since we have very few "genes" here (just 2^num_reticulations), I expect the tools to be pretty fast.

As a first step for these inferences, I am inferring the "gene trees" with RAxML-NG, using the PhD laptop.

lutteropp avatar Aug 15 '21 07:08 lutteropp

First, the per-gene MSAs, built through variations of this very nice and useful command: awk '{if(/^>/)print $0; else print substr($0,1,1000)}' 20t_2r_msa.txt > 20t_2r_gene1_msa.txt

For the 10 taxa, 1 reticulation dataset: 10t_1r_gene2_msa.txt 10t_1r_gene1_msa.txt

For the 20 taxa, 2 reticulations dataset: 20t_2r_gene4_msa.txt 20t_2r_gene3_msa.txt 20t_2r_gene2_msa.txt 20t_2r_gene1_msa.txt
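
The awk one-liner above keeps the first 1000 columns of every sequence. A Python sketch of the same idea, producing all blocks at once, could look like this. It assumes that the "genes" are consecutive, equally long column blocks of a FASTA alignment (the thread only shows the command for the first block), and the function names read_fasta, split_msa, and to_fasta are hypothetical helpers, not part of any tool mentioned here.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def split_msa(records, block_len):
    """Cut an alignment into consecutive column blocks of block_len sites.

    ASSUMPTION: each "gene" is one such block, in order along the MSA.
    """
    n_sites = len(records[0][1])
    blocks = []
    for start in range(0, n_sites, block_len):
        block = [(name, seq[start:start + block_len]) for name, seq in records]
        blocks.append(block)
    return blocks


def to_fasta(records):
    """Serialize (name, sequence) pairs back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Tiny toy alignment, split into blocks of 3 columns
toy = read_fasta([">a", "ACGT", "AC", ">b", "TTTTGG"])
for i, block in enumerate(split_msa(toy, 3), start=1):
    print(f"gene {i}:")
    print(to_fasta(block), end="")
```

With block_len = 1000, a 2000-site alignment yields 2 blocks and a 4000-site alignment yields 4 blocks, matching the number of per-gene MSA files listed above.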

lutteropp avatar Aug 15 '21 07:08 lutteropp

These are the "gene trees" inferred by RAxML-NG, and the logfiles:

For the 10 taxa, 1 reticulation dataset: 10t_1r_gene2_msa.txt.raxml.bestTree.txt 10t_1r_gene1_msa.txt.raxml.bestTree.txt

10t_1r_gene2_msa.txt.raxml.log.txt 10t_1r_gene1_msa.txt.raxml.log.txt

For the 20 taxa, 2 reticulations dataset: 20t_2r_gene4_msa.txt.raxml.bestTree.txt 20t_2r_gene3_msa.txt.raxml.bestTree.txt 20t_2r_gene2_msa.txt.raxml.bestTree.txt 20t_2r_gene1_msa.txt.raxml.bestTree.txt

20t_2r_gene4_msa.txt.raxml.log.txt 20t_2r_gene3_msa.txt.raxml.log.txt 20t_2r_gene2_msa.txt.raxml.log.txt 20t_2r_gene1_msa.txt.raxml.log.txt

Total RAxML inference runtimes for the "gene trees":

  • For the 10 taxa, 1 reticulation dataset: 2.416 seconds + 2.164 seconds = 4.58 seconds
  • For the 20 taxa, 2 reticulations dataset: 9.824 seconds + 20.497 seconds + 18.037 seconds + 9.495 seconds = 57.853 seconds

lutteropp avatar Aug 15 '21 08:08 lutteropp

Apparently SNAQ requires a set of gene trees in 1 file, and 1 start tree. So here's the input data for SNAQ:

lutteropp avatar Aug 15 '21 08:08 lutteropp

Here are the SNAQ results for the 10 taxa 1 reticulation dataset.

lutteropp avatar Aug 15 '21 11:08 lutteropp

lutteropp avatar Aug 15 '21 13:08 lutteropp

Turns out both SNAQ and PhyLiNC overestimate the number of reticulations if I tell them to try for at most 2 reticulations.

lutteropp avatar Aug 15 '21 13:08 lutteropp

NEXUS Submission files for PhyloNET, for the 10 taxa 1 reticulation dataset: 10_1r_phylonet_submission_files.zip

lutteropp avatar Aug 15 '21 13:08 lutteropp

PhyloNET MPL (Maximum Pseudolikelihood) results for the 10 taxa 1 reticulation dataset:

PhyloNET ML (Maximum Likelihood) results for the 10 taxa 1 reticulation dataset:

lutteropp avatar Aug 15 '21 14:08 lutteropp

The judge results for all networks we have so far on the 10 taxa 1 reticulation dataset (with SNAQ, I had to manually fix the inferred 2-reticulation network that had one reticulation with probability 0/1):

  • judge_phylonet_ml_maxret_1.txt
  • judge_phylonet_mpl_maxret_2.txt
  • judge_phylonet_mpl_maxret_1.txt
  • judge_snaq_maxret_2.txt
  • judge_snaq_maxret_1.txt
  • judge_phylodag.txt
  • judge_phylinc_maxret_2.txt
  • judge_phylinc_maxret_1.txt
  • judge_netrax_single_average.txt
  • judge_netrax_multi_average.txt
  • judge_netrax_multi_best.txt
  • judge_netrax_single_best.txt

lutteropp avatar Aug 15 '21 17:08 lutteropp

PhyloNET ML with 2 reticulations max on the 10 taxa 1 reticulation dataset finished its first out of 5 runs on the PhD laptop (it inferred a 2-reticulation network). It took 3 hours for that single run, already running in parallel with 4 threads! Thus, this inference will likely be finished in about 12 hours from now.

lutteropp avatar Aug 15 '21 19:08 lutteropp

No more progress on the PhyloNET ML run with 2 reticulations max. I cannot tell whether it got stuck in an endless loop or something similar, because it does not print any progress output to the command line.

lutteropp avatar Aug 16 '21 13:08 lutteropp

And the judge results for PhyloNET MP: judge_phylonet_mp_maxret_1.txt judge_phylonet_mp_maxret_2.txt

lutteropp avatar Aug 16 '21 16:08 lutteropp