beast2 icon indicating copy to clipboard operation
beast2 copied to clipboard

Trie in State can lead to memory leak

Open rbouckaert opened this issue 4 years ago • 2 comments

The trie in the State is a data structure that stores paths through the calculation graph based on which set of StateNodes are changed by operators. Some models have operators that select random subsets of StateNodes, leading to uncontrolled growth of the trie. A crude solution used in StartBeast2 is to mark all StateNodes (that are gene trees) as dirty, so only a single entry in the trie is created, at the cost of having the state and calculation nodes doing more work checking whether things should be recalculated.

Possible solutions:

  • remove the trie and recalculate the path after every operator proposal. A quick run with StarBeast3 shows this increases runtime from about 3h/million states to 3h30/million states, so can be quite inefficient.
  • automatically limit the size of the trie by removing entries when the trie becomes too large or just stop adding new entries after a set size is reached (since the random subset operators will be the only ones adding new entries after some time, but the non-random subset ones will be called before the time limit is reached, perhaps this is the most efficient without developer intervention).
  • let the operator signal that the path should not be added to the trie programmatically.

The first two require no developer knowledge, but are possibly less efficient than the last solution. Irrespective of implementing the 3rd solution, the 2nd seems a good way to limit the amount of memory gobbled up by the trie.

rbouckaert avatar Jul 20 '21 22:07 rbouckaert

Hi Remco, just wanted to chime in with a vote for any solution which avoids affecting analyses which don't include such operators. It'd be a shame to slow things down even marginally for the sake of this edge case; your trie works really well everywhere else!

tgvaughan avatar Jul 21 '21 08:07 tgvaughan

Hi Tim, I agree. From what I've seen, most operators apply to a single StateNode, and there will only be a single path associated with such single StateNode update. This means there are typically only a limited number of paths. I set the upper limit to 1000 for now. Do you think there are analyses that require more than a 1000 unique calculation node paths? What makes a reasonable upper bound?

rbouckaert avatar Jul 21 '21 19:07 rbouckaert