tsinfer Identifying Error from tsinfer Trees

A medium- to long-term goal is to see whether we can pick apart the precise signal of error in inferred tree sequences. Sites with hundreds of mutations are almost certainly erroneous, but what about sample mutations and long edges leading directly to the root? tsdate might be the better place for this issue, given that it relies on mutations from tsinfer, but I thought I'd place it here for future discussion.
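As a starting point for the "hundreds of mutations" case, here is a minimal sketch that counts mutations per site and flags the heavy ones. It assumes the input is a plain sequence of site IDs, one per mutation (for example, the `site` column of a tskit mutation table); the function name `sites_with_many_mutations` and the default threshold of 100 are made up for illustration, not anything tsinfer or tsdate defines.

```python
from collections import Counter


def sites_with_many_mutations(mutation_sites, threshold=100):
    """Return {site_id: mutation_count} for sites with at least `threshold` mutations.

    `mutation_sites` holds one site ID per mutation.
    Both the function name and the threshold are hypothetical.
    """
    counts = Counter(mutation_sites)
    return {site: n for site, n in counts.items() if n >= threshold}


# Toy input: site 0 has 150 mutations, site 1 has 3, site 2 has 100.
toy_mutation_sites = [0] * 150 + [1] * 3 + [2] * 100
suspicious = sites_with_many_mutations(toy_mutation_sites)
```

This only covers the easy case; sample mutations and long root edges would need a different kind of check.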

Dec 15 '20 21:12 awohns

Having a decent metric for what counts as an error, and how to spot one, might also be a good way of testing for the optimal values of the mismatch parameter.

Dec 15 '20 21:12 hyanwong

Some thoughts I had: the error signal is dependent on the error model. It might be good to get a feel for the problem by starting with some simulations and a simple error model, such as adding sites where the genotype is random, or random for a subset of samples.
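A rough sketch of that kind of error model, in plain Python so it doesn't depend on any particular simulator. The assumptions here are mine: genotypes are biallelic (0 = ancestral, 1 = derived) and stored as one list per site; in the "subset" case the samples outside the subset are set to the ancestral allele; and `add_random_sites` and its parameters are hypothetical names.

```python
import random


def add_random_sites(genotypes, num_error_sites, subset_fraction=1.0, seed=None):
    """Return a copy of `genotypes` with extra random "error" sites appended.

    genotypes: list of per-site lists of 0/1 alleles, all the same length.
    subset_fraction: fraction of samples whose allele is drawn at random at
        each error site; the remaining samples get the ancestral allele 0
        (an assumption of this sketch).
    """
    if not genotypes:
        raise ValueError("need at least one existing site to infer the sample count")
    rng = random.Random(seed)
    num_samples = len(genotypes[0])
    num_noisy = max(1, round(subset_fraction * num_samples))
    out = [list(row) for row in genotypes]  # leave the input untouched
    for _ in range(num_error_sites):
        row = [0] * num_samples
        # Randomise the allele for a random subset of the samples.
        for j in rng.sample(range(num_samples), num_noisy):
            row[j] = rng.randint(0, 1)
        out.append(row)
    return out


base = [[0, 1, 0, 1], [1, 1, 0, 0]]
noisy = add_random_sites(base, num_error_sites=3, subset_fraction=0.5, seed=1)
```

With `subset_fraction=1.0` every sample gets a random allele at the new sites; smaller values give the "random for a subset of samples" variant. The noisy matrix could then be fed to inference and compared against the result from `base`.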

Dec 16 '20 00:12 benjeffery

For this issue I think it would be really useful to dig into https://github.com/tskit-dev/tsinfer/issues/568 to start with.

Oct 26 '22 10:10 hyanwong