Discrepancy Function
This discrepancy functions gives us a metric for comparing two tree sequences.
The tree_discrepancy method returns for the discrepancy between tsa and tsb, the tuple of
a. the total shared span divided by total node span in tsa (which we need to compute); so this is proprotion of the span in tsa that is accurately represented in tsb; and
b. the root-mean-squared discrepancy in time, with the average weighted by span in tsa.
Awesome, let me know when you want feedback.
So we need to pass over edges and pass over roots. To do so we should write a separate function. Proposal: Need to get per node total span in ts ( ie. ) w_i in sum (t_x-t_y)**2 w_i/ sum (w_i) count up in each node the total span from where it is a child which we can get from the edge table np.bincount... doesn't account for nodes which are roots so we need to include contribution of nodes who are roots.
def node_spans(ts):
"""
Returns the array of "node spans", i.e., the `j`th entry gives
the total span over which node `j` is in the tree (i.e., does
not have 'missing data' there).
"""
child_spans = np.bincount(
ts.edges_child,
weights=ts.edges_right - ts.edges_left,
minlength=ts.num_nodes,
)
for t in ts.trees():
span = t.interval[1] - t.interval[0]
for r in t.roots:
# do this check to exempt 'missing data'
if t.num_children[r] > 0:
child_spans[r] += span
I think that is everything we need for tree_discrepancy. I think I will be ready for feedback now @nspope .
Oh sorry I just saw this! Will look ASAP.
I still haven't checked if the tests from test_evaluation.py have passed yet, but I am working on trying to do that.
Hold off @nspope - we looked at this today and @hfr1tz3 has homework yet.
All tests in test_evaluation.py for tree_discrepancy now pass!
@petrelharp does this mean we are ready for review?
I have some minor suggestions, then yes - ready for review!
Have a look, @nspope ?
Nevermind, I misunderstood on my initial read-through. Looks great!
The only real suggestion I have is to rewrite the naive test suite implementation to be more naive (e.g. clearer on a quick glance). That is, don't use sparse matrix operations, but instead loop over rows of shared_spans, find the set of nodes matching the row max with np.where, get the age difference for these, and append the node with the min age difference to a list.
Nevermind, I misunderstood on my initial read-through. Looks great!
Perhaps a short explanatory comment could clarify how it works - I thought it was wrong last time I looked at it also. =)
We've got line ending problems; for the record here's what I did to fix them:
sed -i 's/^M$//' tsdate/evaluation.py
sed -i 's/^M$//' tests/test_evaluation.py
Is this ready for merging? I'm happy to do so if @petrelharp and/or @nspope give the OK (assume it's fine for the moment in the tsdate repo). If we merge it this week, it should make it into the next release (although I don't think it's a documented thing yet anyway, right?)
I have a feeling it might require a tskit release too, though?
I think this PR is waiting on extend_edges for the tests to pass? Which might now be merged into the github head for tskit. But maybe we should wait until this is in a release -- I don't think there's an immediate rush.
It is merged to tskit, but we need a release.
Note: we need to ensure that the discrepancy function and time RMSE are calculated correctly when nodes have no matches. Currently, the tie would be broken by comparing to all nodes in the tree sequence (we think?). @hfr1tz3 will fix
Note that when we add stuff to a tree sequence we add both right and wrong stuff; if the proprotion of wrong stuff we add is greater than the proportion of wrong stuff already there, we increase discrepancy, even if this proportion is very low. (Example: extending a true but simplified tree sequence.)
So, proposal is that tree_discrepancy(a, b) also returns the proportion of the total span of b that is matched in a; this would be
(1 - discrepancy) * total_span(a) / total_span(b)
Also, a evaluation.total_span(ts) function that just does np.sum(evaluation.node_spans(ts)).
I think the code does what we want but the docs had it the other way around? I could be mixed up, though - please check?
Okay - if you agree with my suggested changes, then please hit the "commit change" button on them.
TODO:
- squash commits
- rebase to
main - double check things are ready to merge
However, this currently uses extend_edges, which we are renaming extend_haplotypes; so let's (a) make the change to extend_haplotypes; (b) check the tests run locally; (c) mark skip them so this passes; (d) make an issue to un-skip these when a new tskit with extend_haplotypes is released.
Moved to #428