tsdate Discrepancy Function

This discrepancy functions gives us a metric for comparing two tree sequences. The tree_discrepancy method returns for the discrepancy between tsa and tsb, the tuple of a. the total shared span divided by total node span in tsa (which we need to compute); so this is proprotion of the span in tsa that is accurately represented in tsb; and b. the root-mean-squared discrepancy in time, with the average weighted by span in tsa.

Sep 22 '23 22:09 hfr1tz3

Awesome, let me know when you want feedback.

Sep 25 '23 18:09 nspope

So we need to pass over edges and pass over roots. To do so we should write a separate function. Proposal: Need to get per node total span in ts ( ie. ) w_i in sum (t_x-t_y)**2 w_i/ sum (w_i) count up in each node the total span from where it is a child which we can get from the edge table np.bincount... doesn't account for nodes which are roots so we need to include contribution of nodes who are roots.

Sep 25 '23 20:09 hfr1tz3


def node_spans(ts):
    """
    Returns the array of "node spans", i.e., the `j`th entry gives
    the total span over which node `j` is in the tree (i.e., does
    not have 'missing data' there).
    """
    child_spans = np.bincount(
            ts.edges_child,
            weights=ts.edges_right - ts.edges_left,
            minlength=ts.num_nodes,
    )

    for t in ts.trees():
        span = t.interval[1] - t.interval[0]
        for r in t.roots:
            # do this check to exempt 'missing data'
            if t.num_children[r] > 0:
                child_spans[r] += span

Oct 02 '23 18:10 petrelharp

I think that is everything we need for tree_discrepancy. I think I will be ready for feedback now @nspope .

Oct 02 '23 22:10 hfr1tz3

Oh sorry I just saw this! Will look ASAP.

Oct 09 '23 19:10 nspope

I still haven't checked if the tests from test_evaluation.py have passed yet, but I am working on trying to do that.

Oct 09 '23 20:10 hfr1tz3

Hold off @nspope - we looked at this today and @hfr1tz3 has homework yet.

Oct 09 '23 23:10 petrelharp

All tests in test_evaluation.py for tree_discrepancy now pass! @petrelharp does this mean we are ready for review?

Oct 25 '23 21:10 hfr1tz3

I have some minor suggestions, then yes - ready for review!

Oct 26 '23 03:10 petrelharp

Have a look, @nspope ?

Oct 27 '23 20:10 petrelharp

Nevermind, I misunderstood on my initial read-through. Looks great!

The only real suggestion I have is to rewrite the naive test suite implementation to be more naive (e.g. clearer on a quick glance). That is, don't use sparse matrix operations, but instead loop over rows of shared_spans, find the set of nodes matching the row max with np.where, get the age difference for these, and append the node with the min age difference to a list.

Oct 27 '23 21:10 nspope

Nevermind, I misunderstood on my initial read-through. Looks great!

Perhaps a short explanatory comment could clarify how it works - I thought it was wrong last time I looked at it also. =)

Oct 27 '23 21:10 petrelharp

We've got line ending problems; for the record here's what I did to fix them:

sed -i 's/^M$//' tsdate/evaluation.py
sed -i 's/^M$//' tests/test_evaluation.py

Oct 30 '23 18:10 petrelharp

Is this ready for merging? I'm happy to do so if @petrelharp and/or @nspope give the OK (assume it's fine for the moment in the tsdate repo). If we merge it this week, it should make it into the next release (although I don't think it's a documented thing yet anyway, right?)

I have a feeling it might require a tskit release too, though?

Nov 08 '23 12:11 hyanwong

I think this PR is waiting on extend_edges for the tests to pass? Which might now be merged into the github head for tskit. But maybe we should wait until this is in a release -- I don't think there's an immediate rush.

Nov 08 '23 17:11 nspope

It is merged to tskit, but we need a release.

Nov 14 '23 05:11 petrelharp

Note: we need to ensure that the discrepancy function and time RMSE are calculated correctly when nodes have no matches. Currently, the tie would be broken by comparing to all nodes in the tree sequence (we think?). @hfr1tz3 will fix

Dec 11 '23 19:12 nspope

Note that when we add stuff to a tree sequence we add both right and wrong stuff; if the proprotion of wrong stuff we add is greater than the proportion of wrong stuff already there, we increase discrepancy, even if this proportion is very low. (Example: extending a true but simplified tree sequence.)

So, proposal is that tree_discrepancy(a, b) also returns the proportion of the total span of b that is matched in a; this would be

(1 - discrepancy) * total_span(a) / total_span(b)

Feb 21 '24 19:02 petrelharp

Also, a evaluation.total_span(ts) function that just does np.sum(evaluation.node_spans(ts)).

Feb 21 '24 19:02 petrelharp

I think the code does what we want but the docs had it the other way around? I could be mixed up, though - please check?

Feb 23 '24 00:02 petrelharp

Okay - if you agree with my suggested changes, then please hit the "commit change" button on them.

Feb 23 '24 05:02 petrelharp

TODO:

squash commits
rebase to main
double check things are ready to merge

However, this currently uses extend_edges, which we are renaming extend_haplotypes; so let's (a) make the change to extend_haplotypes; (b) check the tests run locally; (c) mark skip them so this passes; (d) make an issue to un-skip these when a new tskit with extend_haplotypes is released.

Aug 09 '24 22:08 petrelharp

Moved to #428

Feb 27 '25 08:02 hyanwong