tsdate icon indicating copy to clipboard operation
tsdate copied to clipboard

Discrepancy Function

Open hfr1tz3 opened this issue 2 years ago • 22 comments

This discrepancy functions gives us a metric for comparing two tree sequences. The tree_discrepancy method returns for the discrepancy between tsa and tsb, the tuple of a. the total shared span divided by total node span in tsa (which we need to compute); so this is proprotion of the span in tsa that is accurately represented in tsb; and b. the root-mean-squared discrepancy in time, with the average weighted by span in tsa.

hfr1tz3 avatar Sep 22 '23 22:09 hfr1tz3

Awesome, let me know when you want feedback.

nspope avatar Sep 25 '23 18:09 nspope

So we need to pass over edges and pass over roots. To do so we should write a separate function. Proposal: Need to get per node total span in ts ( ie. ) w_i in sum (t_x-t_y)**2 w_i/ sum (w_i) count up in each node the total span from where it is a child which we can get from the edge table np.bincount... doesn't account for nodes which are roots so we need to include contribution of nodes who are roots.

hfr1tz3 avatar Sep 25 '23 20:09 hfr1tz3


def node_spans(ts):
    """
    Returns the array of "node spans", i.e., the `j`th entry gives
    the total span over which node `j` is in the tree (i.e., does
    not have 'missing data' there).
    """
    child_spans = np.bincount(
            ts.edges_child,
            weights=ts.edges_right - ts.edges_left,
            minlength=ts.num_nodes,
    )

    for t in ts.trees():
        span = t.interval[1] - t.interval[0]
        for r in t.roots:
            # do this check to exempt 'missing data'
            if t.num_children[r] > 0:
                child_spans[r] += span

petrelharp avatar Oct 02 '23 18:10 petrelharp

I think that is everything we need for tree_discrepancy. I think I will be ready for feedback now @nspope .

hfr1tz3 avatar Oct 02 '23 22:10 hfr1tz3

Oh sorry I just saw this! Will look ASAP.

nspope avatar Oct 09 '23 19:10 nspope

I still haven't checked if the tests from test_evaluation.py have passed yet, but I am working on trying to do that.

hfr1tz3 avatar Oct 09 '23 20:10 hfr1tz3

Hold off @nspope - we looked at this today and @hfr1tz3 has homework yet.

petrelharp avatar Oct 09 '23 23:10 petrelharp

All tests in test_evaluation.py for tree_discrepancy now pass! @petrelharp does this mean we are ready for review?

hfr1tz3 avatar Oct 25 '23 21:10 hfr1tz3

I have some minor suggestions, then yes - ready for review!

petrelharp avatar Oct 26 '23 03:10 petrelharp

Have a look, @nspope ?

petrelharp avatar Oct 27 '23 20:10 petrelharp

Nevermind, I misunderstood on my initial read-through. Looks great!

The only real suggestion I have is to rewrite the naive test suite implementation to be more naive (e.g. clearer on a quick glance). That is, don't use sparse matrix operations, but instead loop over rows of shared_spans, find the set of nodes matching the row max with np.where, get the age difference for these, and append the node with the min age difference to a list.

nspope avatar Oct 27 '23 21:10 nspope

Nevermind, I misunderstood on my initial read-through. Looks great!

Perhaps a short explanatory comment could clarify how it works - I thought it was wrong last time I looked at it also. =)

petrelharp avatar Oct 27 '23 21:10 petrelharp

We've got line ending problems; for the record here's what I did to fix them:

sed -i 's/^M$//' tsdate/evaluation.py
sed -i 's/^M$//' tests/test_evaluation.py 

petrelharp avatar Oct 30 '23 18:10 petrelharp

Is this ready for merging? I'm happy to do so if @petrelharp and/or @nspope give the OK (assume it's fine for the moment in the tsdate repo). If we merge it this week, it should make it into the next release (although I don't think it's a documented thing yet anyway, right?)

I have a feeling it might require a tskit release too, though?

hyanwong avatar Nov 08 '23 12:11 hyanwong

I think this PR is waiting on extend_edges for the tests to pass? Which might now be merged into the github head for tskit. But maybe we should wait until this is in a release -- I don't think there's an immediate rush.

nspope avatar Nov 08 '23 17:11 nspope

It is merged to tskit, but we need a release.

petrelharp avatar Nov 14 '23 05:11 petrelharp

Note: we need to ensure that the discrepancy function and time RMSE are calculated correctly when nodes have no matches. Currently, the tie would be broken by comparing to all nodes in the tree sequence (we think?). @hfr1tz3 will fix

nspope avatar Dec 11 '23 19:12 nspope

Note that when we add stuff to a tree sequence we add both right and wrong stuff; if the proprotion of wrong stuff we add is greater than the proportion of wrong stuff already there, we increase discrepancy, even if this proportion is very low. (Example: extending a true but simplified tree sequence.)

So, proposal is that tree_discrepancy(a, b) also returns the proportion of the total span of b that is matched in a; this would be

(1 - discrepancy) * total_span(a) / total_span(b)

petrelharp avatar Feb 21 '24 19:02 petrelharp

Also, a evaluation.total_span(ts) function that just does np.sum(evaluation.node_spans(ts)).

petrelharp avatar Feb 21 '24 19:02 petrelharp

I think the code does what we want but the docs had it the other way around? I could be mixed up, though - please check?

petrelharp avatar Feb 23 '24 00:02 petrelharp

Okay - if you agree with my suggested changes, then please hit the "commit change" button on them.

petrelharp avatar Feb 23 '24 05:02 petrelharp

TODO:

  • squash commits
  • rebase to main
  • double check things are ready to merge

However, this currently uses extend_edges, which we are renaming extend_haplotypes; so let's (a) make the change to extend_haplotypes; (b) check the tests run locally; (c) mark skip them so this passes; (d) make an issue to un-skip these when a new tskit with extend_haplotypes is released.

petrelharp avatar Aug 09 '24 22:08 petrelharp

Moved to #428

hyanwong avatar Feb 27 '25 08:02 hyanwong