tsdate Comparing dates between simulated and inferred tree sequences

For tree sequences inferred from simulations, we can't directly compare inferred dates to true dates because node sets differ between the inference and the simulation.

Instead, we've been looking at summary statistics such as the distribution of pairwise coalescence times, relative to what is expected under the coalescent with recombination. Unfortunately, time-indexed statistics are likely to be biased if there are polytomies in the inferred tree sequence, because the degrees of the polytomies are likely to depend on node ages. It's not clear to me how to systematically characterize/correct for these biases, so as to assess whether or not the inferred tree sequence has reasonable estimates for node dates. For example, polytomies may contribute to the distortions seen in https://github.com/tskit-dev/tsdate/issues/198#issuecomment-1631224197. This seems especially problematic for comparing tsdate to other inference tools like Relate that only output binary trees. Also, marginal statistics are a pretty indirect way to get at what we've after (node-level predictions).

@petrelharp and @hfr1tz3 pointed me towards a tree discrepancy metric they developed in https://github.com/petrelharp/num_edges/blob/main/Discrepancy%20Function.ipynb, that gives a meaningful way to compare ages across different tree sequences by finding a mapping between nodes. For a given node in the inferred tree sequence, the idea is to find the node in the true tree sequence that shares the same set of descendants over the longest shared span. This is something like finding the "true" ancestral segment that is most similar to a given inferred ancestral segment.

This seems to work quite well on a small (1Mb, 200 samples) example, plotting the inferred ages against the ages of the "most similar" nodes from the simulation:

and suggests that tsdate is doing better than we'd judge from the global statistics in https://github.com/tskit-dev/tsdate/issues/198#issuecomment-1631224197. It'd be great to make this fast enough to work with larger simulations. Code:

import msprime
import tsinfer
import tsdate
import tskit
import numpy as np
import matplotlib.pyplot as plt

def match_node(node, inferred_ts, true_ts):
    """
    For `node` in `inferred_ts`, return the age of the node in `true_ts` 
    that shares the longest span with the same set of descendant samples
    as `node`. From:

    https://github.com/petrelharp/num_edges/blob/main/Discrepancy%20Function.ipynb
    """
    shared_span = np.zeros(true_ts.num_nodes)
    node_span = 0.0
    for interval, t1, t2 in inferred_ts.coiterate(true_ts):
        desc_set = set(t1.samples(node))
        if len(desc_set) > 0:
            tree_span = interval.span
            node_span += tree_span
            if len(desc_set) == 1:
                x = list(desc_set)[0]
                y = list(desc_set)[0]
            else:
                x = t1.mrca(*list(desc_set))
                y = t2.mrca(*list(desc_set))
            if set(t2.samples(y)) == desc_set:
                shared_span[y] += tree_span
                y = t2.parent(y)
                if node != x:
                    while (
                        y != tskit.NULL and 
                        set(t2.samples(y)) == desc_set
                    ):
                        shared_span[y] += tree_span
                        y = t2.parent(y)
    assert np.max(shared_span) <= node_span
    node_match = np.argmax(shared_span)
    return (
        true_ts.nodes_time[node_match], 
        shared_span[node_match] / node_span
    )

def compare_node_ages(inferred_ts, true_ts):
    """
    For each node in `inferred_ts`, return: the estimate age; the age of the
    best-matching node in `true_ts`; and the proportion of the span of the
    inferred node over which the matching node has the same set of descendant
    samples.

    **SLOW**
    """
    assert inferred_ts.num_samples == true_ts.num_samples
    inferred_time = inferred_ts.tables.nodes.time
    matched_time = np.zeros(inferred_ts.num_nodes)
    shared_span = np.ones(inferred_ts.num_nodes)
    for i in range(inferred_ts.num_samples, inferred_ts.num_nodes):
        matched_time[i], shared_span[i] = match_node(i, inferred_ts, true_ts)
    return inferred_time, matched_time, shared_span

# ---- simulated example ----

ts = msprime.sim_ancestry(
    samples=100, sequence_length=1e6, population_size=1e4, 
    recombination_rate=1e-8, coalescing_segments_only=False, 
    random_seed=1024,
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1024)
sample_data = tsinfer.SampleData.from_tree_sequence(ts)
ets = tsinfer.infer(sample_data).simplify(filter_sites=False)
ets = tsdate.date(
    ets, method='variational_gamma', mutation_rate=1e-8, 
    population_size=1e4, progress=True, global_prior=True,
)

inferred_times, true_times, shared_spans = compare_node_ages(ets, ts)

plt.scatter(
    np.log10(true_times[shared_spans > 0]),
    np.log10(inferred_times[shared_spans > 0]), 
    s=4, c='firebrick'
)
plt.axline((0,0), slope=1, c="black")
plt.xlabel("Min-discrepancy true time (log)")
plt.ylabel("Inferred time (log)")
plt.savefig("min_disc_time.png")

Jul 22 '23 06:07 nspope

This is very interesting, thanks @nspope. I'm thinking that one of the strategies of this method is not to look at all the nodes, but simply focus upon those present in the inferred tree sequence. However, in in inferred TS with no mismatch, each mutation should be the source of a single ancestor, so this may not be that different to identifying nodes by the mutations directly above them? Maybe we should plot those as different colours on the plot above, to contrast?

It might be worth summarising discrepancies from the predicted by looking at the histogram too (as in the Brandt paper). I wonder if it would be helpful to plot this?

Jul 22 '23 11:07 hyanwong

so this may not be that different to identifying nodes by the mutations directly above them

Yes, I think that's right -- should coincide closely in this case. Though, it's useful to get a measure of similarity (between query node and "best-matching" true node), because these may be used to filter / weight the result.

It might be worth summarising discrepancies from the predicted by looking at the histogram too (as in the Brandt paper). I wonder if it would be helpful to plot this?

Yes, although I think that when nodes are omitted due to polytomies it'll create artefacts in marginal statistics that don't really reflect dating error. For example, if calculating the distribution of pair coalescence times, a tree with large polytomies should have a dearth of younger pair-times relative to a similar binary tree. That's not really an issue with dating per se, as the node ages themselves could be totally reasonable.

Jul 22 '23 16:07 nspope

so this may not be that different to identifying nodes by the mutations directly above them

Yes, I think that's right -- should coincide closely in this case.

Anyway, it's useful to have a measure that doesn't rely on having any mutations in the tree sequence; that is "branch only".

Jul 22 '23 17:07 hyanwong

As a side note, I could imagine doing something like this by (a) removing mutations from the original ts; (b) adding new mutations under infinite-sites with msprime to the original ts; (c) mapping the mutations to the new ts by parsimony; (d) identifying nodes this way (and in the process also finding out how often we need multiple mutations and where).

Jul 22 '23 19:07 petrelharp