tsinfer qc script

Open awohns opened this issue 5 years ago • 7 comments

A script to systematise qc checks of tree sequence quality would be really helpful. This could be a script called via the CLI 'tsinfer-qc' or something like that. We've found that the following are good indicators:

  • Number of edges (possibly ratio of edges to sites)
  • How often samples "go to the root"
  • High-frequency edges that span more of the genome than we would expect
  • Likely phasing errors and close breakpoints
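
As a rough illustration of the first and third indicators, here is a minimal sketch in plain Python. It is not an existing tsinfer feature: the `Edge` tuple, both function names, and the 50% span cutoff are assumptions made for this example. The `Edge` fields mirror the columns of a tree sequence edge table, so with tskit the same values could be read from `ts.edges()`, `ts.num_sites` and `ts.sequence_length`. The span check ignores edge frequency entirely, which is a deliberate simplification.

```python
from typing import NamedTuple, Sequence


class Edge(NamedTuple):
    # Hypothetical stand-in mirroring the columns of an edge table.
    left: float
    right: float
    parent: int
    child: int


def edges_per_site(num_edges: int, num_sites: int) -> float:
    """Ratio of edges to sites; NaN when there are no sites."""
    return num_edges / num_sites if num_sites > 0 else float("nan")


def long_edge_fraction(
    edges: Sequence[Edge], sequence_length: float, min_fraction: float = 0.5
) -> float:
    """Fraction of edges spanning at least min_fraction of the genome."""
    if not edges:
        return 0.0
    cutoff = min_fraction * sequence_length
    num_long = sum(1 for e in edges if e.right - e.left >= cutoff)
    return num_long / len(edges)


# Toy data: four edges on a genome of length 100 with 8 sites.
edges = [
    Edge(0, 100, 4, 0),
    Edge(0, 40, 4, 1),
    Edge(40, 100, 5, 1),
    Edge(0, 100, 5, 2),
]
ratio = edges_per_site(len(edges), num_sites=8)
long_frac = long_edge_fraction(edges, sequence_length=100, min_fraction=0.5)
print(ratio, long_frac)
```

What counts as a "bad" value for either number would need to be calibrated against trees of known quality; the sketch only shows how the quantities could be computed.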

I'm sure I'm missing some, @jeromekelleher and @hyanwong.

awohns avatar May 15 '20 12:05 awohns

I think it would be good to add this as another subcommand for the tsinfer program, which we'd run like

tsinfer qc <sample data file> <output tree sequence>

We'd then have various options for what it would output/compute. We might even consider having it output some plots as well as text analysis (I think it's fine if we depend on matplotlib).
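
To make the proposed invocation concrete, here is a minimal sketch of how such a subcommand could be wired up with the standard library's `argparse`. This is not the actual tsinfer CLI: the argument names `samples` and `tree_sequence`, the `--plots` flag, and the file names are all hypothetical and chosen only for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a single hypothetical 'qc' subcommand."""
    parser = argparse.ArgumentParser(prog="tsinfer")
    subparsers = parser.add_subparsers(dest="subcommand", required=True)

    qc = subparsers.add_parser(
        "qc", help="Report QC metrics for an inferred tree sequence"
    )
    qc.add_argument("samples", help="Sample data file")
    qc.add_argument("tree_sequence", help="Inferred tree sequence file")
    # Hypothetical flag: plotting is opt-in so text output stays the default.
    qc.add_argument(
        "--plots", action="store_true", help="Also write diagnostic plots"
    )
    return parser


# Parse an example command line instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(["qc", "data.samples", "data.trees", "--plots"])
print(args.subcommand, args.samples, args.tree_sequence, args.plots)
```

Keeping plots behind a flag would let matplotlib be imported only when requested, though whether that matters depends on how the dependency is declared.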

It's debatable whether we need the samples file. Maybe not, and we can just look at an input tree sequence.

jeromekelleher avatar May 15 '20 13:05 jeromekelleher

I've written a qc script that returns the number of close breakpoints and potential phasing switch errors for each individual. It runs like this:

python phasing_qc.py inferredtreesequence output_filename_prefix how_close_are_breakpoints individual_id_name_in_metadata
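
For readers unfamiliar with the idea, counting "close breakpoints" can be sketched as below. This is not the `phasing_qc.py` script mentioned above, whose implementation is not shown in this thread. It assumes each individual's breakpoint positions have already been extracted into a list, and the function name, the `max_distance` parameter, and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def count_close_breakpoints(breakpoints: Sequence[float], max_distance: float) -> int:
    """Count adjacent breakpoint pairs that are at most max_distance apart."""
    ordered = sorted(breakpoints)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if b - a <= max_distance)


# Toy input: breakpoint positions per individual (made-up values).
per_individual: Dict[str, List[float]] = {
    "ind_A": [100, 150, 5000, 5020, 9000],
    "ind_B": [2000, 8000],
}
counts = {
    name: count_close_breakpoints(bps, max_distance=100)
    for name, bps in per_individual.items()
}
print(counts)
```

Here `max_distance` plays the role of the `how_close_are_breakpoints` argument; an individual with many close pairs relative to the others would be a candidate for a closer look.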

I also have minimal unit testing of the output. Shall I make a PR @jeromekelleher ?

awohns avatar Jun 03 '20 08:06 awohns

Sure, let's take a look @awohns!

jeromekelleher avatar Jun 03 '20 16:06 jeromekelleher

I think @szhan would be well placed to take this on.

hyanwong avatar Oct 26 '22 11:10 hyanwong

fyi, I'm doing some of this very stuff right now on our mosquito trees. happy to share

gtsambos avatar Oct 26 '22 19:10 gtsambos

That would be great, thanks. Maybe you could sync up with @szhan to decide on some sensible metrics? It would be really useful to have some guidelines for other people here.

hyanwong avatar Oct 26 '22 19:10 hyanwong

Cool, will do. So far I've been looking mostly at polytomies and where they are on the trees, but some of the other things Wilder mentioned all this time ago are things we've been discussing too. @szhan, I'll send you some stuff in Slack a bit later

gtsambos avatar Oct 26 '22 19:10 gtsambos