ipyrad Popgen sumstats output

Over the past couple days I've gotten pretty handy with scikit-allel, a nice python package for generating simple popgen summary stats. I think it would be relatively straightforward to include a block in step6 to write these out to a file/dir. If we have population assignment files then we can even do Fst and Dxy. Many people i have talked to have requested this feature, so i think it could add a lot of value for folks, lots of bang for the buck.

Dec 04 '15 22:12 isaacovercast

Cool! I've been wrestling with whether or not we should use dask, and also hdf5, but if scikit-allel uses both and we're going to use that then I'll feel free to use either in the code. I had been thinking that both might be particularly useful for reconstructing all of the coverage data into a vcf file in step 7.

Having popgen stats functions would be great. Some ideas that come to mind:

Calculate sumstats in step7 to allow as a filter on assemblies (e.g., max Fst)
Create example notebooks of how to calculate stats from output vcf using pure scikit-allel.
Or, since RAD loci are much simpler than full genome data, we could create our own much simpler functions in the ipyrad.analysis module that use scikit-allel to calculate RAD specific values. This would be nice because it provides simpler stats functions for users who do not want to spend time learning scikit-allel, numpy arrays, hdf5, etc.

Dec 05 '15 18:12 dereneaton

Hi guys, just wondering if this was ever implemented. Thanks! iPyrad rocks!!

Apr 23 '20 22:04 vwishingrad

Hello @vwishingrad, yeah we haven't implemented this yet. It's still a good idea though! It just hasn't happened yet.... Thanks for the positive feedback! Glad you like ipyrad. Tell your friends! Hopefully we will implement the sumstats analysis tool some time in the near future.... Thanks again for your feedback.

Apr 23 '20 22:04 isaacovercast

Well, now there's an ipyrad.analysis.popgen tool. It's not perfect but it's a start. This will give you a bunch of information, still working on organizing it better:

Average pi, Watterson's theta, and Tajima's D per population
Per locus values for all these
Average Fst and Dxy among populations
Per locus Fst and Dxy among populations
A file in workdir per population for per site pi values within each locus

TODO:

clean up the cookbook notebook
Test more on real data.
Verify and ground-truth the calculations
Verify samples with missing data at a given locus behave themselves (simulated data is too clean)
Various plotting functions can be imagined
The way the stats are compiled on the engines and passed back to the main process is super whack (a massive ugly pickled dictionary), consider refactoring to remove ugliness.
The way the stats are compiled in the main process is ugly and whack and probably slow.

Feb 03 '21 10:02 isaacovercast

Hi, is it possible to use this tool at the moment? if so, how would one go about doing it? Thanks! and this would be an amazing tool once fully implemented!

May 02 '21 12:05 dorteneum