Popgen sumstats output
Over the past couple days I've gotten pretty handy with scikit-allel, a nice python package for generating simple popgen summary stats. I think it would be relatively straightforward to include a block in step6 to write these out to a file/dir. If we have population assignment files then we can even do Fst and Dxy. Many people i have talked to have requested this feature, so i think it could add a lot of value for folks, lots of bang for the buck.
Cool! I've been wrestling with whether or not we should use dask, and also hdf5, but if scikit-allel uses both and we're going to use that then I'll feel free to use either in the code. I had been thinking that both might be particularly useful for reconstructing all of the coverage data into a vcf file in step 7.
Having popgen stats functions would be great. Some ideas that come to mind:
- Calculate sumstats in step7 to allow as a filter on assemblies (e.g., max Fst)
- Create example notebooks of how to calculate stats from output vcf using pure scikit-allel.
- Or, since RAD loci are much simpler than full genome data, we could create our own much simpler functions in the
ipyrad.analysismodule that use scikit-allel to calculate RAD specific values. This would be nice because it provides simpler stats functions for users who do not want to spend time learning scikit-allel, numpy arrays, hdf5, etc.
Hi guys, just wondering if this was ever implemented. Thanks! iPyrad rocks!!
Hello @vwishingrad, yeah we haven't implemented this yet. It's still a good idea though! It just hasn't happened yet.... Thanks for the positive feedback! Glad you like ipyrad. Tell your friends! Hopefully we will implement the sumstats analysis tool some time in the near future.... Thanks again for your feedback.
Well, now there's an ipyrad.analysis.popgen tool. It's not perfect but it's a start. This will give you a bunch of information, still working on organizing it better:
- Average pi, Watterson's theta, and Tajima's D per population
- Per locus values for all these
- Average Fst and Dxy among populations
- Per locus Fst and Dxy among populations
- A file in
workdirper population for per site pi values within each locus
TODO:
- clean up the cookbook notebook
- Test more on real data.
- Verify and ground-truth the calculations
- Verify samples with missing data at a given locus behave themselves (simulated data is too clean)
- Various plotting functions can be imagined
- The way the stats are compiled on the engines and passed back to the main process is super whack (a massive ugly pickled dictionary), consider refactoring to remove ugliness.
- The way the stats are compiled in the main process is ugly and whack and probably slow.
Hi, is it possible to use this tool at the moment? if so, how would one go about doing it? Thanks! and this would be an amazing tool once fully implemented!