
compute covariance matrix and PCA

Open petrelharp opened this issue 6 years ago • 3 comments

To compute a covariance matrix or do PCA, we currently have to export the genotype matrix. It would be nice to do this using the statistics framework instead, and it is totally doable. It is also possible to find PCs without computing the covariance matrix first. This would be a compact but very useful contribution for someone to take on.

petrelharp avatar Jul 30 '19 20:07 petrelharp

I am curious about this, especially how to compute the PCA without the covariance matrix. Do you have a paper explaining this?

daniel-trejobanos avatar Nov 01 '19 12:11 daniel-trejobanos

I have not worked out the details, but the basic idea is that the PCs are the eigenvectors of the genetic covariance matrix, and modern iterative methods (like Krylov methods) can find the eigenvectors of a matrix A without ever computing A explicitly, using only the results of multiplying A by some random vectors. This paper does something like this: https://www.ncbi.nlm.nih.gov/pubmed/26924531. In our situation, A = G^T G, where G is the genotype matrix (possibly normalized), and so we can use our general statistics to quickly compute u^T A v for vectors u and v.

That's the general idea. It's possible there's something very tricky in there, but I'm happy to help work it through.
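
To make the matrix-free idea concrete, here is a minimal sketch in plain NumPy/SciPy. It is not tskit's implementation: it assumes a small random 0/1 matrix as a stand-in for G (rows are sites, columns are samples, centered per site), and it uses SciPy's `LinearOperator` with `eigsh` (a Lanczos-type Krylov solver) as one possible iterative method. The names `G`, `matvec`, and `A_op` are illustrative only. The point is that the solver only ever asks for products A v, which are computed here as G^T (G v), so A itself is never formed.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(42)
num_sites, num_samples = 200, 30

# Assumption: a random 0/1 matrix stands in for the genotype matrix G
# (sites x samples). In practice G would come from the data.
G = rng.integers(0, 2, size=(num_sites, num_samples)).astype(float)
G -= G.mean(axis=1, keepdims=True)  # center each site across samples


def matvec(v):
    # Compute A v = G^T (G v) without ever forming A = G^T G.
    return G.T @ (G @ v)


# The solver sees only this operator, never an explicit matrix.
A_op = LinearOperator((num_samples, num_samples), matvec=matvec, dtype=float)

k = 3  # number of principal components to extract
eigvals, eigvecs = eigsh(A_op, k=k, which="LM")

# Sort so that the largest eigenvalue (first PC) comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
```

Each column of `eigvecs` holds one PC, with one entry per sample. In the setting described above, the body of `matvec` is the piece that would be swapped for a computation on the trees rather than on an exported genotype matrix; how to do that efficiently is exactly the detail that has not been worked out here.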

petrelharp avatar Nov 02 '19 00:11 petrelharp

Also see performance issues in https://github.com/tskit-dev/tskit/issues/1743

hyanwong avatar Jul 05 '23 10:07 hyanwong

I think we can close this issue with 0f7fa20?

hanbin973 avatar Mar 17 '25 17:03 hanbin973

🎉 Closed in #3008 and other places!

petrelharp avatar Mar 19 '25 23:03 petrelharp