polyRAD

Support pedigreed populations

Open lvclark opened this issue 5 years ago • 3 comments

I'd like to make a new pipeline that uses pedigree information. Genotype estimates of parents and offspring will iteratively influence genotype priors of parents and offspring. Even for biparental populations, this could perform much better than the existing pipeline, which doesn't handle segregation distortion well.

I probably won't tackle this until I have added support for multiploid populations (Issue #17), in order to avoid having to rewrite a lot of code after the fact.

If you have a good test dataset for this sort of population, please contact me!

lvclark, Jan 20 '21 16:01

Leaving some notes here for myself:

There needs to be a way to deal with errors in the pedigree, since greenhouse mixups, wayward pollen, and unexpected self-fertilization are so common. Maybe assign a prior probability that each connection in the pedigree is correct. Then do a Bayesian comparison of the hypothesis that the pedigree is correct vs. the hypothesis that the individual is just a random individual in the population.
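A minimal sketch of that Bayesian comparison, in Python for illustration only (polyRAD itself is R/Rcpp). The function name, its arguments, and the default prior of 0.9 are all hypothetical; the two log-likelihoods are assumed to have been computed elsewhere, e.g. summed across markers.

```python
from math import exp, log


def pedigree_link_posterior(loglik_pedigree, loglik_random, prior_correct=0.9):
    """Posterior probability that a recorded pedigree link is correct.

    loglik_pedigree: log-likelihood of the individual's data if the
        recorded parents are the true parents.
    loglik_random: log-likelihood of the same data if the individual is
        just a random member of the population.
    prior_correct: assumed prior probability that the link is correct
        (hypothetical default, not a polyRAD setting).
    """
    if not 0.0 < prior_correct < 1.0:
        raise ValueError("prior_correct must be strictly between 0 and 1")
    # Work with log odds so that many markers do not underflow.
    log_odds = (log(prior_correct) - log(1.0 - prior_correct)
                + loglik_pedigree - loglik_random)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + exp(-log_odds))
    e = exp(log_odds)
    return e / (1.0 + e)
```

When the data are equally likely under both hypotheses, the result is just the prior; a link whose data are far more likely for a random individual ends up with a posterior near zero and could be flagged or dropped.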

Alternatively, get a set of inter-individual distances using read depth ratios, and let the user interactively identify pedigree errors.

For missing parents, we can simply add individuals with zero read depth.
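To illustrate why zero read depth works as a placeholder, here is a small Python sketch under a simple binomial read model. That model, the `error` parameter, and the function name are assumptions made for this example and not necessarily what polyRAD uses. With no reads, every genotype has the same likelihood, so the posterior equals whatever prior the pedigree supplies.

```python
from math import comb


def genotype_posteriors(ref_reads, alt_reads, prior, error=0.001):
    """Posterior genotype probabilities for one individual at one marker.

    prior has one entry per alternative-allele dosage (0..ploidy).
    Assumes reads are binomially sampled (a stand-in read model).
    """
    ploidy = len(prior) - 1
    depth = ref_reads + alt_reads
    unnormalized = []
    for dosage in range(ploidy + 1):
        frac = dosage / ploidy
        # Expected alternative read fraction, allowing for read errors.
        p_alt = frac * (1 - error) + (1 - frac) * error
        lik = comb(depth, alt_reads) * p_alt ** alt_reads * (1 - p_alt) ** ref_reads
        unnormalized.append(prior[dosage] * lik)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# A missing parent: zero reads, so the posterior is just the prior.
missing_parent = genotype_posteriors(0, 0, [0.25, 0.5, 0.25])
```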

lvclark, Sep 15 '21 13:09

All individuals start with even (uniform) priors. As information is added across the pedigree, the priors get multiplied by the new information and renormalized to sum to one.
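The multiply-and-normalize update could look like the following Python sketch. The function name and the plain-list representation (one entry per genotype) are assumptions for illustration.

```python
def update_prior(prior, information):
    """Multiply a prior by new per-genotype information and renormalize."""
    if len(prior) != len(information):
        raise ValueError("prior and information must have the same length")
    product = [p * i for p, i in zip(prior, information)]
    total = sum(product)
    if total == 0:
        raise ValueError("new information rules out every genotype")
    return [x / total for x in product]


# Start from even priors for a diploid (dosages 0, 1, 2), then learn
# that dosage 2 is impossible and the other two are equally supported.
even = [1 / 3, 1 / 3, 1 / 3]
updated = update_prior(even, [0.5, 0.5, 0.0])
```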

The unit of analysis should be a single pair of parents and their offspring. Keep a list giving the sample names of the parents and offspring in each family. Then, for each marker and each family, jointly estimate the probabilities of both parental genotypes, using what we already know about parent and offspring genotypes. For each ploidy combination, have a precomputed list of every possible parental genotype combination, along with the progeny genotypes possible under each.

The probability that a given parental genotype combination is the true one is proportional to the product of the probability of each parent having that genotype and, for each offspring, the probability of having a genotype that is possible under that cross (ignoring expected genotype frequencies, because we could have segregation distortion!). That then goes back to inform the priors of individuals: basically the probability of each genotype under each parental genotype combination, weighted by the probability of the parental genotype combination.
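A rough Python sketch of this step for one family at one marker. Several things here are assumptions made for the example rather than settled design: all function names, even ploidies with polysomic inheritance and no double reduction (used to decide which progeny genotypes are possible), reading "ignoring expected genotype frequencies" as spreading weight evenly over the possible progeny genotypes, and taking the new parental priors as marginals of the combination probabilities.

```python
from itertools import product


def gamete_range(dosage, ploidy):
    """Min and max allele copies a gamete can carry (no double reduction)."""
    half = ploidy // 2
    return max(0, half - (ploidy - dosage)), min(dosage, half)


def possible_progeny(g1, ploidy1, g2, ploidy2):
    """Progeny dosages that are possible from parental dosages g1 and g2."""
    lo1, hi1 = gamete_range(g1, ploidy1)
    lo2, hi2 = gamete_range(g2, ploidy2)
    return range(lo1 + lo2, hi1 + hi2 + 1)


def cross_weights(post_p1, post_p2, post_offspring):
    """Normalized probability of each parental genotype combination."""
    ploidy1, ploidy2 = len(post_p1) - 1, len(post_p2) - 1
    weights = {}
    for g1, g2 in product(range(ploidy1 + 1), range(ploidy2 + 1)):
        allowed = possible_progeny(g1, ploidy1, g2, ploidy2)
        w = post_p1[g1] * post_p2[g2]
        for post in post_offspring:
            # Probability that this offspring has any genotype the cross allows.
            w *= sum(post[g] for g in allowed)
        weights[(g1, g2)] = w
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no parental genotype combination fits the data")
    return {combo: w / total for combo, w in weights.items()}


def updated_priors(weights, ploidy1, ploidy2):
    """New priors for parent 1, parent 2, and offspring from combo weights."""
    parent1 = [0.0] * (ploidy1 + 1)
    parent2 = [0.0] * (ploidy2 + 1)
    offspring = [0.0] * ((ploidy1 + ploidy2) // 2 + 1)
    for (g1, g2), w in weights.items():
        parent1[g1] += w
        parent2[g2] += w
        allowed = possible_progeny(g1, ploidy1, g2, ploidy2)
        for g in allowed:
            # Even weight over possible genotypes, not Mendelian ratios.
            offspring[g] += w / len(allowed)
    return parent1, parent2, offspring
```

For example, with diploid parents where parent 1 is equally likely to have dosage 0 or 1, parent 2 certainly has dosage 0, and one offspring is confidently heterozygous, `cross_weights([0.5, 0.5, 0.0], [1.0, 0.0, 0.0], [[0.0, 1.0, 0.0]])` puts all weight on the combination `(1, 0)`, so the offspring pulls parent 1 toward the heterozygous genotype.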

So in essence

  1. Estimate individual genotype posterior probabilities under even priors
  2. Estimate probabilities of parental genotype combinations, given parent and offspring genotype posterior probabilities
  3. Update individual priors based on probabilities of parental genotype combinations
  4. Re-estimate individual genotype posterior probabilities under the new priors

Perform as many iterations as the maximum number of generations between individuals, or find some other way to make sure grandparents are influenced by grandchildren genotypes etc.
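One way to get that iteration count is the length of the longest ancestor-to-descendant chain in the pedigree. The Python sketch below assumes a hypothetical representation in which a dict maps each individual to a pair of parent names, with `None` for unknown parents; whether this count is always sufficient would depend on the order in which families are updated.

```python
def max_generations(pedigree):
    """Longest chain of parent-offspring links in the pedigree.

    pedigree: dict mapping individual -> (parent1, parent2); a parent
    that is None or absent from the dict is treated as a founder.
    """
    depth = {}

    def visit(ind, path):
        if ind in depth:
            return depth[ind]
        if ind in path:
            raise ValueError(f"pedigree contains a cycle at {ind!r}")
        parents = [p for p in pedigree.get(ind, ()) if p is not None]
        if parents:
            d = 1 + max(visit(p, path | {ind}) for p in parents)
        else:
            d = 0  # founder
        depth[ind] = d
        return d

    return max((visit(ind, frozenset()) for ind in pedigree), default=0)


example_pedigree = {
    "G1": (None, None),
    "G2": (None, None),
    "P1": ("G1", "G2"),
    "P2": (None, None),
    "C1": ("P1", "P2"),
}
n_iterations = max_generations(example_pedigree)
```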

lvclark, Oct 03 '21 16:10

The internal Rcpp function ThirdDimProd may be helpful.

lvclark, Oct 09 '21 15:10