Allow conditioning on a variant
Goncalo says that all you need is r between that variant and each other variant in your data.
Option 1:
Set up an LD server that references the raw data, probably by copying Daniel's HVCF. Having raw data means that security gets complicated, and I think I don't want that.
Option 2: (Probably)
Pre-compute r for all variant pairs within 300kb from the raw data. ie, for each variant, store r for all variants within the next 300kb.
Ways to store it:
- in a tabixed file containing tab-separated `r`s until they hit 300kb. Easy to make/store, decent to use, probably just 2 sigfigs, should be smaller than `matrix.tsv.gz`. So, look up the first variant, and iterate through `r`s at the same time as iterating through `sites.tsv.gz` until you hit the second variant.
- in sqlite3. Two tables, one of [`id`, `chr-pos-ref-alt`], another of [`variant1_id`, `variant2_id`, `r`]? (rough sketch below)
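A minimal sketch of what the sqlite3 layout might look like (table/column names are my own guesses, nothing here is decided):

```python
# Sketch of the two-table sqlite3 option: one table mapping variant ids to
# chr-pos-ref-alt, one table of pairwise r.  Assumes pairs are stored once,
# with variant1 the leftmost variant of the pair.
import sqlite3

conn = sqlite3.connect('ld.db')
conn.executescript('''
CREATE TABLE IF NOT EXISTS variant (
    id INTEGER PRIMARY KEY,
    chrom TEXT NOT NULL, pos INTEGER NOT NULL, ref TEXT NOT NULL, alt TEXT NOT NULL,
    UNIQUE (chrom, pos, ref, alt)
);
CREATE TABLE IF NOT EXISTS ld (
    variant1_id INTEGER NOT NULL REFERENCES variant(id),
    variant2_id INTEGER NOT NULL REFERENCES variant(id),
    r REAL NOT NULL,  -- probably rounded to ~2 sigfigs
    PRIMARY KEY (variant1_id, variant2_id)
);
''')

def get_r(conn, v1, v2):
    '''Look up r for two variants given as (chrom, pos, ref, alt) tuples.'''
    # order the pair so the leftmost variant is variant1, matching how pairs are stored
    v1, v2 = sorted([v1, v2], key=lambda v: (v[0], v[1]))
    row = conn.execute(
        'SELECT ld.r FROM ld '
        'JOIN variant a ON a.id = ld.variant1_id '
        'JOIN variant b ON b.id = ld.variant2_id '
        'WHERE a.chrom=? AND a.pos=? AND a.ref=? AND a.alt=? '
        '  AND b.chrom=? AND b.pos=? AND b.ref=? AND b.alt=?',
        v1 + v2).fetchone()
    return None if row is None else row[0]
```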
Hi Peter,
I also have pythonic code that uses raw tabix'ed VCFs + numpy linear algebra (which can be sped up by compiling against BLAS/LAPACK).
Daniel
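(Not Daniel's code — just a rough sketch of that approach, assuming pysam for reading the tabix'ed VCF and hard GT calls rather than dosages; `dosages` and `r_with_neighbors` are hypothetical names.)

```python
# Pull genotypes for a window out of a tabix'ed VCF and let numpy compute r
# between the conditioning variant and each neighbor.
import numpy as np
import pysam

def dosages(record):
    '''0/1/2 alt-allele counts for one VCF record (missing genotype -> nan).'''
    return np.array([sum(call['GT']) if None not in call['GT'] else np.nan
                     for call in record.samples.values()], dtype=float)

def r_with_neighbors(vcf_path, chrom, pos, window=300_000):
    '''r between the variant at (chrom, pos) and every other variant within `window` bp.'''
    vcf = pysam.VariantFile(vcf_path)
    target, others = None, []
    for rec in vcf.fetch(chrom, max(0, pos - window), pos + window):
        if rec.pos == pos:
            target = dosages(rec)
        else:
            others.append((rec.pos, rec.ref, rec.alts[0], dosages(rec)))
    if target is None:
        raise ValueError('conditioning variant not found')
    out = {}
    for other_pos, ref, alt, geno in others:
        ok = ~np.isnan(target) & ~np.isnan(geno)  # samples with both genotypes called
        if ok.sum() < 2 or target[ok].std() == 0 or geno[ok].std() == 0:
            continue
        out[(chrom, other_pos, ref, alt)] = float(np.corrcoef(target[ok], geno[ok])[0, 1])
    return out
```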
On Feb 23, 2017, at 5:26 PM, Peter VandeHaar [email protected] wrote:
Goncalo says that all you need is r between that variant and each other variant in your data. (separate for cases and controls?)
Option 1: (I GUESS SO)
Set up an LD server that references the raw data, probably by copying Daniel's HVCF.
Option 2: (NAH)
Pre-compute r for all variant pairs within 300kb from the raw data.
If some 300kb regions have 10x the average variant density, that will noticeably increase the size of the pre-computed correlations, since the number of stored pairs in a region grows with the square of its density. If some have 100x the average (ie, 1% of all variants in 0.01% of the genome), we'll have a problem. Oh well, hopefully that doesn't happen.
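Rough arithmetic for that density worry (the numbers below are made-up placeholders, not TOPMed estimates):

```python
# Back-of-envelope: with a 300kb window, both the number of variants in a
# region and the number of neighbors each one has scale with density, so the
# stored pair count scales with density squared.
def pairs_in_region(region_kb, variants_per_kb, window_kb=300):
    n_variants = region_kb * variants_per_kb
    neighbors_per_variant = min(window_kb, region_kb) * variants_per_kb
    return n_variants * neighbors_per_variant

baseline = pairs_in_region(1000, 10)    # 1Mb at a placeholder "average" density
dense10x = pairs_in_region(1000, 100)   # same 1Mb at 10x that density
print(dense10x / baseline)              # -> 100.0: 10x density => ~100x pairs
```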
Maybe we should only allow conditioning on variants with pval < 1e-4. But if we want to support conditional meta-analysis, then we can't have a restriction like that.
How many variants will there probably be in TOPMed?
Most variants and regions will have p < 10^-4 for something.
Densest regions are around HLA genes and MHC on chromosome 6.
If we set this up right, we should only need one covariance table for many traits and variants in one PheWeb.
G
(While I'm doing this, remember to also use study-specific LD for showing LD in LocusZoom. I'm not sure how we'll handle LD for meta-analyses. Perhaps it'd be fun to toggle 1000G vs study-specific LD, &c?)