
Allow conditioning on a variant

Open pjvandehaar opened this issue 8 years ago • 4 comments

Goncalo says that all you need is r between that variant and each other variant in your data.

Option 1:

Set up an LD server that references the raw data, probably by copying Daniel's HVCF. Having raw data means that security gets complicated, and I don't think I want that.

Option 2: (Probably)

Pre-compute r for all variant pairs within 300kb from the raw data. That is, for each variant, store r against every variant in the next 300kb.
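A minimal sketch of that pre-computation, assuming per-variant dosage arrays are already loaded into memory (the function and variable names here are illustrative, not actual pheweb code):

```python
import numpy as np

WINDOW_BP = 300_000  # the 300kb window proposed above

def pairwise_r(variants):
    """For each variant, yield r against every later variant within
    WINDOW_BP.  `variants` is a position-sorted list of (pos, dosages)
    pairs, where dosages is a numpy array of per-sample genotypes."""
    for i, (pos1, geno1) in enumerate(variants):
        for pos2, geno2 in variants[i + 1:]:
            if pos2 - pos1 > WINDOW_BP:
                break  # sorted input: every later variant is also out of range
            r = np.corrcoef(geno1, geno2)[0, 1]
            yield (pos1, pos2, round(float(r), 2))  # rounded for compact storage
```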

Ways to store it:

  • in a tabixed file containing tab-separated r values out to 300kb. Easy to make/store, decent to use, probably just 2 sigfigs, should be smaller than matrix.tsv.gz. To query, look up the first variant, then iterate through its r values in step with sites.tsv.gz until you hit the second variant.
  • in sqlite3. Two tables: one of [id, chr-pos-ref-alt], another of [variant1_id, variant2_id, r]?
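The sqlite3 option might look like the sketch below; the schema and lookup helper are one possible reading of the two-table idea above, not a committed pheweb design:

```python
import sqlite3

# Illustrative two-table layout: one table mapping a variant id to its
# chr-pos-ref-alt string, one table of (variant1_id, variant2_id, r).
conn = sqlite3.connect(":memory:")  # a file path in practice
conn.executescript("""
CREATE TABLE variant (
    id INTEGER PRIMARY KEY,
    cpra TEXT UNIQUE NOT NULL          -- e.g. '6-32400000-A-G'
);
CREATE TABLE ld (
    variant1_id INTEGER NOT NULL REFERENCES variant(id),
    variant2_id INTEGER NOT NULL REFERENCES variant(id),
    r REAL NOT NULL,
    PRIMARY KEY (variant1_id, variant2_id)
) WITHOUT ROWID;                       -- clustered on the pair for fast lookups
""")

def r_between(cpra1, cpra2):
    """Look up r for a pair, trying both orderings since each pair
    is stored only once."""
    row = conn.execute("""
        SELECT r FROM ld
        JOIN variant v1 ON v1.id = ld.variant1_id
        JOIN variant v2 ON v2.id = ld.variant2_id
        WHERE (v1.cpra = ? AND v2.cpra = ?) OR (v1.cpra = ? AND v2.cpra = ?)
    """, (cpra1, cpra2, cpra2, cpra1)).fetchone()
    return row[0] if row else None
```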

pjvandehaar avatar Feb 23 '17 22:02 pjvandehaar

Hi Peter,

I also have Python code that uses raw tabix'ed VCFs + numpy linear algebra (which can be sped up by compiling against BLAS/LAPACK).
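For reference, the numpy part of that kind of approach can be as simple as one `np.corrcoef` call over a genotype matrix; this is a hedged sketch of the idea, not Daniel's actual code, and the VCF/tabix extraction step is omitted:

```python
import numpy as np

def ld_matrix(genotypes):
    """All-pairs Pearson r for a region in one BLAS-backed call.
    `genotypes` is an (n_variants, n_samples) dosage matrix, e.g. parsed
    from a tabix'ed VCF slice covering the region of interest."""
    return np.corrcoef(genotypes)
```

r between the i-th and j-th variants of the slice is then `ld_matrix(g)[i, j]`.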

Daniel

On Feb 23, 2017, at 5:26 PM, Peter VandeHaar [email protected] wrote:

Goncalo says that all you need is r between that variant and each other variant in your data. (separate for cases and controls?)

Option 1: (I GUESS SO)

Set up an LD server that references the raw data, probably by copying Daniel's HVCF.

Option 2: (NAH)

Pre-compute r for all variant pairs within 300kb from the raw data.


dtaliun avatar Feb 24 '17 01:02 dtaliun

If some 300kb regions have 10x the average variant density, that could somewhat increase the size of the pre-computed correlations. If some have 100x the average (i.e., 1% of all variants in 0.01% of the genome), we'll have a problem. Oh well, hopefully that doesn't happen.

Maybe we should only allow conditioning on variants with pval < 1e-4. But if we want to support conditional meta-analysis, then we can't have a restriction like that.

How many variants will there probably be in TOPMed?

pjvandehaar avatar Mar 06 '17 19:03 pjvandehaar

Most variants and regions will have p < 10^-4 for something.

The densest regions are around the HLA genes in the MHC on chromosome 6.

If we set this up right, we should only need one covariance table for many traits and variants in one PheWeb.

G



abecasis avatar Mar 06 '17 19:03 abecasis

(While I'm doing this, remember to also use study-specific LD when showing LD in LocusZoom. I'm not sure how we'll handle LD for meta-analyses. Perhaps it'd be fun to toggle between 1000G and study-specific LD, etc.?)

pjvandehaar avatar Mar 17 '17 21:03 pjvandehaar