xqtl-protocol icon indicating copy to clipboard operation
xqtl-protocol copied to clipboard

potential reason for descripency of methylation result between M run and beta run.

Open hsun3163 opened this issue 3 years ago • 7 comments

Examplify by chrom 21, although both runs recognized the same dimension of input: beta

cis-QTL mapping: nominal associations for all variant-phenotype pairs
  * 655 samples
  * 3211 phenotypes
  * 76 covariates
  * 158903 variants
  * checking phenotypes: 3211/3211
  * Computing associations
    Mapping chromosome chr21
    processing phenotype 3211/3211
    time elapsed: 3.98 min
    * writing output
done.

vs

M

cis-QTL mapping: nominal associations for all variant-phenotype pairs
  * 655 samples
  * 3211 phenotypes
  * 76 covariates
  * 158903 variants
  * checking phenotypes: 3211/3211
  * Computing associations
    Mapping chromosome chr21
    processing phenotype 3211/3211
    time elapsed: 0.23 min
    * writing output

The output nominal table is 29857210 for beta but 3173657 (one tenth of beta) fo M The variants that are in the output tables are 112902 out of 158903 for the M

This hints a cis windows problem to me as some snps are not considered overlap with the cis region.

hsun3163 avatar Oct 05 '22 19:10 hsun3163

As it turned out, the input file for both run are perfectly identical in terms of number of rows. Therefore their TSS are indeed identical

hsun3163 avatar Oct 05 '22 20:10 hsun3163

If the input are all legit, then the problem boils down to the number of rows to be analysis. Could be the same underline reason as this https://github.com/cumc/xqtl-pipeline/issues/427

hsun3163 avatar Oct 05 '22 20:10 hsun3163

following exert from the map_norminal function can successfully generate the file with correct size:

igc = genotypeio.InputGeneratorCis(genotype_df, variant_df, phenotype_df, phenotype_pos_df, group_s=None, window=1000000)
for chrom in igc.chrs:
        # allocate arrays
        n = 0  # number of pairs
        if group_s is None:
            for i in igc.phenotype_pos_df[igc.phenotype_pos_df['chr'] == chrom].index:
                j = igc.cis_ranges[i]
                n += j[1] - j[0] + 1
        else:
            for i in igc.group_s[igc.phenotype_pos_df['chr'] == chrom].drop_duplicates().index:
                j = igc.cis_ranges[i]
                n += j[1] - j[0] + 1

        chr_res = OrderedDict()
        chr_res['phenotype_id'] = []
        chr_res['variant_id'] = []
        chr_res['tss_distance'] = np.empty(n, dtype=np.int32)
        chr_res['af'] =           np.empty(n, dtype=np.float32)
        chr_res['ma_samples'] =   np.empty(n, dtype=np.int32)
        chr_res['ma_count'] =     np.empty(n, dtype=np.int32)
        if interaction_df is None:
            chr_res['pval_nominal'] = np.empty(n, dtype=np.float64)
            chr_res['slope'] =        np.empty(n, dtype=np.float32)
            chr_res['slope_se'] =     np.empty(n, dtype=np.float32)
        else:
            chr_res['pval_g'] =  np.empty(n, dtype=np.float64)
            chr_res['b_g'] =     np.empty(n, dtype=np.float32)
            chr_res['b_g_se'] =  np.empty(n, dtype=np.float32)
            chr_res['pval_i'] =  np.empty([n, ni], dtype=np.float64)
            chr_res['b_i'] =     np.empty([n, ni], dtype=np.float32)
            chr_res['b_i_se'] =  np.empty([n, ni], dtype=np.float32)
            chr_res['pval_gi'] = np.empty([n, ni], dtype=np.float64)
            chr_res['b_gi'] =    np.empty([n, ni], dtype=np.float32)
            chr_res['b_gi_se'] = np.empty([n, ni], dtype=np.float32)

as shown below

len(np.empty(n, dtype=np.float32))
29857210

which is contradictory to what we observed in the results. The only reason left is that there is some sort of filtering to remove vast majority of this table

hsun3163 avatar Oct 05 '22 22:10 hsun3163

There is indeed a filter within the map_nominal function, that is, the function will filter out all the rows whose nominal data is null

                m = chr_res_df['pval_nominal'].notnull()
                chr_res_df.loc[m, 'pval_nominal'] = 2*stats.t.cdf(-chr_res_df.loc[m, 'pval_nominal'].abs(), dof)

Therefore the number of rows is not necessarily the same as the length of all possible phenotype_snp combinations.

This feature doesn't seem to be very useful as more than once we have encountered result table with pval_nominal == NA. examplify by #417

hsun3163 avatar Oct 05 '22 22:10 hsun3163

The code used in this ticket are documented in https://github.com/cumc/brain-xqtl-analysis/blob/main/analysis/Wang_Columbia/mqtl/diag/result_size_descripancy.ipynb

hsun3163 avatar Oct 05 '22 22:10 hsun3163

thanks @hsun3163 its great u figured it out. But what exactly is

m = chr_res_df['pval_nominal'].notnull() what does it mean by a null result?

I think this is some behavior of tensorQTL we can discuss with the developers. I can live with results such as effect size beta estimate is 0 with standard error Inf, -- this will give z-score = 0 and two-sided p-value = 1. Simply output NA (not available) does not seem to be a sensible behavior.

gaow avatar Oct 05 '22 22:10 gaow

Let's keep this ticket open. I'll try to work with you down the road to find some concrete examples and reach out to tensorQTL developers with our suggested behavior.

gaow avatar Oct 05 '22 22:10 gaow