potential reason for descripency of methylation result between M run and beta run.
Examplify by chrom 21, although both runs recognized the same dimension of input: beta
cis-QTL mapping: nominal associations for all variant-phenotype pairs
* 655 samples
* 3211 phenotypes
* 76 covariates
* 158903 variants
* checking phenotypes: 3211/3211
* Computing associations
Mapping chromosome chr21
processing phenotype 3211/3211
time elapsed: 3.98 min
* writing output
done.
vs
M
cis-QTL mapping: nominal associations for all variant-phenotype pairs
* 655 samples
* 3211 phenotypes
* 76 covariates
* 158903 variants
* checking phenotypes: 3211/3211
* Computing associations
Mapping chromosome chr21
processing phenotype 3211/3211
time elapsed: 0.23 min
* writing output
The output nominal table is 29857210 for beta but 3173657 (one tenth of beta) fo M The variants that are in the output tables are 112902 out of 158903 for the M
This hints a cis windows problem to me as some snps are not considered overlap with the cis region.
As it turned out, the input file for both run are perfectly identical in terms of number of rows. Therefore their TSS are indeed identical
If the input are all legit, then the problem boils down to the number of rows to be analysis. Could be the same underline reason as this https://github.com/cumc/xqtl-pipeline/issues/427
following exert from the map_norminal function can successfully generate the file with correct size:
igc = genotypeio.InputGeneratorCis(genotype_df, variant_df, phenotype_df, phenotype_pos_df, group_s=None, window=1000000)
for chrom in igc.chrs:
# allocate arrays
n = 0 # number of pairs
if group_s is None:
for i in igc.phenotype_pos_df[igc.phenotype_pos_df['chr'] == chrom].index:
j = igc.cis_ranges[i]
n += j[1] - j[0] + 1
else:
for i in igc.group_s[igc.phenotype_pos_df['chr'] == chrom].drop_duplicates().index:
j = igc.cis_ranges[i]
n += j[1] - j[0] + 1
chr_res = OrderedDict()
chr_res['phenotype_id'] = []
chr_res['variant_id'] = []
chr_res['tss_distance'] = np.empty(n, dtype=np.int32)
chr_res['af'] = np.empty(n, dtype=np.float32)
chr_res['ma_samples'] = np.empty(n, dtype=np.int32)
chr_res['ma_count'] = np.empty(n, dtype=np.int32)
if interaction_df is None:
chr_res['pval_nominal'] = np.empty(n, dtype=np.float64)
chr_res['slope'] = np.empty(n, dtype=np.float32)
chr_res['slope_se'] = np.empty(n, dtype=np.float32)
else:
chr_res['pval_g'] = np.empty(n, dtype=np.float64)
chr_res['b_g'] = np.empty(n, dtype=np.float32)
chr_res['b_g_se'] = np.empty(n, dtype=np.float32)
chr_res['pval_i'] = np.empty([n, ni], dtype=np.float64)
chr_res['b_i'] = np.empty([n, ni], dtype=np.float32)
chr_res['b_i_se'] = np.empty([n, ni], dtype=np.float32)
chr_res['pval_gi'] = np.empty([n, ni], dtype=np.float64)
chr_res['b_gi'] = np.empty([n, ni], dtype=np.float32)
chr_res['b_gi_se'] = np.empty([n, ni], dtype=np.float32)
as shown below
len(np.empty(n, dtype=np.float32))
29857210
which is contradictory to what we observed in the results. The only reason left is that there is some sort of filtering to remove vast majority of this table
There is indeed a filter within the map_nominal function, that is, the function will filter out all the rows whose nominal data is null
m = chr_res_df['pval_nominal'].notnull()
chr_res_df.loc[m, 'pval_nominal'] = 2*stats.t.cdf(-chr_res_df.loc[m, 'pval_nominal'].abs(), dof)
Therefore the number of rows is not necessarily the same as the length of all possible phenotype_snp combinations.
This feature doesn't seem to be very useful as more than once we have encountered result table with pval_nominal == NA. examplify by #417
The code used in this ticket are documented in https://github.com/cumc/brain-xqtl-analysis/blob/main/analysis/Wang_Columbia/mqtl/diag/result_size_descripancy.ipynb
thanks @hsun3163 its great u figured it out. But what exactly is
m = chr_res_df['pval_nominal'].notnull() what does it mean by a null result?
I think this is some behavior of tensorQTL we can discuss with the developers. I can live with results such as effect size beta estimate is 0 with standard error Inf, -- this will give z-score = 0 and two-sided p-value = 1. Simply output NA (not available) does not seem to be a sensible behavior.
Let's keep this ticket open. I'll try to work with you down the road to find some concrete examples and reach out to tensorQTL developers with our suggested behavior.