Error message ValueError: constrained design matrix is not full rank: 7 8
Hello,
I am currently using diffxpy for my differential analysis and tried using two factors, "time_point" and "sample", in formula_loc. My data consist of 2 time points (juvenile and adult) and 7 samples across those two time points. But when I run the code, I get the following error:
test = de.test.wald(
data=adata_lcpm_1,
formula_loc="~ 1 + time_point + sample",
coef_to_test="time_point",
factor_loc_totest=["time_point", "sample"]
)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-52-76f667c65eff> in <module>
----> 1 test = de.test.wald(
2 data=adata_lcpm_1,
3 formula_loc="~ 1 + time_point + sample",
4 coef_to_test="time_point",
5 factor_loc_totest=["time_point", "sample"]
~/anaconda3/envs/diffxpy/bin/diffxpy/diffxpy/testing/tests.py in wald(data, factor_loc_totest, coef_to_test, formula_loc, formula_scale, as_numeric, init_a, init_b, gene_names, sample_description, dmat_loc, dmat_scale, constraints_loc, constraints_scale, noise_model, size_factors, batch_size, backend, train_args, training_strategy, quick_scale, dtype, **kwargs)
645
646 # Build design matrices and constraints.
--> 647 design_loc, design_loc_names, constraints_loc, term_names_loc = constraint_system_from_star(
648 dmat=dmat_loc,
649 sample_description=sample_description,
~/anaconda3/envs/diffxpy/bin/diffxpy/diffxpy/testing/utils.py in constraint_system_from_star(dmat, sample_description, formula, as_numeric, constraints, return_type)
264 as_categorical = True
265
--> 266 return glm.data.constraint_system_from_star(
267 dmat=dmat,
268 sample_description=sample_description,
~/anaconda3/envs/diffxpy/bin/batchglm/batchglm/data.py in constraint_system_from_star(dmat, sample_description, formula, as_categorical, constraints, return_type)
248 if cmat is None:
249 if np.linalg.matrix_rank(dmat) != dmat.shape[1]:
--> 250 raise ValueError(
251 "constrained design matrix is not full rank: %i %i" %
252 (np.linalg.matrix_rank(dmat), dmat.shape[1])
ValueError: constrained design matrix is not full rank: 7 8
I found a similar issue here, but it was resolved by using the as_numeric parameter. In my case the 'sample' factor is categorical, so that method does not apply. Could you help me resolve this problem? Thank you!
I originally posted this in the tutorial GitHub repository, but it belongs here.
Hi @faniafeby, could you post the unique rows of your sample description, i.e. adata_lcpm_1.obs[["time_point", "sample"]].drop_duplicates()? Likely there is confounding between time and sample.
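For anyone hitting the same error, here is a minimal sketch of that check on made-up data. The `obs` table below is hypothetical (7 samples, each belonging to exactly one time point) and only stands in for `adata_lcpm_1.obs`; the design matrix is built by hand with pandas dummy coding as an approximation of what the formula `~ 1 + time_point + sample` expands to, not via diffxpy itself.

```python
import numpy as np
import pandas as pd

# Hypothetical sample description: 7 samples, each from exactly one time point.
obs = pd.DataFrame({
    "time_point": ["juvenile"] * 6 + ["adult"] * 8,
    "sample": ["S1", "S1", "S2", "S2", "S3", "S3",
               "S4", "S4", "S5", "S5", "S6", "S6", "S7", "S7"],
})

# Unique (time_point, sample) combinations, as asked for in the thread.
unique_rows = obs[["time_point", "sample"]].drop_duplicates()

# If every sample occurs in only one time point, sample is nested in time_point.
time_points_per_sample = unique_rows.groupby("sample")["time_point"].nunique()
nested = bool((time_points_per_sample == 1).all())

# Hand-built treatment-coded design matrix: intercept + time_point + sample.
dmat = np.column_stack([
    np.ones(len(obs)),
    pd.get_dummies(obs["time_point"], drop_first=True).to_numpy(dtype=float),
    pd.get_dummies(obs["sample"], drop_first=True).to_numpy(dtype=float),
])
rank = np.linalg.matrix_rank(dmat)
n_cols = dmat.shape[1]
print(nested, rank, n_cols)
```

In this toy layout the time point column is a sum of sample columns, so the matrix has one more column than its rank, which is the same kind of mismatch the error message reports.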
Hi @davidsebfischer, below are the unique rows of my dataset:
After removing the duplicates, this table shows only 4 rows out of my n_obs × n_vars = 19330 × 16709. I agree that there may be confounding between time and sample. Does that mean I can't use both factors together in one run? Thanks!
Yes, you have to think about what you want to model: the time effect alone, or the time effect while reducing the between-sample variance. If you want the latter, a trick for running GLMs is to change your setup to
time point, sample, rep
p16, S1, R1
p16, S2, R2
p16, S3, R3
adult, S4, R1
and fit ~1+time+rep+rep:time, which regresses out the variation between R1, R2, R3
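A rough way to see why this setup avoids the rank problem, assuming a balanced layout (the same number of samples at each time point, here 3 and 3, which is an assumption and not taken from the original data). The matrix is again built by hand with pandas and numpy as a stand-in for the formula expansion.

```python
import numpy as np
import pandas as pd

# Hypothetical balanced layout: 3 replicates per time point, 2 cells each.
obs = pd.DataFrame({
    "time": ["p16"] * 6 + ["adult"] * 6,
    "rep": ["R1", "R1", "R2", "R2", "R3", "R3"] * 2,
})

# Treatment-coded main effects (first level of each factor is the reference).
time_d = pd.get_dummies(obs["time"], drop_first=True).to_numpy(dtype=float)  # 1 column
rep_d = pd.get_dummies(obs["rep"], drop_first=True).to_numpy(dtype=float)    # 2 columns

# Interaction columns rep:time are elementwise products of the main-effect columns.
interaction = time_d * rep_d  # shape (12, 1) * (12, 2) broadcasts to (12, 2)

dmat = np.column_stack([np.ones(len(obs)), time_d, rep_d, interaction])
rank = np.linalg.matrix_rank(dmat)
print(dmat.shape, rank)
```

With 2 time points and 3 replicate labels there are 6 populated combinations and 6 columns, so the matrix is full rank. If the groups are unbalanced (say 3 and 4 samples), some time-by-rep combination is empty and this hand-built matrix loses rank again.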
Since my purpose is the latter, should I create a new obs column to represent the sample and time point combination, and then run diffxpy as described?
Thank you for the help!
You can just add the rep column to .obs, you don't have to recreate it!
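One possible way to add that column, sketched on a plain pandas DataFrame standing in for `.obs`. The column names `time_point` and `sample` follow the thread; the data, the `pairs` helper table, and the numbering scheme (R1, R2, ... within each time point, in order of first appearance) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs: 6 samples, 3 per time point.
obs = pd.DataFrame({
    "time_point": ["p16"] * 6 + ["adult"] * 6,
    "sample": ["S1", "S1", "S2", "S2", "S3", "S3",
               "S4", "S4", "S5", "S5", "S6", "S6"],
})

# One row per (time_point, sample) pair.
pairs = obs[["time_point", "sample"]].drop_duplicates()

# Number the samples within each time point: 1, 2, 3, ...
rep_index = pairs.groupby("time_point", sort=False).cumcount() + 1
pairs = pairs.assign(rep="R" + rep_index.astype(str))

# Map every observation's sample to its replicate label.
obs["rep"] = obs["sample"].map(pairs.set_index("sample")["rep"])
print(obs.drop_duplicates())
```

On an AnnData object the last assignment would target `adata.obs["rep"]` instead. Mechanically, the same numbering extends to more samples per group, as long as every time point has the same number of samples.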
Same issue. Is there a way to generalize this trick if I have 8 samples for young and 8 samples for old groups (16 unique groups in total)? I think it more resembles the case with embedded effects.