StatsAPI.jl icon indicating copy to clipboard operation
StatsAPI.jl copied to clipboard

Exposing rank deficient info

Open gragusa opened this issue 3 years ago • 1 comments

I want to discuss the best way to have an API for RegressionModels that are rank deficient. In particular, I would like to have a mechanism allowing packages interfacing through StatsAPI to recover the collinear columns of the model matrix.

Consider this example (I am going to use GLM.jl):

X1 = randn(100)
X2 = randn(100)
y = X1 + X2 + randn(100)
df = DataFrame(y=y,X1=X1,X2=X2, X3=X1, X4=X1.+X2)
lm_1 = lm(@formula(y~X1+X2+X3+X4), df)

Using the StatsAPI, there is no way to recover the three columns of the model matrix that are linearly independent. However, one can do this by looking at the DensePredChol (and eventually at the DensePredQR).

For instance, if I want to calculate X'X using only these columns, I could do

X = modelmatrix(lm_1)
XX = X'X
ch = lm_1.model.pp.chol
rnk = rank(ch)
XX[1:end .∉ [ch.p[rnk+1:end]], 1:end .∉ [ch[rnk+1:end]]]

and similarly, for the columns of the modelmatrix:

Z = X[:, 1:end .∉ [ch.p[rnk+1:end]]] 

I don't know the best way to expose this at the API level. Maybe something like this?

function StatsAPI.collinearindexes(r::RegressionModel) end

The implementation in GLM could then look something like this:

function GLM.collinearindexes(m::LinPredModel) ## of course only for `CholeskyPivoted` (and `QRPivoted` when implemented)
  ch = m.pp.chol
  rnk = rank(ch)
  return ch.p[1:rnk], ch.p[rnk+1:end]
end

Probably not great. Ideas?

P.S.: This is very useful to make packages like CovarianceMatrices play nice with rank-deficient RegressionModels.

gragusa avatar Oct 24 '22 21:10 gragusa

Any comments? Would anybody object to this implementation?

gragusa avatar Dec 20 '22 17:12 gragusa