wrong type inference for square terms
Currently, formulas like `@formula(y ~ 1 + a * (a + b) + b * b)` fail. We have to spell out the square terms explicitly, as in `@formula(y ~ 1 + a + (a ^ 2) + (a & b) + b + (b ^ 2))`.
This is because a term like `a & a` now has the type `InteractionTerm{ContinuousTerm{Float64}}` rather than `InteractionTerm{Tuple{ContinuousTerm{Float64}, ContinuousTerm{Float64}}}`, and `InteractionTerm{ContinuousTerm{Float64}}` currently cannot be handled properly.
Thanks for the report and the careful investigation! I think there are two things going on here.
- As you rightly point out, StatsModels chokes on `a & a`. That's an important edge case to consider, and I think it is handled appropriately in #183.
- `@formula(y ~ 1 + a * (a + b) + b * b)` is not, and will likely never be, equivalent to `@formula(y ~ 1 + a + (a ^ 2) + (a & b) + b + (b ^ 2))`, because `*` has a non-standard interpretation in the Wilkinson notation that StatsModels.jl is based on: `a * b` is equivalent to `a + b + a & b`.
Now that I think about it a bit more, I think what you're trying to do is pun on `a & b` being `data[:a] .* data[:b]` when both `a` and `b` are continuous, in which case `a & a` would be the same as `data[:a] .^ 2`. But this is only true when `modelcols` creates a single column for whatever kind of term `a` becomes. In the other obvious case, where `a` is a categorical term, you end up with a row-wise Kronecker product of the contrast-coded data, which produces a wildly overcomplete model matrix that GLM etc. will choke on. So, in the spirit of taking away as many footguns as possible, I think this kind of punning is a bad idea and `a & a` should always become `a`.
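To see the overcompleteness concretely, here's a plain-Julia sketch (no StatsModels needed; the dummy coding and the `rowwise_kron` helper are illustrative assumptions, not library internals) of what interacting a contrast-coded categorical column with itself would produce:

```julia
using LinearAlgebra  # for rank

# Dummy (treatment) coding of a 3-level categorical column `a`:
a = [1, 2, 3, 2, 1]                       # level codes
X = Float64[l == j for l in a, j in 2:3]  # one indicator column per non-reference level

# Row-wise Kronecker product, as an interaction term would form it:
rowwise_kron(A, B) = reduce(vcat, [kron(A[i, :]', B[i, :]') for i in axes(A, 1)])

XX = rowwise_kron(X, X)
size(XX)   # 4 columns from a 2-column coding...
rank(XX)   # ...but only rank 2: the interaction with itself is redundant
```

Two of the four columns are identically zero (the indicators are mutually exclusive) and the other two just reproduce the original indicators, so the self-interaction adds columns without adding rank.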
I'm considering only continuous data, so `a & a` means `a ^ 2` to me.
The current situation is that, for continuous data, `a * (a + b)` => `a*a + a*b` => `a + a&a + a + b + a&b` => `a + b + a&a + a&b`, which is exactly what I want! But unfortunately `a&a` cannot be handled correctly right now.
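For concreteness, a plain-Julia sketch of the pun being relied on here (illustrative data, not StatsModels internals): for continuous columns an interaction's model column is just the elementwise product, so `a&a` would coincide with `a .^ 2`.

```julia
a = [1.0, 2.0, 3.0]
b = [2.0, 3.0, 4.0]

# For continuous data, an interaction's model column is the elementwise product:
col_aa = a .* a   # what a&a would produce
col_ab = a .* b   # what a&b produces

# so the hoped-for expansion a * (a + b) yields these model-matrix columns:
X = hcat(ones(3), a, b, col_aa, col_ab)

col_aa == a .^ 2  # the a&a column is exactly the square term
```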
I see your point that `a&a` should be translated into `a`. As a cross-check, I did the following in R:
```r
> df = data.frame(a = 1:3, b = 2:4, y = 3:5)
> lm(y ~ 1 + a * (a + b), df)

Call:
lm(formula = y ~ 1 + a * (a + b), data = df)

Coefficients:
(Intercept)            a            b          a:b
  2.000e+00    1.000e+00           NA    2.719e-16
```
... and the result is: no `a&a` term in R.....
so maybe the "correct" way to specify a square term is to explicitly write it as `a ^ 2` .... (disappointing, though...)
Yup, if you want `a .^ 2` then `a ^ 2` is the way to do it. If you want both `a` and `a^2`, that starts to sound more like a polynomial basis set... the idea with StatsModels is that users/developers can create their own packages that, when loaded, provide special syntax to do things like that, and a few people have kicked around the idea of making a "basis functions" package that provides that interface.
As a toy example, you might also be interested in the example from the docs showing how to create a "poly term" that generates `[a, a^2, a^3, ..., a^n]` from `@formula(0 ~ poly(a, n))`, and lets you do things like
```julia
julia> fit(LinearModel, @formula(y ~ 1 + poly(a,2) + poly(b,2)), sim_dat)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + poly(a, 2) + poly(b, 2)

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.89288    0.181485    4.92    <1e-5    0.532586   1.25317
a^1           2.73324    0.349194    7.83    <1e-11   2.04001    3.42648
a^2          -1.0114     1.34262    -0.75    0.4531  -3.67684    1.65404
b^1           0.214424   0.136868    1.57    0.1205  -0.0572944  0.486142
b^2           3.15133    0.0811794  38.82    <1e-59   2.99016    3.31249
──────────────────────────────────────────────────────────────────────────
```
or `b * poly(a, 2)` -> `poly(a, 2) + b + poly(a, 2) & b`, which would be equivalent to `b * (a + a^2)`.
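Under the hood, all such a `poly(a, n)` term has to do is emit the columns `[a  a^2  ...  a^n]` from its `modelcols` method. A minimal plain-Julia sketch of that column generation (the `polycols` name is made up for illustration; it is not the docs' actual implementation):

```julia
# Build the columns that a poly(x, n) term expands to:
polycols(x::AbstractVector, n::Integer) = hcat((x .^ i for i in 1:n)...)

a = [1.0, 2.0, 3.0]
P = polycols(a, 3)   # three columns: a, a.^2, a.^3
```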