wrong type inference for square terms
Currently, formulas like `@formula(y ~ 1 + a * (a + b) + b * b)` fail. We have to spell out the square terms explicitly, as in `@formula(y ~ 1 + a + (a ^ 2) + (a & b) + b + (b ^ 2))`.
This is because a term like `a & a` now has the type `InteractionTerm{ContinuousTerm{Float64}}` rather than `InteractionTerm{Tuple{ContinuousTerm{Float64}, ContinuousTerm{Float64}}}`, and `InteractionTerm{ContinuousTerm{Float64}}` currently cannot be handled properly.
Thanks for the report and the careful investigation! I think there are two things going on here.
- As you rightly point out, StatsModels chokes on `a & a`. That's an important edge case to consider, and I think it is handled appropriately in #183.
- `@formula(y ~ 1 + a * (a + b) + b * b)` is not, and will likely never be, equivalent to `@formula(y ~ 1 + a + (a ^ 2) + (a & b) + b + (b ^ 2))`, because `*` has a non-standard interpretation in the Wilkinson notation that StatsModels.jl is based on: `a * b` is equivalent to `a + b + a & b`.
Now that I think about it a bit more, I think what you're trying to do is pun on `a & b` being `data[:a] .* data[:b]` when both `a` and `b` are continuous, in which case `a & a` would be the same as `data[:a] .^ 2`. But this is only true when `modelcols` creates a single column for whatever kind of term `a` becomes. In the other obvious case, where `a` is a categorical term, you end up with a row-wise Kronecker product of the contrast-coded data, which produces a wildly overcomplete model matrix that GLM etc. will choke on. So, in the spirit of taking away as many footguns as possible, I think this kind of punning is a bad idea and `a & a` should always become `a`.
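To see the overcompleteness concretely, here's a plain-Julia sketch (no StatsModels needed; the dummy coding and the `rowwise_kron` helper are illustrative assumptions, not library internals) of what interacting a contrast-coded categorical column with itself would produce:

```julia
using LinearAlgebra  # for rank

# Dummy (treatment) coding of a 3-level categorical column `a`:
a = [1, 2, 3, 2, 1]                       # level codes
X = Float64[l == j for l in a, j in 2:3]  # one indicator column per non-reference level

# Row-wise Kronecker product, as an interaction term would form it:
rowwise_kron(A, B) = reduce(vcat, [kron(A[i, :]', B[i, :]') for i in axes(A, 1)])

XX = rowwise_kron(X, X)
size(XX)   # 4 columns from a 2-column coding...
rank(XX)   # ...but only rank 2: the interaction with itself is redundant
```

Two of the four columns are identically zero (the indicators are mutually exclusive) and the other two just reproduce the original indicators, so the self-interaction adds columns without adding rank.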
I'm considering only continuous data, so `a & a` means `a ^ 2` to me.
The current situation is that, for continuous data, `a * (a + b)` => `a*a + a*b` => `a + a&a + a + b + a&b` => `a + b + a&a + a&b`, which is exactly what I want! But unfortunately `a&a` cannot be handled correctly right now.
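For concreteness, a plain-Julia sketch of the pun being relied on here (illustrative data, not StatsModels internals): for continuous columns an interaction's model column is just the elementwise product, so `a&a` would coincide with `a .^ 2`.

```julia
a = [1.0, 2.0, 3.0]
b = [2.0, 3.0, 4.0]

# For continuous data, an interaction's model column is the elementwise product:
col_aa = a .* a   # what a&a would produce
col_ab = a .* b   # what a&b produces

# so the hoped-for expansion a * (a + b) yields these model-matrix columns:
X = hcat(ones(3), a, b, col_aa, col_ab)

col_aa == a .^ 2  # the a&a column is exactly the square term
```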
I see your point that `a&a` should be translated into `a`. As a cross-check, I did the following in R:
```r
> df = data.frame(a = 1:3, b = 2:4, y = 3:5)
> lm(y ~ 1 + a * (a + b), df)

Call:
lm(formula = y ~ 1 + a * (a + b), data = df)

Coefficients:
(Intercept)            a            b          a:b
  2.000e+00    1.000e+00           NA    2.719e-16
```
... and the result is: no `a&a` term in R.....
so maybe the "correct" way to specify a square term is to explicitly write it as `a ^ 2` .... (disappointing, though...)
Yup, if you want `a .^ 2` then `a ^ 2` is the way to do it. If you want both `a` and `a^2`, that starts to sound more like a polynomial basis set... the idea with StatsModels is that users/developers can create their own packages that, when loaded, provide special syntax to do things like that, and a few people have kicked around the idea of making a "basis functions" package that provides that interface.
As a toy example, you might also be interested in the example from the docs showing how to create a "poly term" that generates `[a, a^2, a^3, ..., a^n]` from `@formula(0 ~ poly(a, n))`, and lets you do things like
```julia
julia> fit(LinearModel, @formula(y ~ 1 + poly(a,2) + poly(b,2)), sim_dat)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + poly(a, 2) + poly(b, 2)

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.89288    0.181485    4.92    <1e-5    0.532586   1.25317
a^1           2.73324    0.349194    7.83    <1e-11   2.04001    3.42648
a^2          -1.0114     1.34262    -0.75    0.4531  -3.67684    1.65404
b^1           0.214424   0.136868    1.57    0.1205  -0.0572944  0.486142
b^2           3.15133    0.0811794  38.82    <1e-59   2.99016    3.31249
──────────────────────────────────────────────────────────────────────────
```
or `b * poly(a, 2)` -> `poly(a, 2) + b + poly(a, 2) & b`, which would be equivalent to `b * (a + a^2)`.
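Under the hood, all such a `poly(a, n)` term has to do is emit the columns `[a  a^2  ...  a^n]` from its `modelcols` method. A minimal plain-Julia sketch of that column generation (the `polycols` name is made up for illustration; it is not the docs' actual implementation):

```julia
# Build the columns that a poly(x, n) term expands to:
polycols(x::AbstractVector, n::Integer) = hcat((x .^ i for i in 1:n)...)

a = [1.0, 2.0, 3.0]
P = polycols(a, 3)   # three columns: a, a.^2, a.^3
```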