Preserving column order for categoricals
https://groups.google.com/forum/#!topic/pystatsmodels/ZvsyZag3xaw
import patsy
import pandas as pd
cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
patsy.dmatrix("age + educ + married", data=cps)
@josef-pkt: I think this is the issue that you were trying to think of today that involved setting up column order.
For reference, the issue is that patsy does impose some constraints on column order: specifically, it groups together terms so that those which contain the same combination of continuous factors go together, and then within each group it puts lower-order interactions before higher-order interactions. The user wanted to do a type-I anova, and statsmodels only supported (don't know if this is still true?) type-I anovas where each column was entered from left to right. In particular, they wanted to have some categorical terms, then some continuous terms, then some more categorical terms, which violates the "grouping" constraint above.
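To make the grouping rule concrete, here is a minimal pure-Python sketch of the ordering described above. It is not patsy's actual code: the function name `order_terms` and the representation of terms as tuples of factor names are made up for illustration, and putting the group with no continuous factors first is an assumption about patsy's behaviour rather than something stated in this thread.

```python
def order_terms(terms, numeric):
    """Reorder terms according to the grouping rule described above.

    terms   -- list of tuples of factor names, in the order the user wrote them
    numeric -- set of factor names that are continuous
    (Hypothetical helper for illustration; not part of patsy.)
    """
    # Group terms by the set of continuous factors they contain,
    # remembering the order in which each group was first seen.
    buckets = {}
    for term in terms:
        key = frozenset(f for f in term if f in numeric)
        buckets.setdefault(key, []).append(term)

    keys = list(buckets)
    # Assumption: the group with no continuous factors always comes first.
    if frozenset() in buckets:
        keys.remove(frozenset())
        keys.insert(0, frozenset())

    ordered = []
    for key in keys:
        # Within a group, lower-order terms come before higher-order ones.
        # sorted() is stable, so ties keep the user's order.
        ordered.extend(sorted(buckets[key], key=len))
    return [":".join(term) for term in ordered]


# "a + x + b" with x continuous: the two categorical terms end up adjacent.
print(order_terms([("a",), ("x",), ("b",)], numeric={"x"}))
# "a:b + x + a + b": main effects are moved ahead of the interaction.
print(order_terms([("a", "b"), ("x",), ("a",), ("b",)], numeric={"x"}))
```

Under these assumptions, a formula written as categorical, continuous, categorical cannot keep that column order, which is the conflict with a left-to-right type-I anova.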
My current feeling (also expressed more thoughtfully in that thread) is that the best solution is just for statsmodels type-I anova code to support explicit specification of what order you want to enter the terms in. Doing this in patsy is hard because (a) I actually think the current behaviour is nicer for most use cases, so am reluctant to de-optimize user experience in general just to improve type-I anovas (which are almost never the right thing anyway, and rarely used outside of introductory classes), and (b) it's not clear that patsy can fix this entirely, since in general type-I anovas might want almost any ordering of columns, and patsy can't really support that without extreme contortions. Allowing y ~ a + x + b would not be too hard, but y ~ a:b + a:x + a + b:x would be very difficult and intrusive (and it's not even clear how it would work).
Yes, I think that's what I remembered.
I don't really know the details, but there are also other use cases where column order is relevant. One is in handling multicollinearity, where R does pivoting, and statsmodels also does a sequential check for perfect correlation.
For type 1 ANOVA:
I don't think it would be difficult to process the anova sequence in a different order, but I don't know how the user would have to specify terms. AFAIR, anova_lm could loop over the list of terms in a pretty arbitrary rearrangement, but there are no names for the terms, i.e.
term_sequence = ["age", "educ", "married", "educ:married"]
instead of
term_sequence = [0, 2, 1, 3]
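As a rough sketch of how a name-based sequence could be translated into the index-based one, the helper below maps term names to positions. `resolve_term_sequence` and the `model_terms` list are hypothetical names invented for this illustration; nothing here is an existing statsmodels function.

```python
def resolve_term_sequence(term_sequence, model_terms):
    """Translate term names into positions in the model's term list.

    term_sequence -- term names in the order the user wants them entered
    model_terms   -- term names in the order the model stores them
    (Hypothetical helper for illustration; not part of statsmodels.)
    """
    positions = {name: i for i, name in enumerate(model_terms)}
    unknown = [name for name in term_sequence if name not in positions]
    if unknown:
        raise ValueError("unknown terms: %r" % (unknown,))
    if len(set(term_sequence)) != len(term_sequence):
        raise ValueError("each term may be entered only once")
    return [positions[name] for name in term_sequence]


# Assumed order in which the model stores its terms.
model_terms = ["age", "educ", "married", "educ:married"]

# Enter "married" before "educ".
print(resolve_term_sequence(["age", "married", "educ", "educ:married"], model_terms))
```

With the assumed `model_terms`, the call above yields the index sequence [0, 2, 1, 3].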
I guess column order affects results in multicollinearity cases, but do users actually need fine-grained control over this? I guess if you find a case where they do then post a comment on this bug? :-)
Patsy does provide the ability to look up terms by name, so I guess you should just teach anova_lm to use those? Let me know if there's something on patsy's side that needs doing here...
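For example, a design matrix's `design_info.term_name_slices` maps each term name to its column slice, which is enough to enter terms in any order. The sketch below uses that lookup to compute sequential (type-I style) sums of squares with plain least squares. The `sequential_ss` helper and the synthetic data are made up for illustration, and this is not how `anova_lm` is implemented.

```python
import numpy as np
import patsy

# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(0)
n = 40
data = {
    "a": np.tile(["a1", "a2"], n // 2),
    "b": np.repeat(["b1", "b2"], n // 2),
    "x": rng.normal(size=n),
}
data["y"] = 1.0 + (data["a"] == "a2") + 0.5 * data["x"] + rng.normal(size=n)

y, X = patsy.dmatrices("y ~ a + x + b", data)
info = X.design_info
print(info.term_names)         # patsy's own term order
print(info.term_name_slices)   # term name -> column slice


def sequential_ss(y, X, info, term_sequence):
    """Sum of squares explained by each term, entered in the given order.

    Hypothetical helper: refits by least squares after adding each term's
    columns and records the drop in the residual sum of squares.
    """
    y = np.asarray(y).ravel()
    X = np.asarray(X)
    cols = []
    rss_prev = float(y @ y)  # residual sum of squares of the empty model
    result = {}
    for name in term_sequence:
        sl = info.term_name_slices[name]  # look the term up by name
        cols.extend(range(sl.start, sl.stop))
        beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        resid = y - X[:, cols] @ beta
        rss = float(resid @ resid)
        result[name] = rss_prev - rss
        rss_prev = rss
    return result


# Enter the terms as written in the formula, regardless of column order.
ss_formula_order = sequential_ss(y, X, info, ["Intercept", "a", "x", "b"])
ss_other_order = sequential_ss(y, X, info, ["Intercept", "b", "x", "a"])
print(ss_formula_order)
print(ss_other_order)
```

The per-term values depend on the order of entry, while their total does not, since both sequences end at the same full model.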
Nathaniel J. Smith -- http://vorpus.org