`FixedCombinations` implementation may lose correlations
Problem Description
Under the hood, the FixedCombinations constraint concatenates the columns to produce unique identifiers (and drops the individual columns). This solves the constraint, but in doing so, it may lose correlations that exist between original columns.
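To make the mechanism concrete, here is a minimal sketch of a concatenation-style transform. It is not the library's actual implementation: the `combine_columns` helper, the `#` separator, and the combined column name are assumptions made for illustration, and the rows are taken from the example in this issue.

```python
# Illustrative only: a simplified stand-in for the constraint's transform,
# not the library's real code.
SEPARATOR = "#"  # hypothetical separator


def combine_columns(rows, columns, combined_name):
    """Replace `columns` in every row with one concatenated column."""
    transformed = []
    for row in rows:
        # Keep every column that is not part of the constraint.
        new_row = {key: value for key, value in row.items() if key not in columns}
        # Join the constrained columns into a single identifier.
        new_row[combined_name] = SEPARATOR.join(str(row[col]) for col in columns)
        transformed.append(new_row)
    return transformed


rows = [
    {"user_id": 1, "city": "San Francisco", "state": "CA", "tax_rate": 7.2},
    {"user_id": 2, "city": "Los Angeles", "state": "CA", "tax_rate": 7.5},
    {"user_id": 3, "city": "Seattle", "state": "WA", "tax_rate": 2.1},
]

transformed = combine_columns(rows, ["city", "state"], "city#state")
```

After the transform, the model only sees `city#state`; the standalone `state` value is no longer available as a feature.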
Expected behavior
Consider a table of users who belong to different cities and states in the US. There is a fixed combinations constraint between the city and state.
| User ID | City | State | Tax Rate |
|---|---|---|---|
| 1 | San Francisco | CA | 7.2% |
| 2 | Los Angeles | CA | 7.5% |
| 3 | Seattle | WA | 2.1% |
| 4 | Seattle | WA | 2.5% |
| 5 | Spokane | WA | 3.1% |
| ... | ... | ... | ... |
There is a correlation: CA corresponds to higher tax rates, regardless of which city in CA. The model should be able to capture this.
With FixedCombinations, the model never looks at CA as a common feature. Rather, it looks at SanFrancisco+CA and LosAngeles+CA as separate categories. (This may be good enough for certain cases, but IMO it's missing a key input: that both locations have something in common.)
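The difference can be seen by grouping the example rows two ways. This is only a sketch of the information available to a model, not of how any particular model learns; the `mean_by` helper and the `#` separator are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# (city, state, tax rate in percent), copied from the example table.
rows = [
    ("San Francisco", "CA", 7.2),
    ("Los Angeles", "CA", 7.5),
    ("Seattle", "WA", 2.1),
    ("Seattle", "WA", 2.5),
    ("Spokane", "WA", 3.1),
]


def mean_by(key_fn):
    """Average tax rate per group, where key_fn picks the grouping key."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row[2])
    return {key: round(mean(values), 2) for key, values in groups.items()}


# Grouping by state exposes the CA-vs-WA pattern directly.
by_state = mean_by(lambda row: row[1])

# Grouping by the concatenated key yields one unrelated category per pair.
by_combined = mean_by(lambda row: f"{row[0]}#{row[1]}")
```

`by_state` has two groups with clearly different averages, while `by_combined` has four categories, and nothing in it says that `San Francisco#CA` and `Los Angeles#CA` share a state.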
Possible Solutions
- Do not drop the `city` and `state` columns when modeling. The model may synthesize some unexpected output (e.g. `CA+SanFrancisco`, `Boston, NY`), but that can be fixed later through some logic.
- Create a new table to identify a `City, State` pair, e.g. `Location ID, City, State`. Then use that identifier (`Location ID`) as a primary key to reference in the `Users` table.
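Both ideas can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not a proposed implementation: `drop_invalid_pairs` and `build_location_table` are hypothetical helpers, and the column names are chosen to mirror the example table.

```python
rows = [
    {"user_id": 1, "city": "San Francisco", "state": "CA", "tax_rate": 7.2},
    {"user_id": 2, "city": "Los Angeles", "state": "CA", "tax_rate": 7.5},
    {"user_id": 3, "city": "Seattle", "state": "WA", "tax_rate": 2.1},
    {"user_id": 4, "city": "Seattle", "state": "WA", "tax_rate": 2.5},
    {"user_id": 5, "city": "Spokane", "state": "WA", "tax_rate": 3.1},
]


def drop_invalid_pairs(synthetic_rows, valid_pairs):
    """First idea: keep both columns, then discard pairs never seen in real data."""
    return [
        row for row in synthetic_rows
        if (row["city"], row["state"]) in valid_pairs
    ]


def build_location_table(rows):
    """Second idea: move each (city, state) pair into its own table."""
    location_ids = {}
    for row in rows:
        pair = (row["city"], row["state"])
        # Assign the next ID the first time a pair is seen.
        location_ids.setdefault(pair, len(location_ids) + 1)
    locations = [
        {"location_id": loc_id, "city": city, "state": state}
        for (city, state), loc_id in location_ids.items()
    ]
    users = [
        {
            "user_id": row["user_id"],
            "location_id": location_ids[(row["city"], row["state"])],
            "tax_rate": row["tax_rate"],
        }
        for row in rows
    ]
    return locations, users


valid_pairs = {(row["city"], row["state"]) for row in rows}
synthetic = [
    {"user_id": 101, "city": "Seattle", "state": "WA", "tax_rate": 2.3},
    {"user_id": 102, "city": "Seattle", "state": "CA", "tax_rate": 7.1},  # invalid pair
]
cleaned = drop_invalid_pairs(synthetic, valid_pairs)
locations, users = build_location_table(rows)
```

In the second sketch, `state` stays an ordinary column of the locations table, so the shared `CA` value is preserved while every pair remains valid by construction.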
Updated title and task since the constraint has been renamed to FixedCombinations. Issue still holds.