
Recover linear regression results

Open gcasamat opened this issue 5 years ago • 10 comments

I would like to know if it is possible to recover the estimates of a simple linear regression. I would have thought that fitting a LinearDML algorithm with (1) LinearRegression() as the model_y and model_t and (2) setting cv = 1, would reproduce the OLS estimates. Is this correct? Thanks

gcasamat avatar Feb 06 '21 17:02 gcasamat
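[Editor's note: the equivalence being asked about is the Frisch–Waugh–Lovell theorem: residualizing Y and T on the controls W and then regressing residuals on residuals recovers the OLS coefficient on T. A minimal NumPy sketch on synthetic data (not the poster's dataset, and not econml's actual code path):]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
W = rng.normal(size=(n, 3))
T = W @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)
Y = 2.0 * T + W @ np.array([1.0, 0.3, -0.4]) + rng.normal(size=n)

# Full OLS: regress Y on [T, W, 1] jointly.
X_full = np.column_stack([T, W, np.ones(n)])
beta_full, *_ = np.linalg.lstsq(X_full, Y, rcond=None)

# "LinearDML with linear first stages and no cross-fitting":
# residualize Y and T on [W, 1], then regress residuals on residuals.
Wc = np.column_stack([W, np.ones(n)])
res_y = Y - Wc @ np.linalg.lstsq(Wc, Y, rcond=None)[0]
res_t = T - Wc @ np.linalg.lstsq(Wc, T, rcond=None)[0]
theta = (res_t @ res_y) / (res_t @ res_t)

# By Frisch-Waugh-Lovell, both equal the same coefficient on T.
print(beta_full[0], theta)
```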

Yes, that is the expected result. Please let us know if you get something else.

vsyrgkanis avatar Feb 06 '21 17:02 vsyrgkanis

Thank you. I will check carefully and tell you if I find some discrepancy.

gcasamat avatar Feb 06 '21 18:02 gcasamat

I compared the LinearDML "strategy" described above with OLS results from StatsModels. The coefficient on my (binary) treatment variable (pro_rcs) is 4.53 with LinearDML whereas it is 0.1172 with sm.OLS

Here is the code:

est_linear = LinearDML(
                model_y = StatsModelsLinearRegression({'method' : 'qr'}),
                model_t = StatsModelsLinearRegression({'method' : 'qr'}),
                cv = 1,
                discrete_treatment = False,
                fit_cate_intercept = True,
                linear_first_stages = False,
                random_state = 123)
est_linear.fit(Y.values.ravel(), T.values.ravel(), X = None, W = data_for_reg[dum_varlist + cont_varlist + month_dum_list + ['cons']])
results = est_linear.const_marginal_effect_inference()
results.summary_frame(alpha = 0.05, value = 0, decimals = 3)
   point_estimate  stderr    zstat  pvalue  ci_lower  ci_upper
0            4.53   0.017  267.176     0.0     4.497     4.563
mod = sm.OLS(Y, pd.concat([T,data_for_reg[dum_varlist + cont_varlist + month_dum_list + ['cons']]], axis = 1))
res = mod.fit(method = 'qr')
print(res.summary())
                    OLS Regression Results                            
==============================================================================
Dep. Variable:               log_rate   R-squared:                       0.603
Model:                            OLS   Adj. R-squared:                  0.603
Method:                 Least Squares   F-statistic:                       nan
Date:                Sun, 07 Feb 2021   Prob (F-statistic):                nan
Time:                        16:42:41   Log-Likelihood:                -18385.
No. Observations:               38616   AIC:                         3.677e+04
Df Residuals:                   38615   BIC:                         3.678e+04
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
======================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
pro_rcs                                0.1172      0.006     19.414      0.000       0.105       0.129
PropertyType_Bed & Breakfast           0.2689      0.012     21.652      0.000       0.245       0.293
PropertyType_Bungalow                 -0.1216      0.019     -6.443      0.000      -0.159      -0.085
PropertyType_Chalet                   -0.1288      0.017     -7.500      0.000      -0.162      -0.095
PropertyType_Condominium              -0.0011      0.012     -0.089      0.929      -0.025       0.022
PropertyType_Guesthouse                0.2735      0.019     14.629      0.000       0.237       0.310
PropertyType_House                     0.0561      0.005     11.278      0.000       0.046       0.066
PropertyType_Other                    -0.0153      0.012     -1.326      0.185      -0.038       0.007
PropertyType_Vacation home             0.0689      0.020      3.425      0.001       0.029       0.108
PropertyType_Villa                     0.2174      0.008     26.974      0.000       0.202       0.233
ListingType_Private room              -0.2652      0.008    -33.330      0.000      -0.281      -0.250
ListingType_Shared room               -1.0105      0.042    -23.825      0.000      -1.094      -0.927
Superhost_Yes                          0.0112      0.007      1.627      0.104      -0.002       0.025
CancellationPolicy_Moderate           -0.0096      0.007     -1.325      0.185      -0.024       0.005
CancellationPolicy_Strict              0.0995      0.006     18.015      0.000       0.089       0.110
CancellationPolicy_Super strict 30     0.3391      0.028     12.124      0.000       0.284       0.394
CancellationPolicy_Super strict 60     0.2566      0.052      4.952      0.000       0.155       0.358
InstantbookEnabled_Yes                -0.0422      0.004     -9.585      0.000      -0.051      -0.034
listing_age                            0.0010      0.000      5.800      0.000       0.001       0.001
Bedrooms                               0.1387      0.003     40.296      0.000       0.132       0.145
Bathrooms                              0.2177      0.004     57.264      0.000       0.210       0.225
MaxGuests                              0.0241      0.002     13.538      0.000       0.021       0.028
NumberofPhotos                         0.0044      0.000     22.379      0.000       0.004       0.005
host_seniority                      5.476e-21   7.71e-22      7.099      0.000    3.96e-21    6.99e-21
ResponseTime                        5.849e-07   1.42e-07      4.130      0.000    3.07e-07    8.62e-07
ResponseRate                          -0.0004      0.000     -2.452      0.014      -0.001   -8.66e-05
OverallRating                          0.0942      0.005     18.196      0.000       0.084       0.104
MinimumStay                            0.0005      0.000      2.041      0.041    2.14e-05       0.001
month_2                               -0.0185      0.020     -0.940      0.347      -0.057       0.020
month_3                               -0.0159      0.018     -0.864      0.388      -0.052       0.020
month_4                                0.0655      0.015      4.277      0.000       0.036       0.096
month_5                                0.1287      0.015      8.629      0.000       0.099       0.158
month_6                                0.2573      0.015     17.689      0.000       0.229       0.286
month_7                                0.4868      0.014     33.991      0.000       0.459       0.515
month_8                                0.5283      0.014     36.958      0.000       0.500       0.556
month_9                                0.2185      0.015     14.780      0.000       0.189       0.247
month_10                               0.0588      0.016      3.772      0.000       0.028       0.089
month_11                              -0.0015      0.018     -0.085      0.932      -0.037       0.034
month_12                              -0.0092      0.018     -0.498      0.618      -0.045       0.027
cons                                   3.2495      0.033     98.208      0.000       3.185       3.314
==============================================================================
Omnibus:                     4627.155   Durbin-Watson:                   0.621
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            35107.433
Skew:                          -0.325   Prob(JB):                         0.00
Kurtosis:                       7.626   Cond. No.                     7.81e+19
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.81e+19. This might indicate that there are
strong multicollinearity or other numerical problems.

gcasamat avatar Feb 07 '21 15:02 gcasamat

I believe you might be having collinearity problems, in which case I believe the equivalence breaks. The two methods break ties among projection solutions in different ways (i.e., they regularize differently).

I also tried the experiment with some synthetic data and I am getting the same result, though I didn't use the method='qr' spec. I don't think that would be a problem.

vsyrgkanis avatar Feb 07 '21 17:02 vsyrgkanis
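[Editor's note: the tie-breaking point can be seen with a toy rank-deficient design. With perfectly collinear columns, every solver that returns *some* least-squares solution picks a different point on the same solution set; pinv returns the minimum-norm one. A NumPy sketch, illustrative only:]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
X = np.column_stack([x, x])          # perfectly collinear: rank 1, not 2
y = 3.0 * x + rng.normal(scale=0.1, size=n)

# Minimum-norm solution (what pinv returns): the coefficient ~3.0
# is split evenly across the two identical columns.
b_min = np.linalg.pinv(X) @ y

# Any b with b[0] + b[1] = b_min.sum() fits exactly as well -- e.g. put
# all the weight on the first column. Different solvers pick different
# points on this line, so coefficients are not comparable across them.
b_alt = np.array([b_min.sum(), 0.0])

print(b_min, b_alt)
print(np.allclose(X @ b_min, X @ b_alt))  # identical fitted values
```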

Oh, that method option might actually be the problem. I believe the equivalence might still hold if you use pinv (i.e., the minimum-norm solution), so I would not specify the method argument.

vsyrgkanis avatar Feb 07 '21 17:02 vsyrgkanis

I tried without specifying the method and obtained the same result from LinearDML:

   point_estimate  stderr    zstat  pvalue  ci_lower  ci_upper
0            4.53   0.017  267.176     0.0     4.497     4.563

I forgot to mention that I get the following warning after fit (don't know if it is helpful): "Co-variance matrix is underdetermined. Inference will be invalid!"

Yes, I have some multicollinearity issues. That's why I used the 'qr' method in StatsModels: it reproduces the output of the corresponding regressions in Stata. Without specifying this method, I get "crazy" results from StatsModels.

Do you have any suggestions for dealing with this multicollinearity? The only one I am aware of is to drop variables with large VIFs. Otherwise, in my experience, using the QR decomposition yields reasonable coefficient estimates, which is why I used that method.

gcasamat avatar Feb 07 '21 17:02 gcasamat

I would try using LinearDML with LassoCV (the default) as the residualizers, or ElasticNetCV. That should take care of the multicollinearity. Also, definitely use cv=3 or 5.

The issue in LinearDML is most probably that your residuals are exactly zero due to overfitting, and the result is nonsense.

vsyrgkanis avatar Feb 07 '21 18:02 vsyrgkanis
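[Editor's note: the overfitting failure mode and the cross-fitting fix can be mimicked outside econml. A scikit-learn sketch, illustrative only and not econml's exact internals: with nearly as many controls as observations, in-sample first-stage residuals are driven toward zero, so the final-stage ratio is a ratio of near-zero quantities; out-of-fold (cross-fitted) residuals do not have this problem.]

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n, p = 200, 150                      # many (mostly irrelevant) controls
W = rng.normal(size=(n, p))
T = W[:, 0] + rng.normal(size=n)
Y = 2.0 * T + W[:, 0] + rng.normal(size=n)

# In-sample OLS first stages with p close to n: residuals shrink toward
# zero from overfitting, so theta = <r_t, r_y> / <r_t, r_t> becomes a
# ratio of near-zero quantities -- numerically meaningless.
Wc = np.column_stack([W, np.ones(n)])
rt_in = T - Wc @ np.linalg.lstsq(Wc, T, rcond=None)[0]
ry_in = Y - Wc @ np.linalg.lstsq(Wc, Y, rcond=None)[0]

# Cross-fitted LassoCV residuals (the spirit of LinearDML with cv>=2):
# predictions are out-of-fold, so overfitting cannot zero them out.
rt = T - cross_val_predict(LassoCV(cv=3), W, T, cv=3)
ry = Y - cross_val_predict(LassoCV(cv=3), W, Y, cv=3)
theta = (rt @ ry) / (rt @ rt)

print(rt_in @ rt_in, rt @ rt)  # in-sample residuals are far smaller
print(theta)                   # roughly 2.0, the true effect
```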

I tried what you suggested, and LinearDML indeed gives results I believe in.

What I want to do is compare the outcome of a standard linear regression with the DML approach. As a consistency check, I wanted to be able to replicate the linear regression results with the DML approach (by adopting an appropriate parametrization). Following your comments, this does not seem to be possible unless I fix the multicollinearity issue in my data.

Many thanks for your help.

gcasamat avatar Feb 07 '21 19:02 gcasamat

Hi @gcasamat, just wanted to follow up on this thread. Were you able to recover the equation from LinearDML and compare it with the linear regression results?

Also @vsyrgkanis, I have a question. From LinearDML we can get the coefficients for our model and compare them with linear regression, but to write an equation like y = m*x + C we also need the intercept/constant value. How can I get it from EconML?

anurag-ae2024 avatar Dec 02 '24 16:12 anurag-ae2024

@anurag-ae2024 if you use LinearDML with linear first-stage models and no X, you're ultimately solving first-stage equations like

t = alpha * w + C_t + e1
y = theta * t + beta * w + C_y + e2
  = theta * (alpha * w + C_t + e1) + beta * w + C_y + e2
  = (theta * alpha + beta) * w + (theta * C_t + C_y) + theta * e1 + e2

and then solving for theta in the final stage. So to recover C_y, take the intercept from the first-stage y model, which will be theta * C_t + C_y, and subtract theta * C_t (where theta comes from the final-stage model and C_t is the intercept from the first-stage t model).

But it's not clear to me why you'd want to do this via DML except as a learning exercise; the power of DML is that you can use more flexible first-stage models when you expect the relationships between the controls and the treatment and outcome to be more complicated than just a linear relationship. If you think everything is linear and you don't want to regularize the coefficients, then there is no need to use DML.

kbattocchi avatar Dec 03 '24 20:12 kbattocchi
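[Editor's note: the intercept-recovery recipe above can be checked numerically. A NumPy sketch with linear first stages and no cross-fitting, on synthetic data with known C_t, C_y, and theta; illustrative only, not econml's API:]

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
w = rng.normal(size=(n, 2))
C_t, C_y, theta_true = 1.5, -0.7, 2.0
T = w @ np.array([0.5, -0.3]) + C_t + rng.normal(size=n)
Y = theta_true * T + w @ np.array([0.8, 0.2]) + C_y + rng.normal(size=n)

# First stages: regress T and Y on [w, 1].
Wc = np.column_stack([w, np.ones(n)])
coef_t, *_ = np.linalg.lstsq(Wc, T, rcond=None)
coef_y, *_ = np.linalg.lstsq(Wc, Y, rcond=None)
c_t_hat = coef_t[-1]           # first-stage t intercept: ~C_t
c_y_combined = coef_y[-1]      # first-stage y intercept: ~theta*C_t + C_y

# Final stage: regress y residuals on t residuals to get theta.
rt, ry = T - Wc @ coef_t, Y - Wc @ coef_y
theta = (rt @ ry) / (rt @ rt)

# Recover the structural intercept as described in the comment above.
C_y_hat = c_y_combined - theta * c_t_hat
print(theta, C_y_hat)  # close to 2.0 and -0.7
```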