
Case of non-full-rank regression in the first-step regression of FamaMacBeth

Open xuganchen opened this issue 6 years ago • 3 comments

I found that in the first-step regression of FamaMacBeth (i.e. the FamaMacBeth.fit.single function at lines 2755-2761 of model.py), if the exog matrix is not full rank, the coefficients of the time-series regression for that group are all NaN. In Stata/SE 15.1, by contrast, the coefficients of the collinear variables are set to 0, and the coefficients of the remaining variables are the same as in a normal regression. So I think the full-rank check on line 2757 may need to be reconsidered.

For example, I use the Grunfeld data set and create a dummy variable. For observations with year < 1945, I set the dummy variable to 0; for observations with year >= 1945, I set it randomly from 0 to 1.

The subset of the data is:

                  invest       mvalue      kstock  Dummy  Dummy*mvalue  Dummy*kstock  const
company year
1       1935  317.600010  3078.500000    2.800000      0           0.0           0.0      1
2       1935  209.899990  1362.400000   53.799999      0           0.0           0.0      1
3       1935   33.099998  1170.600000   97.800003      0           0.0           0.0      1
4       1935   40.290001   417.500000   10.500000      0           0.0           0.0      1
5       1935   39.680000   157.700000  183.200000      0           0.0           0.0      1
6       1935   20.360001   197.000000    6.500000      0           0.0           0.0      1
7       1935   24.430000   138.000000  100.200000      0           0.0           0.0      1
8       1935   12.930000   191.500000    1.800000      0           0.0           0.0      1
9       1935   26.629999   290.600010  162.000000      0           0.0           0.0      1
10      1935    2.540000    70.910004    4.500000      0           0.0           0.0      1

In the linearmodels package, we get:

# est = FamaMacBeth(d['invest'], d[['const', 'mvalue', 'kstock', 'Dummy', 'Dummy*mvalue', 'Dummy*kstock']]).fit(cov_type='kernel', kernel='bartlett', bandwidth=5)
# print(est.all_params.iloc[[0], :])
         const    mvalue    kstock       Dummy  Dummy*mvalue  Dummy*kstock
year
1935        NaN       NaN       NaN         NaN           NaN           NaN

But in Stata/SE 15.1, we get:

# xtset company year
# asreg invest mvalue kstock dummy dummymvalue dummykstock , fmb newey(5) save(~/Downloads/coff_stata)
_b_mvalue	_b_kstock	_b_dummy	_b_dummymvalue	_b_dummykstock	_Cons	_R2	_adjR2	_TimeVar	_obs
.10249786	-.00199479	0	0	0	.3560334	.865262	.7574716	1935	10

The result from Stata is the same as the result directly using np.linalg.lstsq:

         const    mvalue    kstock  Dummy  Dummy*mvalue  Dummy*kstock
year
1935  0.356033  0.102498 -0.001995    0.0           0.0           0.0

So I think the regression for groups whose exog matrix is not full rank can still be estimated, instead of returning NaN.
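As a rough illustration of the np.linalg.lstsq comparison, the sketch below rebuilds the 1935 cross-section from the table above and fits it twice: once with all six columns and once with only the three non-zero columns. The variable names are chosen for this example and are not taken from linearmodels. Because lstsq returns the minimum-norm least-squares solution, the all-zero columns should receive coefficients of (numerically) zero while the others match the reduced regression.

```python
import numpy as np

# 1935 cross-section copied from the table above.
invest = np.array([317.600010, 209.899990, 33.099998, 40.290001, 39.680000,
                   20.360001, 24.430000, 12.930000, 26.629999, 2.540000])
mvalue = np.array([3078.5, 1362.4, 1170.6, 417.5, 157.7,
                   197.0, 138.0, 191.5, 290.600010, 70.910004])
kstock = np.array([2.8, 53.799999, 97.800003, 10.5, 183.2,
                   6.5, 100.2, 1.8, 162.0, 4.5])
n = invest.shape[0]
zeros = np.zeros(n)  # Dummy and both interactions are all zero in 1935

# Column order: const, mvalue, kstock, Dummy, Dummy*mvalue, Dummy*kstock
exog = np.column_stack([np.ones(n), mvalue, kstock, zeros, zeros, zeros])

# lstsq returns the minimum-norm least-squares solution, so it does not
# fail when exog is rank deficient.
beta, _, rank, _ = np.linalg.lstsq(exog, invest, rcond=None)

# Reference fit: least squares on the three non-zero columns only.
beta_reduced, *_ = np.linalg.lstsq(exog[:, :3], invest, rcond=None)

print("rank of exog:", rank)
print("all columns :", np.round(beta, 6))
print("first three :", np.round(beta_reduced, 6))
```

Note that the minimum-norm solution only reproduces the "set to 0" behaviour here because the redundant columns are identically zero; for other kinds of collinearity it spreads the coefficient across the collinear columns instead.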

Stata's logic seems to be to keep one variable from each set of collinear variables and to set the coefficients of the others to 0. I find that the priority may be as follows:

  1. Non-zero normal variables;
  2. The constant term;
  3. Variables whose values are all 0, for which the coefficients are naturally set to 0.

Similar issue #176

The data set I use is available below. grunfeld.csv.zip

xuganchen avatar Mar 21 '20 13:03 xuganchen

Stata and R both use a QR decomposition to decide whether one or more regressors are perfectly collinear, and then drop some until the model is estimable. So far I have preferred the Python-esque approach of "explicit is better than implicit". That said, I have planned to add this as an option, since there are some models that are hard to specify in a natural way without perfectly collinear regressors.
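A minimal sketch of the "drop collinear regressors until the model is estimable" idea follows. It is not what Stata, R, or linearmodels actually do: instead of a QR decomposition it swaps in a simple greedy rank check with NumPy, keeping a column only if it increases the rank of the columns kept so far. The function name fit_dropping_collinear and the simulated data are hypothetical.

```python
import numpy as np


def fit_dropping_collinear(exog, endog):
    """OLS that keeps a column only if it adds to the rank (hypothetical helper)."""
    exog = np.asarray(exog, dtype=float)
    endog = np.asarray(endog, dtype=float)
    keep = []
    for j in range(exog.shape[1]):
        candidate = keep + [j]
        # Keep column j only if it is not a linear combination of kept columns.
        if np.linalg.matrix_rank(exog[:, candidate]) == len(candidate):
            keep.append(j)
    params = np.zeros(exog.shape[1])  # dropped columns are reported as 0
    params[keep] = np.linalg.lstsq(exog[:, keep], endog, rcond=None)[0]
    return params, keep


# Simulated data, for illustration only.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(20)
x2 = rng.standard_normal(20)
# Columns: const, x1, x2, 2*x1 (collinear with x1), all-zero column
exog = np.column_stack([np.ones(20), x1, x2, 2.0 * x1, np.zeros(20)])
endog = 1.0 + 0.5 * x1 - 0.25 * x2 + 0.1 * rng.standard_normal(20)

params, kept = fit_dropping_collinear(exog, endog)
print("kept columns:", kept)
print("params:", params)
```

Because columns are scanned in order, which of several collinear columns survives depends on the column ordering; a pivoted QR would generally make a different choice.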

bashtage avatar Mar 21 '20 22:03 bashtage

Great! Adding it as an option is actually a good way. Thank you!

xuganchen avatar Mar 22 '20 05:03 xuganchen

I have another question about the t-test and F-test in the FamaMacBeth regression. It seems that the degrees of freedom for these two tests should be T - 1 instead of N - k, where T is the number of distinct periods, N is the number of observations, and k is the number of coefficients.

In line 2810 of FamaMacBeth.fit:

df_resid = wy.shape[0] - params.shape[0]

I think it should be

df_resid = all_params.shape[0] - 1
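To illustrate the reasoning behind T - 1, the sketch below treats the Fama-MacBeth estimate as the mean of T per-period coefficient vectors and forms a plain t-statistic from their time-series standard deviation. It is an assumption-laden illustration, not the linearmodels implementation: the helper name fama_macbeth_summary is hypothetical, the numbers are made up, and no kernel (Newey-West) correction is applied.

```python
import numpy as np


def fama_macbeth_summary(all_params):
    """Average per-period coefficients and form t-statistics (hypothetical helper)."""
    all_params = np.asarray(all_params, dtype=float)
    n_periods = all_params.shape[0]          # T: one row per period
    mean = all_params.mean(axis=0)           # Fama-MacBeth point estimates
    # Standard error of a mean of T per-period estimates (no kernel correction).
    std_err = all_params.std(axis=0, ddof=1) / np.sqrt(n_periods)
    tstats = mean / std_err
    df_resid = n_periods - 1                 # T - 1, as proposed above
    return mean, std_err, tstats, df_resid


# Made-up per-period coefficients: T = 3 periods, k = 2 coefficients.
all_params = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
mean, std_err, tstats, df_resid = fama_macbeth_summary(all_params)
print("mean:", mean)
print("std_err:", std_err)
print("tstats:", tstats)
print("df_resid:", df_resid)
```

The sample that drives the standard error here has T rows, one per period, which is why T - 1 rather than N - k is the natural reference for the t distribution in this simplified setting.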

xuganchen avatar Mar 23 '20 08:03 xuganchen