Case of non-full-rank regression in the first-step regression of FamaMacBeth
I found that in the first-step regression of the FamaMacBeth function (i.e. the FamaMacBeth.fit.single function on lines 2755-2761 of model.py), if the exog matrix is not full rank, the coefficients of the time-series regression for that group are all NaN. In Stata/SE 15.1, however, the coefficients of the collinear variables are set to 0, and the coefficients of the remaining variables are the same as in a normal regression. So I think the full-rank check on line 2757 may need to be reconsidered.
For example, I use the Grunfeld data set and create a dummy variable. For observations with year < 1945, I set the dummy variable to 0, and for observations with year >= 1945, I set the dummy variable randomly from 0 to 1.
The subset of the data is:
company  year      invest       mvalue      kstock  Dummy  Dummy*mvalue  Dummy*kstock  const
1        1935  317.600010  3078.500000    2.800000      0           0.0           0.0      1
2        1935  209.899990  1362.400000   53.799999      0           0.0           0.0      1
3        1935   33.099998  1170.600000   97.800003      0           0.0           0.0      1
4        1935   40.290001   417.500000   10.500000      0           0.0           0.0      1
5        1935   39.680000   157.700000  183.200000      0           0.0           0.0      1
6        1935   20.360001   197.000000    6.500000      0           0.0           0.0      1
7        1935   24.430000   138.000000  100.200000      0           0.0           0.0      1
8        1935   12.930000   191.500000    1.800000      0           0.0           0.0      1
9        1935   26.629999   290.600010  162.000000      0           0.0           0.0      1
10       1935    2.540000    70.910004    4.500000      0           0.0           0.0      1
With the linearmodels package, we get:
# est = FamaMacBeth(d['invest'], d[['const', 'mvalue', 'kstock', 'Dummy', 'Dummy*mvalue', 'Dummy*kstock']]).fit(cov_type='kernel', kernel='bartlett', bandwidth=5)
# print(est.all_params.iloc[[0], :])
year  const  mvalue  kstock  Dummy  Dummy*mvalue  Dummy*kstock
1935    NaN     NaN     NaN    NaN           NaN           NaN
But in Stata/SE 15.1, we will get:
# xtset company year
# asreg invest mvalue kstock dummy dummymvalue dummykstock , fmb newey(5) save(~/Downloads/coff_stata)
_b_mvalue   _b_kstock  _b_dummy  _b_dummymvalue  _b_dummykstock     _Cons      _R2    _adjR2  _TimeVar  _obs
.10249786  -.00199479         0               0               0  .3560334  .865262  .7574716      1935    10
The result from Stata is the same as the result directly using np.linalg.lstsq:
year     const    mvalue     kstock  Dummy  Dummy*mvalue  Dummy*kstock
1935  0.356033  0.102498  -0.001995    0.0           0.0           0.0
So I think the regression for those groups with non-full rank can still be estimated, instead of getting NaN.
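To make the comparison concrete, here is a minimal sketch of the np.linalg.lstsq check on the 1935 cross-section, with the numbers copied from the table above. The variable names are my own, and this only illustrates the minimum-norm behaviour of lstsq on a rank-deficient design; it is not what linearmodels does internally.

```python
import numpy as np

# 1935 cross-section copied from the table above (10 companies).
invest = np.array([317.600010, 209.899990, 33.099998, 40.290001, 39.680000,
                   20.360001, 24.430000, 12.930000, 26.629999, 2.540000])
mvalue = np.array([3078.500000, 1362.400000, 1170.600000, 417.500000, 157.700000,
                   197.000000, 138.000000, 191.500000, 290.600010, 70.910004])
kstock = np.array([2.800000, 53.799999, 97.800003, 10.500000, 183.200000,
                   6.500000, 100.200000, 1.800000, 162.000000, 4.500000])

n = invest.shape[0]
zeros = np.zeros(n)  # Dummy and both interactions are all 0 in 1935

# Column order: const, mvalue, kstock, Dummy, Dummy*mvalue, Dummy*kstock.
X = np.column_stack([np.ones(n), mvalue, kstock, zeros, zeros, zeros])

# lstsq returns the minimum-norm least-squares solution, so the
# all-zero columns receive a coefficient of 0 instead of NaN.
beta, _, rank, _ = np.linalg.lstsq(X, invest, rcond=None)

# Reference fit that simply leaves the all-zero columns out.
beta_reduced, *_ = np.linalg.lstsq(X[:, :3], invest, rcond=None)

print("rank of X:", rank)
print("full design:   ", beta)
print("reduced design:", beta_reduced)
```

The design has rank 3 even though it has 6 columns, and the first three coefficients agree with the regression that omits the zero columns.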
Stata's rule seems to be to keep one variable from each set of collinear variables for the regression and to set the coefficients of all the others to 0. I find that the priority may be as follows:
- Non-zero normal variables;
- The constant term;
- Variables whose values are all 0, whose coefficients are naturally set to 0.
Similar issue #176
The data set I use is available below. grunfeld.csv.zip
Stata and R both use a QR decomposition to decide whether one or more regressors are perfectly collinear, and then drop regressors until the model is estimable. So far I have preferred the Pythonic approach that explicit is better than implicit. That said, I have planned to add this as an option, since some models are hard to specify in a natural way without perfectly collinear regressors.
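As a rough illustration of the "drop regressors until the model is estimable" idea, the sketch below keeps a column only if it raises the rank of the columns kept so far, then reports 0 for the dropped ones, as in the Stata output earlier in the thread. It uses a greedy rank check from NumPy rather than the pivoted QR that Stata and R rely on, and `independent_columns` and `ols_with_dropped` are hypothetical helper names, not part of linearmodels.

```python
import numpy as np

def independent_columns(X):
    """Greedily keep each column that raises the rank of the columns kept so far."""
    keep = []
    rank = 0
    for j in range(X.shape[1]):
        new_rank = np.linalg.matrix_rank(X[:, keep + [j]])
        if new_rank > rank:
            keep.append(j)
            rank = new_rank
    return keep

def ols_with_dropped(X, y):
    """OLS on the independent columns; dropped columns get a coefficient of 0."""
    keep = independent_columns(X)
    coef, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
    beta = np.zeros(X.shape[1])
    beta[keep] = coef
    return beta, keep

# Made-up data: column 2 duplicates column 1 (scaled), column 3 is all zeros.
rng = np.random.default_rng(0)
x1 = rng.normal(size=12)
X = np.column_stack([np.ones(12), x1, 2.0 * x1, np.zeros(12)])
y = 1.0 + 3.0 * x1

beta, keep = ols_with_dropped(X, y)
print("kept columns:", keep)
print("coefficients:", beta)
```

Because columns are scanned left to right, earlier columns win ties, so the column order decides which member of a collinear set survives.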
Great! Adding it as an option is actually a good way. Thank you!
I have another question about the t-test and F-test in the FamaMacBeth regression. It seems that the degrees of freedom for these two tests should be T - 1 rather than N - k, where T is the number of distinct periods, N is the number of observations, and k is the number of coefficients.
In line 2810 of FamaMacBeth.fit
df_resid = wy.shape[0] - params.shape[0]
I think it should be
df_resid = all_params.shape[0] - 1
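To show the difference between the two counts, here is a small sketch with made-up per-period coefficients. It uses the plain sample standard error of the period-by-period estimates rather than the kernel covariance from the example above, and names other than `all_params` and `df_resid` are assumptions for illustration.

```python
import numpy as np

# Hypothetical panel: T periods, k coefficients, 10 observations per period.
rng = np.random.default_rng(0)
T, k, n_per_period = 20, 3, 10
all_params = rng.normal(loc=[0.5, 0.1, 0.0], scale=0.2, size=(T, k))

# Fama-MacBeth estimate: average of the per-period coefficients.
fm_params = all_params.mean(axis=0)

# Plain standard error of that average (no kernel/HAC correction here).
fm_se = all_params.std(axis=0, ddof=1) / np.sqrt(T)
t_stats = fm_params / fm_se

# The two candidate degrees of freedom discussed above.
nobs = T * n_per_period
df_pooled = nobs - k               # N - k
df_resid = all_params.shape[0] - 1  # T - 1

print("t-stats:", t_stats)
print("N - k =", df_pooled, " T - 1 =", df_resid)
```

The t-statistics are built from T period estimates, which is why T - 1 is the natural count here; with N much larger than T, the N - k count gives noticeably thinner tails.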