Patsy loses DatetimeIndex freq information even if no NA values
(This is probably better described as a pandas bug, see https://github.com/pandas-dev/pandas/issues/21282, but maybe patsy wants to patch this too?)
Reproducible example:
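import pandas as pd
import patsy

# Annual (year-start) index. pd.date_range is used because the
# DatetimeIndex(start=..., end=..., freq=...) form was deprecated and later removed.
index = pd.date_range(start='1990', end='1994', freq='AS')
data = pd.Series([0, 1, 2, 3, 4], name='y', index=index)
print(data.index)  # freq is still set here

lhs, rhs = patsy.dmatrices('y ~ 1', data={'y': data}, return_type='dataframe')
print(lhs.index)  # freq has been lost here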
The first print statement yields:
DatetimeIndex(['1990-01-01', '1991-01-01', '1992-01-01', '1993-01-01',
               '1994-01-01'],
              dtype='datetime64[ns]', freq='AS-JAN')
Whereas the second yields:
DatetimeIndex(['1990-01-01', '1991-01-01', '1992-01-01', '1993-01-01',
               '1994-01-01'],
              dtype='datetime64[ns]', freq=None)
This is a consequence of https://github.com/pandas-dev/pandas/issues/21282 as it affects the following function in patsy/missing.py:
    def _handle_NA_drop(self, values, is_NAs, origins):
        total_mask = np.zeros(is_NAs[0].shape[0], dtype=bool)
        for is_NA in is_NAs:
            total_mask |= is_NA
        good_mask = ~total_mask
        # "..." to handle 1- versus 2-dim indexing
        return [v[good_mask, ...] for v in values]
When v is the DatetimeIndex, indexing it with the trailing ellipsis (v[good_mask, ...]) causes the index to lose its frequency information, even though good_mask keeps every row.
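One possible way to sidestep this on the patsy side is sketched below. This is only an illustration, not patsy's actual code or an agreed fix: the helper name apply_row_mask is hypothetical, and it assumes that returning the index unchanged is acceptable whenever the mask drops no rows. An Index is always one-dimensional, so the ellipsis is not needed for it in any case.

```python
import numpy as np
import pandas as pd


def apply_row_mask(value, good_mask):
    """Hypothetical helper: keep the rows of `value` selected by `good_mask`."""
    good_mask = np.asarray(good_mask, dtype=bool)
    if isinstance(value, pd.Index):
        # An Index is always 1-dimensional, so no trailing Ellipsis is needed.
        # If no row is dropped, return the index untouched so freq survives.
        if good_mask.all():
            return value
        return value[good_mask]
    # Other array-likes keep the original 1- versus 2-dim indexing.
    return value[good_mask, ...]


# Year-start index built from an offset object rather than a string alias.
index = pd.date_range(start="1990-01-01", periods=5, freq=pd.offsets.YearBegin())
kept_all = apply_row_mask(index, np.ones(5, dtype=bool))
print(kept_all.freq)
```

When rows really are dropped, the result generally cannot keep a regular frequency, so the sketch makes no attempt to restore one in that case.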