MAPIE
Possible data leakage in ENBPI time series tutorial
Is your documentation request related to a problem? Please describe.
While reading the MAPIE tutorial notebooks, I noticed a potential data leakage concern in this example: Tutorial for time series.
When model_params_fit_not_done=True, the same X_train, y_train are used for:
- Hyperparameter tuning via RandomizedSearchCV(TimeSeriesSplit)
- Fitting the final model
- Conformalization with TimeSeriesRegressor(method="enbpi")
Even though ENBPI uses out-of-bag residuals, wouldn’t this introduce leakage and lead to overly optimistic intervals, since the calibration residuals come from data that influenced model selection?
Describe the solution you'd like
The tutorial should recommend splitting off a separate chronological calibration window and provide a practical example of how to implement this.
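To make the request concrete, here is a rough sketch of the kind of split I have in mind. It is not the tutorial's code and does not call MAPIE at all: the data is synthetic, the names X_fit, X_calib, n_fit and n_calib are made up for illustration, and the interval step is a plain split-conformal quantile on absolute residuals used as a stand-in for EnbPI. The only point is that the calibration window comes strictly after the data used for tuning and fitting, and is never seen by RandomizedSearchCV.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)

# Synthetic, time-ordered data standing in for the tutorial's X_train, y_train.
n = 600
t = np.arange(n)
X = np.column_stack([np.sin(2 * np.pi * t / 24), np.cos(2 * np.pi * t / 24)])
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

# Chronological split (no shuffling): tune/fit block, then calibration, then test.
# The block sizes are arbitrary illustrative choices.
n_fit, n_calib = 400, 100
X_fit, y_fit = X[:n_fit], y[:n_fit]
X_calib, y_calib = X[n_fit:n_fit + n_calib], y[n_fit:n_fit + n_calib]
X_test, y_test = X[n_fit + n_calib:], y[n_fit + n_calib:]

# Hyperparameter search sees only the earliest block.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [20, 50], "max_depth": [2, 4, 8]},
    n_iter=4,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_fit, y_fit)
model = search.best_estimator_  # refit on all of X_fit with the best parameters

# Calibration residuals come from data that played no part in model selection.
# Stand-in for EnbPI: split-conformal quantile of absolute residuals.
alpha = 0.1
scores = np.abs(y_calib - model.predict(X_calib))
level = min(1.0, np.ceil((n_calib + 1) * (1 - alpha)) / n_calib)
q = np.quantile(scores, level, method="higher")

# Symmetric prediction intervals on the held-out test window.
y_pred = model.predict(X_test)
lower, upper = y_pred - q, y_pred + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"interval half-width: {q:.3f}, empirical coverage: {coverage:.3f}")
```

In the tutorial itself, the calibration window would presumably be what gets passed to the conformalization step instead of the full X_train, y_train; I have left that part out because I am not sure which call the maintainers would prefer for it.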