
Question - Cross-validation and Impact of overfitting

Open Jacques-Peeters opened this issue 3 years ago • 4 comments

👋

Is your feature request related to a problem? Please describe. I'm currently using the CausalML package.

I have difficulty understanding the impact of over-fitting on the ATE. Diving into the package code, I see that we predict on the same data we train on. https://github.com/uber/causalml/pull/511

XGBoost is a popular regressor that is widely showcased in the documentation; however, it is known to overfit easily (because of boosting).

What is the impact of over-fitting on the ATE computation? I can't find any literature online about this (it is not related to CausalML 😉).

While writing this issue, I came across the PR "Ate pretrain 0506", which is a first step towards computing the ATE out-of-fold (i.e. with cross-validation).

Describe the solution you'd like Documentation pointing to relevant literature, and the ability to run estimate_ate() out-of-fold.

Describe alternatives you've considered I started implementing my own out-of-fold ATE with cross-validation on my side:

        # Requires: from sklearn.model_selection import KFold; import numpy as np
        kf = KFold(n_splits=5, random_state=1, shuffle=True)
        te = []
        for train_index, valid_index in kf.split(final):
            # Can't use a named is_train argument in get_x_y_treatment because of different naming
            x_train, y_train, treatment_train = self.get_x_y_treatment(
                final.iloc[train_index], True
            )
            x_valid, y_valid, treatment_valid = self.get_x_y_treatment(
                final.iloc[valid_index], False
            )
            self.my_model.fit(x_train, treatment_train, y_train)
            te_valid = self.my_model.predict(x_valid, treatment_valid, y_valid)
            te.append(te_valid)

        te = np.concatenate(te).mean()
        lb, ub = np.nan, np.nan
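
For reference, a self-contained sketch of the same idea against the public causalml API might look like the following (assuming a BaseTRegressor T-learner with an XGBoost base learner; the function name out_of_fold_ate and the numpy inputs are illustrative, not part of the package):

# Sketch only: out-of-fold ATE by pooling held-out CATE predictions across folds.
import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
from causalml.inference.meta import BaseTRegressor


def out_of_fold_ate(X, treatment, y, n_splits=5, seed=1):
    """Fit on each training fold, predict on the held-out fold, and average."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    te_oof = []
    for train_idx, valid_idx in kf.split(X):
        learner = BaseTRegressor(learner=XGBRegressor())
        learner.fit(X[train_idx], treatment[train_idx], y[train_idx])
        te_oof.append(learner.predict(X[valid_idx]))
    return np.concatenate(te_oof).mean()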

The simplest approach would probably be to add such a method directly to the package (to keep the confidence intervals and so on), with something like this:

class BaseTLearner(...):

    def estimate_outoffold_ate(
        self,
        X_train,
        treatment_train,
        y_train,
        X_valid,
        treatment_valid,
        y_valid,
        p=None,
        bootstrap_ci=False,
        n_bootstraps=1000,
        bootstrap_size=10000,
    ):
        """Estimate the Average Treatment Effect (ATE).

        Args:
            X (np.matrix or np.array or pd.Dataframe): a feature matrix
            treatment (np.array or pd.Series): a treatment vector
            y (np.array or pd.Series): an outcome vector
            bootstrap_ci (bool): whether to return confidence intervals
            n_bootstraps (int): number of bootstrap iterations
            bootstrap_size (int): number of samples per bootstrap
        Returns:
            The mean and confidence interval (LB, UB) of the ATE estimate.
        """
        X_train, treatment_train, y_train = convert_pd_to_np(
            X_train, treatment_train, y_train
        )
        self.fit(X_train, treatment_train, y_train, return_components=True)

        X_valid, treatment_valid, y_valid = convert_pd_to_np(
            X_valid, treatment_valid, y_valid
        )
        te, yhat_cs, yhat_ts = self.predict(
            X_valid, treatment_valid, y_valid, return_components=True
        )

        ate = np.zeros(self.t_groups.shape[0])
        ate_lb = np.zeros(self.t_groups.shape[0])
        ate_ub = np.zeros(self.t_groups.shape[0])

        for i, group in enumerate(self.t_groups):
            _ate = te[:, i].mean()

            mask = (treatment_valid == group) | (treatment_valid == self.control_name)
            treatment_filt = treatment_valid[mask]
            y_filt = y_valid[mask]
            w = (treatment_filt == group).astype(int)
            prob_treatment = float(sum(w)) / w.shape[0]

            yhat_c = yhat_cs[group][mask]
            yhat_t = yhat_ts[group][mask]

            se = np.sqrt(
                (
                    (y_filt[w == 0] - yhat_c[w == 0]).var() / (1 - prob_treatment)
                    + (y_filt[w == 1] - yhat_t[w == 1]).var() / prob_treatment
                    + (yhat_t - yhat_c).var()
                )
                / y_filt.shape[0]
            )

            _ate_lb = _ate - se * norm.ppf(1 - self.ate_alpha / 2)
            _ate_ub = _ate + se * norm.ppf(1 - self.ate_alpha / 2)

            ate[i] = _ate
            ate_lb[i] = _ate_lb
            ate_ub[i] = _ate_ub

        if not bootstrap_ci:
            return ate, ate_lb, ate_ub
        else:
            t_groups_global = self.t_groups
            _classes_global = self._classes
            models_c_global = deepcopy(self.models_c)
            models_t_global = deepcopy(self.models_t)

            logger.info("Bootstrap Confidence Intervals for ATE")
            ate_bootstraps = np.zeros(shape=(self.t_groups.shape[0], n_bootstraps))

            for n in tqdm(range(n_bootstraps)):
                ate_b = self.bootstrap(
                    X_valid, treatment_valid, y_valid, size=bootstrap_size
                )
                ate_bootstraps[:, n] = ate_b.mean()

            ate_lower = np.percentile(
                ate_bootstraps, (self.ate_alpha / 2) * 100, axis=1
            )
            ate_upper = np.percentile(
                ate_bootstraps, (1 - self.ate_alpha / 2) * 100, axis=1
            )

            # set member variables back to global (currently last bootstrapped outcome)
            self.t_groups = t_groups_global
            self._classes = _classes_global
            self.models_c = deepcopy(models_c_global)
            self.models_t = deepcopy(models_t_global)

            return ate, ate_lower, ate_upper
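
If such a method were added to BaseTLearner (and therefore inherited by BaseTRegressor), calling it would look roughly like this (a sketch; estimate_outoffold_ate is the hypothetical method above and X, treatment, y are assumed to be already loaded):

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from causalml.inference.meta import BaseTRegressor

# Hold out a validation split; the learner is fit on the training split only.
X_train, X_valid, treatment_train, treatment_valid, y_train, y_valid = train_test_split(
    X, treatment, y, test_size=0.2, random_state=1
)
learner = BaseTRegressor(learner=XGBRegressor())
ate, ate_lb, ate_ub = learner.estimate_outoffold_ate(
    X_train, treatment_train, y_train, X_valid, treatment_valid, y_valid
)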

Thoughts on this? Let me know if you think it is a good idea for me to implement this and then open a PR 😅

Jacques-Peeters avatar Jun 07 '22 14:06 Jacques-Peeters

I also implemented an XGBoost regressor with a default train/valid split and early stopping:

"""Custom models."""
from typing import Any

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor


class XGBRegressorEarlyStopping(XGBRegressor):
    """XGBoost with train/valid split and early stopping.

    Required wrapper to make it work seamlessly with CaulsalML.

    Args:
        XGBRegressor (_type_): XGBoost model.
    """

    def fit(
        self: "XGBRegressorEarlyStopping", x: Any, y: Any, *args: Any, **kwargs: Any
    ) -> None:  # noqa
        """Fit XGBRegressor with train/valid split and early stopping."""
        x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.20)

        super().fit(
            x_train,
            y_train,
            eval_set=[(x_valid, y_valid)],
            verbose=False,
            *args,
            **kwargs
        )

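Plugging the wrapper into a T-learner would then look something like this (a sketch; X, treatment, y are assumed to be already loaded, and the number of early-stopping rounds is set on the constructor as in xgboost >= 1.6):

from causalml.inference.meta import BaseTRegressor

# The wrapper handles the train/valid split internally, so the T-learner
# can call fit(X, y) as usual on each outcome model.
t_learner = BaseTRegressor(
    learner=XGBRegressorEarlyStopping(n_estimators=1000, early_stopping_rounds=50)
)
ate, ate_lb, ate_ub = t_learner.estimate_ate(X, treatment, y)
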
Benchmarked on a proprietary dataset, with a T-learner and two different categorical encodings (one-hot encoding and out-of-fold target encoding), here are the results:

2022-06-08 10:25:44 INFO: Estimate ATE for: TargetEncoding(my_model=BaseTRegressor)
2022-06-08 10:26:44 INFO: Average Treatment Effect (out-of-fold + early_stopping): 0.20 (nan, nan)
2022-06-08 10:26:44 INFO: Estimate ATE for: TargetEncoding(my_model=BaseTRegressor)
2022-06-08 10:27:11 INFO: Average Treatment Effect (out-of-fold + early_stopping): 0.23 (nan, nan)
2022-06-08 10:27:11 INFO: Estimate ATE for: OneHotEncoding(my_model=BaseTRegressor)
2022-06-08 10:28:08 INFO: Average Treatment Effect (out-of-fold + early_stopping): 0.21 (nan, nan)
2022-06-08 10:28:08 INFO: Estimate ATE for: OneHotEncoding(my_model=BaseTRegressor)
2022-06-08 10:28:28 INFO: Average Treatment Effect (out-of-fold + early_stopping): 0.22 (nan, nan)

There are differences, but are they significant?

Jacques-Peeters avatar Jun 08 '22 10:06 Jacques-Peeters

This is a great question. I'm sure that the effect of overfitting on CATE computation depends on the specific meta-learner that you are interested in. For example, the X-learner paper discusses the issue briefly in the context of the T-learner.

In your issue, you're talking about CATE computation, but your code suggests that you're actually talking about ATE computation. Is that correct?

What is the main use case for the proposed out-of-sample ATE prediction? Recall that the way in which models are currently evaluated is by making out-of-sample CATE predictions and using a metric such as AUUC to evaluate the models. This implicitly takes into account overfitting since we would expect overfitted meta-learners to yield poorer CATE predictions.
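
For concreteness, that out-of-sample evaluation could look roughly like this (a sketch using causalml.metrics.auuc_score with a binary 0/1 treatment; X, treatment, y and the column names are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from causalml.inference.meta import BaseTRegressor
from causalml.metrics import auuc_score

# Fit on a training split, predict CATE on the held-out split, and score the
# resulting ranking with AUUC. Overfitted learners should rank the holdout poorly.
X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(
    X, treatment, y, test_size=0.3, random_state=1
)
learner = BaseTRegressor(learner=XGBRegressor())
learner.fit(X_train, w_train, y_train)
cate_test = learner.predict(X_test).flatten()

df_eval = pd.DataFrame({"y": y_test, "w": w_test, "t_learner": cate_test})
print(auuc_score(df_eval, outcome_col="y", treatment_col="w", normalize=True))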

t-tte avatar Oct 04 '22 00:10 t-tte

👋 @t-tte , thank you for your input. Back from holidays I can answer now :)

you're talking about CATE computation, but your code suggests that you're actually talking about ATE computation

=> Sorry, I meant ATE (I updated my comment), but overfitting problems apply similarly to both ATE and CATE, or am I missing something?

I'm a bit afraid of papers cherry-picking the right dataset, so I'm not convinced. The X-learner seems to add a form of regularization which outperforms the T-learner. Why not simply improve the T-learner by computing an out-of-fold ATE (i.e. with cross-validation)? I feel like we are comparing a linear regression to a ridge regression: ridge is indeed an improvement, but the correct way to assess predictive performance is, in both cases, to rely on cross-validation.

Recall that the way in which models are currently evaluated is by making out-of-sample

=> Looking at the code, I see that we fit_predict on the same dataset and that cross-validation is not used. https://github.com/uber/causalml/blob/master/causalml/inference/meta/tlearner.py#L235

What is the main use case for the proposed out-of-sample ATE prediction

=> Having more accurate ATE and CATE estimates.

Jacques-Peeters avatar Oct 10 '22 16:10 Jacques-Peeters

Could you please give an example of the kind of use case that you're solving? "More accurate ATE" can mean a lot of different things. For example, are you dealing with experimental or observational data?

t-tte avatar Oct 14 '22 19:10 t-tte