
Added validation methods

Open · Thomas9292 opened this issue on Jan 26, 2021 · 5 comments

Even though there are several ways of validating results (of metalearners) described in the documentation, it's still complicated to estimate how trustworthy the results are.

My question is: would it be possible to perform some sort of hold-out validation for the outcome variable? Since, intuitively, meta-learners arrive at a treatment effect estimate by predicting the outcome under both treatment options, it should be possible to predict the outcome for a hold-out set. In my understanding, it should then be possible to apply traditional accuracy metrics to evaluate the model.

This will of course not give any insight into the accuracy of the predictions for unobserved outcomes, but it will allow for more confidence in the model if the observed predictions are at least somewhat accurate.

What do you think of this method? Would it be worth implementing this somehow?

Thomas9292 avatar Jan 26 '21 09:01 Thomas9292

Hi @Thomas9292, thanks for using CausalML. I think it definitely makes sense to do this kind of validation. To serve this purpose, in our example notebook for meta-learners (Part-B here) we did the validation and model performance comparison on a 20% hold-out dataset with different evaluation metrics (e.g., MSE, KL divergence, AUUC). Please take a look and let me know if you have any questions.
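
For reference, that hold-out comparison is driven by helpers in `causalml.dataset`; a minimal sketch (please check Part-B for the exact arguments) looks roughly like this:

```python
# Sketch only: get_synthetic_summary_holdout fits the built-in meta-learners on synthetic data,
# evaluates them on a hold-out split, and returns train/validation summary tables
# (Abs % Error of ATE, MSE, KL Divergence).
from causalml.dataset import simulate_nuisance_and_easy_treatment, get_synthetic_summary_holdout

train_summary, validation_summary = get_synthetic_summary_holdout(
    simulate_nuisance_and_easy_treatment,  # synthetic data generating function
    n=10000,                               # sample size per simulation
    valid_size=0.2,                        # 20% hold-out
)
print(validation_summary)
```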

ppstacy avatar Jan 26 '21 18:01 ppstacy

Hi all, thank you for sharing the CausalML package! I am a complete beginner to coding, trying to finish my thesis, so apologies in advance for a noob question. Regarding Part-B of the example notebook: how can I calculate these summary metrics (Abs % Error of ATE | MSE | KL Divergence) for non-synthetic data (e.g., the Hillstrom data)? I got lost at this point, unfortunately. Also, is there any way to calculate Qini values or plot gain/lift/Qini curves from UpliftTrees? An answer to that would help me tremendously!

baendigreydo avatar Feb 09 '21 09:02 baendigreydo

Hi @baendigreydo, to calculate those metrics for non-synthetic data we unfortunately don't have functions yet that you can use directly, but you can reference the code here to generate them yourself.

As shown in this notebook, you can calculate and plot gain/lift/Qini curves. Please let us know if you have any questions.
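
If it helps, a rough, self-contained sketch of that kind of plot (random placeholder data; swap in your hold-out outcome, treatment flag, and model scores) could look like this, using `plot_gain`/`plot_qini`/`qini_score` from `causalml.metrics`:

```python
# Sketch only: the column names and the random data below are placeholders, not requirements.
import numpy as np
import pandas as pd
from causalml.metrics import plot_gain, plot_qini, qini_score

rng = np.random.default_rng(42)
n = 1000
df_eval = pd.DataFrame({
    "y": rng.binomial(1, 0.3, n),          # observed outcome on the hold-out set
    "w": rng.binomial(1, 0.5, n),          # observed treatment indicator
    "uplift_model": rng.normal(0, 1, n),   # replace with your model's predicted uplift scores
})

plot_gain(df_eval, outcome_col="y", treatment_col="w")           # cumulative gain curves
plot_qini(df_eval, outcome_col="y", treatment_col="w")           # Qini curves
print(qini_score(df_eval, outcome_col="y", treatment_col="w"))   # Qini coefficient per model
```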

ppstacy avatar Feb 10 '21 21:02 ppstacy

Hi @baendigreydo, let me illustrate what I did, for reference. I am also really curious to hear from @ppstacy whether this is the way you think it could/should be done. The problem is that for non-synthetic data, only the outcome under the observed treatment is known. So I did some masking so that predictions are only compared against the outcome under the treatment that was actually observed.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from causalml.inference.meta import XGBTRegressor

# Create holdout set
X_train, X_test, t_train, t_test, y_train, y_test_actual = train_test_split(
    df_confounder, df_treatment, target, test_size=0.2
)

# Fit learner on training set
learner = XGBTRegressor()
learner.fit(X=X_train, treatment=t_train, y=y_train)

# Predict the TE for test, and request the components (predictions for t=1 and t=0)
te_test_preds, yhat_c, yhat_t = learner.predict(X=X_test, treatment=t_test, return_components=True)

# yhat_c / yhat_t are dicts keyed by treatment group (here the group is labeled 1).
# Mask the yhats to correspond with the observed treatment (we can only test accuracy for those)
yhat_c = yhat_c[1] * (1 - t_test)
yhat_t = yhat_t[1] * t_test
yhat_test = yhat_t + yhat_c

# Model prediction error
MSE = mean_squared_error(y_test_actual, yhat_test)
print(f"{'Model MSE:':25}{MSE}")

# Also plotted actuals vs. predictions in here, will spare you the code
```

Thomas9292 avatar Feb 11 '21 11:02 Thomas9292

Thanks for your answers. I was able to generate the plots for the meta-learners and uplift trees easily.

I also looked into the solution proposed by you, @Thomas9292. I think it is a usable workaround, but I see a logical fault when validating the meta-learner accuracy this way: to calculate summary tables like the ones shown in this notebook (Part B), the function "get_synthetic_preds_holdout" is used within "get_synthetic_summary_holdout". For these functions to work, the actual treatment effects "tau" are necessary, which are generated within "synthetic_data". As "tau" is not known in real-world data, one has to estimate it, just like you did, @Thomas9292 ("te_test_preds" for the test set and "te_train_preds" for the train set, to be inserted here). Based on these new "preds_dict_train[KEY_ACTUAL] = te_train_preds" and "preds_dict_valid[KEY_ACTUAL] = te_test_preds", the summary table can be calculated. The problem, however, is that these "taus" are then treated as the ground truth, so all other models are compared against the model that generated the "taus", which is effectively a second-order comparison. See the sketch below for what the metrics end up measuring in that case.
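
To make this concrete, here is a rough sketch of my own (not library code; names are made up) of what "Abs % Error of ATE", "MSE", and "KL Divergence" measure when an estimated tau from a reference learner stands in for the unknown true tau:

```python
import numpy as np
from scipy.stats import entropy

def second_order_summary(tau_reference, tau_candidate, bins=100):
    """Abs % Error of ATE, MSE, and KL divergence of a candidate learner's CATE estimates,
    measured against another learner's estimates standing in for the unknown true tau."""
    ate_ref = tau_reference.mean()
    ate_cand = tau_candidate.mean()
    abs_pct_error_ate = np.abs((ate_cand - ate_ref) / ate_ref)

    mse = np.mean((tau_candidate - tau_reference) ** 2)

    # KL divergence between the two (binned) CATE distributions, with a small constant
    # added to avoid empty bins
    edges = np.histogram_bin_edges(np.concatenate([tau_reference, tau_candidate]), bins=bins)
    p_ref, _ = np.histogram(tau_reference, bins=edges)
    p_cand, _ = np.histogram(tau_candidate, bins=edges)
    kl = entropy(p_ref + 1e-12, p_cand + 1e-12)

    return abs_pct_error_ate, mse, kl
```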

I would be happy if someone could confirm or, even better, refute this, as I am quite puzzled right now.

@Thomas9292 was MSE the only metric you used for model selection? What about Gain/Qini? Would also love to see your code for the plots as I still have a lot to learn.

@ppstacy When I did the above calculation I noticed the following: I suspect that the function "regression_metrics" is not yet complete, since no return value is specified. I think it should look like this:

```python
# Proposed change to regression_metrics: collect the computed metrics and return them
# instead of only logging them.
reg_metrics = []

for name, func in metrics.items():
    if w is not None:
        assert y.shape[0] == w.shape[0]
        if w.dtype != bool:
            w = w == 1
        logger.info('{:>8s}   (Control): {:10.4f}'.format(name, func(y[~w], p[~w])))
        logger.info('{:>8s} (Treatment): {:10.4f}'.format(name, func(y[w], p[w])))
    else:
        logger.info('{:>8s}: {:10.4f}'.format(name, func(y, p)))
    reg_metrics.append({name: func(y, p)})

return np.array(reg_metrics)
```
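
With that change, calling the function would return the metrics instead of only logging them. A purely illustrative call, reusing the hold-out variables from @Thomas9292's snippet above:

```python
# Illustrative only: assumes the modified regression_metrics above has been applied, and
# reuses y_test_actual, yhat_test, and t_test from the earlier snippet in this thread.
from causalml.metrics import regression_metrics

reg_metrics = regression_metrics(y=y_test_actual, p=yhat_test, w=t_test)
print(reg_metrics)
```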

baendigreydo avatar Feb 15 '21 13:02 baendigreydo