
Interventional samples give inconsistent results.

Open PMK1991 opened this issue 1 year ago • 7 comments

@bloebp I am using interventional samples to estimate the effect of an intervention. When I run the code below, I do not get consistent results for any particular instance. This affects reproducibility. How do I make this reproducible for the same data?

from dowhy import gcm

causal_model = gcm.InvertibleStructuralCausalModel(G)
gcm.auto.assign_causal_mechanisms(causal_model, df_cvd, gcm.auto.AssignmentQuality.BETTER)
gcm.fit(causal_model, df_cvd)
cf_samples = gcm.interventional_samples(causal_model, intervention_dict, observed_data=X_high_risk_tp)

PMK1991 avatar Mar 22 '25 08:03 PMK1991

For your information, I did already try setting a random seed.

PMK1991 avatar Mar 22 '25 08:03 PMK1991

Hi, can you try adding

from dowhy.gcm.util.general import set_random_seed

set_random_seed(0)

i.e.,

from dowhy import gcm
from dowhy.gcm.util.general import set_random_seed

set_random_seed(0)

causal_model = gcm.InvertibleStructuralCausalModel(G)
gcm.auto.assign_causal_mechanisms(causal_model, df_cvd, gcm.auto.AssignmentQuality.BETTER)
gcm.fit(causal_model, df_cvd)
cf_samples = gcm.interventional_samples(causal_model, intervention_dict, observed_data=X_high_risk_tp)
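The general principle behind this suggestion (independent of dowhy) is that stochastic sampling is only reproducible if the random state is fixed before each run. A minimal numpy sketch, where `draw_samples` is a hypothetical stand-in for a sampling routine like `gcm.interventional_samples`:

```python
import numpy as np

def draw_samples(seed=None):
    # Hypothetical stand-in for a stochastic routine such as
    # gcm.interventional_samples: draws from a fixed distribution.
    rng = np.random.default_rng(seed)
    return rng.normal(size=5)

# Without a fixed seed, repeated calls give different results.
unseeded_a = draw_samples()
unseeded_b = draw_samples()

# With the same seed set before each call, results are identical.
seeded_a = draw_samples(seed=0)
seeded_b = draw_samples(seed=0)
assert np.allclose(seeded_a, seeded_b)
```

Setting the seed once at the top of the script plays the same role: it pins down the random state for everything that follows.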

bloebp avatar Mar 24 '25 15:03 bloebp

Hello @bloebp

Related to this issue: after starting the app, running gcm.attribute_anomalies multiple times within the same app run (in one of my FastAPI endpoints), I see that setting the random seed reproduces the same contribution values. However, when I restart the app I get slightly different node contributions, differing by roughly 2-5%. The ranking of the top root causes stays roughly the same, but I am not sure whether this is expected behaviour or what can be done to make the results identical after an app restart. Maybe it is somehow connected to the permutation parameters I use when calling the function.

Here is the code:

gcm.util.general.set_random_seed(0)

contributions = gcm.attribute_anomalies(
    causal_model,
    target_node=target_node,
    anomaly_samples=samples,
    num_distribution_samples=500,
    shapley_config=gcm.shapley.ShapleyConfig(
        approximation_method=gcm.shapley.ShapleyApproximationMethods.PERMUTATION,
        num_permutations=50,
        n_jobs=-1,
    ),
)

nalexus avatar Mar 28 '25 14:03 nalexus

Hi,

I am wondering: when you restart the app, does it re-fit the causal_model before you set the seed? What could happen is that the model training brings in a stochastic factor here. Does this still happen when you put gcm.util.general.set_random_seed(0) as the very first line in the application/code?

bloebp avatar Mar 28 '25 21:03 bloebp

  1. Nope, it doesn't refit. I first fit the causal model with gcm.fit(causal_model, train_data) in a separate "train" API endpoint and save causal_model to a pickle file on disk. Then I read this pickle file in the "anomaly_attribute" API endpoint and execute gcm.attribute_anomalies there. The result of this function is reproducible only within the same app run;

  2. Yes, it still happens for some reason. I'll experiment with it further.

BTW, I've also just seen the same behaviour in a Jupyter notebook: I trained the model in one cell, saved it with pickle, then restarted the Jupyter kernel twice to run gcm.attribute_anomalies separately each time, and each time I got a different result. As above, the result is reproducible with set_random_seed only if the function call is executed multiple times within the same Jupyter session.

Though I am still not sure whether this is an issue only on my side or whether someone else has experienced it too. To be fair, if I remove the shapley_config, the results from different app runs are (as expected) even closer to each other, so this inconsistency doesn't affect practical application as of now.
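One way to narrow this down is to confirm that the pickle round trip itself is lossless, so any cross-session variance must come from the sampling stage rather than from saving/loading the model. A small library-agnostic sketch, where the dictionary `model` is a stand-in for a fitted causal model:

```python
import pickle
import numpy as np

# Stand-in for a fitted model: fixed, deterministic parameters.
model = {"coef": np.array([1.0, 2.0, 3.0])}

# A pickle round trip preserves the parameters exactly.
restored = pickle.loads(pickle.dumps(model))
assert np.array_equal(model["coef"], restored["coef"])

# Stochastic inference on the restored model is only reproducible
# if the random state is re-seeded in the new process/session.
rng_run1 = np.random.default_rng(0)
rng_run2 = np.random.default_rng(0)
samples_run1 = restored["coef"] + rng_run1.normal(size=3)
samples_run2 = restored["coef"] + rng_run2.normal(size=3)
assert np.allclose(samples_run1, samples_run2)
```

If the parameters survive the round trip but results still differ across restarts even with the seed set first, the remaining nondeterminism likely lives in a dependency rather than in the saved model.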

nalexus avatar Mar 31 '25 20:03 nalexus

@bloebp Anything on the last comment? I get a similar issue to the one @nalexus mentioned, with interventional samples:

BTW, I've also just seen the same behaviour in a Jupyter notebook: I trained the model in one cell, saved it with pickle, then restarted the Jupyter kernel twice to run gcm.attribute_anomalies separately each time, and each time I got a different result. As above, the result is reproducible with set_random_seed only if the function call is executed multiple times within the same Jupyter session.

PMK1991 avatar Apr 05 '25 08:04 PMK1991

Wondering if some dependency behavior has changed regarding setting a random seed.

@nalexus and @PMK1991 Can you provide a (minimal) reproducible code snippet using some generated data (just numpy random data)? I can take a closer look.

bloebp avatar Apr 07 '25 15:04 bloebp