gcm.interventional_samples gives inconsistent results.
@bloebp I am using gcm.interventional_samples to estimate the effect of an intervention. When I run the code below, I do not get consistent results for any particular instance, which affects reproducibility. How do I make the results reproducible for the same data?
from dowhy import gcm

causal_model = gcm.InvertibleStructuralCausalModel(G)
gcm.auto.assign_causal_mechanisms(causal_model, df_cvd, gcm.auto.AssignmentQuality.BETTER)
gcm.fit(causal_model, df_cvd)
cf_samples = gcm.interventional_samples(causal_model, intervention_dict, observed_data=X_high_risk_tp)
For your information, I did already try setting a random seed.
Hi, can you try adding
from dowhy.gcm.util.general import set_random_seed
set_random_seed(0)
i.e.,
from dowhy import gcm
from dowhy.gcm.util.general import set_random_seed
set_random_seed(0)
causal_model = gcm.InvertibleStructuralCausalModel(G)
gcm.auto.assign_causal_mechanisms(causal_model, df_cvd, gcm.auto.AssignmentQuality.BETTER)
gcm.fit(causal_model, df_cvd)
cf_samples = gcm.interventional_samples(causal_model, intervention_dict, observed_data=X_high_risk_tp)
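For context, set_random_seed is (as far as I can tell from the source) just a thin wrapper that seeds the global numpy and Python RNGs, which is why it has to run before any stochastic step such as mechanism assignment, fitting, or sampling:

# Rough sketch of what dowhy's set_random_seed does internally (based on a
# reading of dowhy/gcm/util/general.py; double-check your installed version):
import random

import numpy as np


def set_random_seed_sketch(random_seed: int) -> None:
    np.random.seed(random_seed)  # seeds numpy's global RNG, used by the fitted mechanisms
    random.seed(random_seed)     # seeds Python's stdlib RNG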
Hello @bloebp
Related to this issue: after starting the app and running gcm.attribute_anomalies multiple times in one of the FastAPI endpoints within the same app run, I see that setting the random seed helps reproduce the same contribution values. However, when I restart the app, I get slightly different values for the node contributions, with a difference of about ±2-5%. The ranking of the top root causes roughly stays the same, but I am not sure whether this is expected behaviour and what can be done to make the values identical after an app restart. Maybe it is somehow connected to the permutation parameters I use when calling the function.
Here is the code:
gcm.util.general.set_random_seed(0)
contributions = gcm.attribute_anomalies(
    causal_model,
    target_node=target_node,
    anomaly_samples=samples,
    num_distribution_samples=500,
    shapley_config=gcm.shapley.ShapleyConfig(
        approximation_method=gcm.shapley.ShapleyApproximationMethods.PERMUTATION,
        num_permutations=50,
        n_jobs=-1,
    ),
)
Hi,
I am wondering: when you restart the app, does it re-fit the causal_model before you set the seed? Model training could be introducing a stochastic factor here. Does this still happen when you put gcm.util.general.set_random_seed(0) as the very first line in the application code?
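Concretely, I mean something like this (just a sketch; where exactly it lives in your app is up to you):

# Seed before anything else, so that unpickling the model and any subsequent
# gcm calls all start from the same RNG state.
from dowhy.gcm.util.general import set_random_seed

set_random_seed(0)  # very first statement, before any other dowhy/gcm code runs

# ... only afterwards define the FastAPI app, load the pickled model, etc.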
- Nope, it doesn't refit. I first fit the causal model with gcm.fit(causal_model, train_data) in a different API endpoint, "train", and save causal_model to a pickle file on disk. Then I read this pickle file in the API endpoint "anomaly_attribute" and execute gcm.attribute_anomalies inside it. The result of this function is reproducible only within the same app run (see the sketch after this list).
- Yes, it still happens for some reason. I'll try to experiment with it further.
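For reference, the flow is roughly this (a sketch; the endpoint wiring is omitted and file names are illustrative):

import pickle

from dowhy import gcm

# "train" endpoint: fit once and persist the fitted model
gcm.fit(causal_model, train_data)
with open("causal_model.pkl", "wb") as f:
    pickle.dump(causal_model, f)

# "anomaly_attribute" endpoint (possibly after an app restart): load and attribute
with open("causal_model.pkl", "rb") as f:
    causal_model = pickle.load(f)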
BTW, I've also just seen the same behaviour in a Jupyter notebook: I trained the model in one cell, saved it as a pickle, then restarted the Jupyter kernel twice to run gcm.attribute_anomalies separately each time, and each time I got a different result. As above, the result is reproducible with set_random_seed only if the function call is executed multiple times within the same Jupyter session.
Though I am still not sure whether this is an issue only on my side or whether someone else has also experienced it. To be fair, if I remove that shapley_config, the results from different app runs are (as expected) even closer to each other, so this inconsistency doesn't affect my practical application as of now.
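For clarity, removing it means calling the function like this; the library then falls back to its default Shapley approximation settings:

# Same call as above but without the explicit shapley_config.
contributions = gcm.attribute_anomalies(
    causal_model,
    target_node=target_node,
    anomaly_samples=samples,
    num_distribution_samples=500,
)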
@bloebp Any update on the last comment? I get a similar issue to the one @nalexus mentioned, but with interventional samples:
> BTW, I've also just seen the same behaviour in a Jupyter notebook: I trained the model in one cell, saved it as a pickle, then restarted the Jupyter kernel twice to run gcm.attribute_anomalies separately each time, and each time I got a different result. As above, the result is reproducible with set_random_seed only if the function call is executed multiple times within the same Jupyter session.
Wondering if some dependency behavior has changed regarding setting a random seed.
@nalexus and @PMK1991 Can you provide a (minimal) reproducible code snippet using some generated data (just numpy random data)? I can take a closer look.
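Something along these lines would already help (a sketch with purely synthetic data; the X -> Y graph and all values are arbitrary and only for illustration):

# Minimal repro sketch: fit, pickle, "restart", unpickle, attribute.
import pickle

import networkx as nx
import numpy as np
import pandas as pd
from dowhy import gcm
from dowhy.gcm.util.general import set_random_seed

set_random_seed(0)
X = np.random.normal(size=1000)
data = pd.DataFrame({"X": X, "Y": 2 * X + np.random.normal(size=1000)})

causal_model = gcm.InvertibleStructuralCausalModel(nx.DiGraph([("X", "Y")]))
gcm.auto.assign_causal_mechanisms(causal_model, data)
gcm.fit(causal_model, data)

with open("model.pkl", "wb") as f:
    pickle.dump(causal_model, f)

# --- restart the process/kernel here, then run only the part below ---

with open("model.pkl", "rb") as f:
    causal_model = pickle.load(f)

set_random_seed(0)
anomaly = pd.DataFrame({"X": [10.0], "Y": [25.0]})
print(gcm.attribute_anomalies(causal_model, target_node="Y", anomaly_samples=anomaly))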