Reproducibility of the export pipeline

Open Iris7788 opened this issue 3 years ago • 2 comments

Context of the issue

I used TPOT to fit my dataset, but I get a different exported pipeline on each run.

Process to reproduce the issue

These are the steps I used to generate the exported pipeline. The shape of my dataset was (45, 478).

import sklearn.model_selection
from tpot import TPOTRegressor

# X and y are the features and target of the dataset (loaded elsewhere)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1, test_size=0.15)
M1 = TPOTRegressor(generations=10, population_size=40, verbosity=2,
                   random_state=42, n_jobs=-1, cv=5)
M1.fit(X_train, y_train)
M1.export('M1_pipeline.py')

Current result

  1. On the first run, the exported pipeline was a DecisionTreeRegressor:
Generation 1 - Current best internal CV score: -0.6631261058133652
Generation 2 - Current best internal CV score: -0.6631261058133652
Generation 3 - Current best internal CV score: -0.6442071896861652
Generation 4 - Current best internal CV score: -0.5726875496699182
Generation 5 - Current best internal CV score: -0.5726875496699182
Generation 6 - Current best internal CV score: -0.528473933017039
Generation 7 - Current best internal CV score: -0.528473933017039
Generation 8 - Current best internal CV score: -0.528473933017039
Generation 9 - Current best internal CV score: -0.528473933017039
Generation 10 - Current best internal CV score: -0.528473933017039

Best pipeline: DecisionTreeRegressor(Normalizer(input_matrix, norm=max), max_depth=3, min_samples_leaf=10, min_samples_split=9)
  2. On the second run, the exported pipeline was an ExtraTreesRegressor:
Generation 1 - Current best internal CV score: -0.6631261058133652
Generation 2 - Current best internal CV score: -0.6631261058133652
Generation 3 - Current best internal CV score: -0.6593793694494272
Generation 4 - Current best internal CV score: -0.6524528603774085
Generation 5 - Current best internal CV score: -0.636417747633282
Generation 6 - Current best internal CV score: -0.633586381252542
Generation 7 - Current best internal CV score: -0.633586381252542
Generation 8 - Current best internal CV score: -0.633586381252542
Generation 9 - Current best internal CV score: -0.633586381252542
Generation 10 - Current best internal CV score: -0.633586381252542

Best pipeline: ExtraTreesRegressor(LinearSVR(input_matrix, C=1.0, dual=True, epsilon=0.01, loss=epsilon_insensitive, tol=1e-05), bootstrap=False, max_features=0.3, min_samples_leaf=6, min_samples_split=13, n_estimators=100)

Expected result

I would like the exported pipeline to be repeatable and stable across runs. My environment is Python 3.7.12 with TPOT 0.11.7.

Thank you very much for the development and maintenance of TPOT.

Iris7788, Sep 18 '22 15:09

If you set n_jobs to 1, reproducibility is more likely. With parallel processes, exact reproducibility is hard to achieve because the order of execution has some randomness that cannot be controlled. It is something we are thinking about.
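
A minimal sketch of that suggestion, reusing the settings from the report with only n_jobs changed from -1 to 1. The names TPOT_SETTINGS, fit_once and same_pipeline are hypothetical helpers introduced here for illustration, and comparing the repr of fitted_pipeline_ is just one assumed way to check whether two runs picked the same pipeline. This makes a repeatable result more likely; it is not a guarantee.

```python
# Settings from the original report, with n_jobs changed from -1 to 1
# so that pipelines are evaluated sequentially.
TPOT_SETTINGS = dict(
    generations=10,
    population_size=40,
    verbosity=2,
    random_state=42,  # fixed seed, as in the report
    n_jobs=1,         # single process: no uncontrolled execution order
    cv=5,
)


def fit_once(X_train, y_train, settings=TPOT_SETTINGS):
    """Fit one TPOTRegressor and return its best pipeline as text.

    Hypothetical helper: TPOT is imported here so the settings above
    can be inspected even where TPOT is not installed.
    """
    from tpot import TPOTRegressor

    model = TPOTRegressor(**settings)
    model.fit(X_train, y_train)
    # fitted_pipeline_ is the scikit-learn pipeline TPOT selected;
    # its repr is used here as a simple fingerprint (an assumption).
    return repr(model.fitted_pipeline_)


def same_pipeline(X_train, y_train, runs=2):
    """Return True if every run selects the same pipeline (hypothetical helper)."""
    results = {fit_once(X_train, y_train) for _ in range(runs)}
    return len(results) == 1
```

Calling same_pipeline(X_train, y_train) on the training split from the report fits the model twice and reports whether both runs agree. Running on a single process is slower than n_jobs=-1, so it may be worth trying on a smaller generations value first.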

perib, Sep 29 '22 17:09

I have received your email and will check it as soon as possible~

Iris7788, Sep 29 '22 17:09