Cannot reproduce pipeline results with sklearn pipeline
For my data, I got the best pipeline by running TPOT training using the following parameters:
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5,
                      population_size=100,
                      verbosity=2,
                      n_jobs=-1, random_state=1)
The best pipeline was given as:
Best pipeline: MLPClassifier(GaussianNB(Binarizer(input_matrix, threshold=0.0)), alpha=0.001, learning_rate_init=0.001)
TPOTClassifier(generations=5, n_jobs=-1, random_state=1, verbosity=2)
The best CV score I achieved was 0.822
Based on the pipeline reported above, I built an ensemble with sklearn as follows:
base_model = GaussianNB()
meta_model = MLPClassifier(random_state=1,
                           learning_rate_init=0.001,
                           alpha=0.001)
ensemble = StackingClassifier(estimators=[('base_model', base_model),
                                          ('meta_model', meta_model)],
                              final_estimator=meta_model,
                              n_jobs=-1)
The score I get from this is 0.79.
Can you tell me why I am getting different scores when all my parameters are the same?
The manual pipeline is not exactly identical to the TPOT output. It is missing the Binarizer step.
Also, TPOT wraps intermediate classifiers in a StackingEstimator, which passes its inputs through in addition to its predictions, so the next step sees both (https://github.com/EpistasisLab/tpot/blob/master/tpot/builtins/stacking_estimator.py).
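To make the pass-through idea concrete, here is a minimal sketch using only scikit-learn. `PassThroughStacker` is a hypothetical stand-in written for illustration, not TPOT's actual code, and the exact columns TPOT appends may differ by version. The data is synthetic.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB


class PassThroughStacker(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for TPOT's StackingEstimator (not TPOT's code)."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Fit a fresh copy of the wrapped classifier on the full training data.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        X = np.asarray(X)
        # Predicted label as one column, class probabilities as further columns.
        pred = self.estimator_.predict(X).reshape(-1, 1)
        proba = self.estimator_.predict_proba(X)
        # The original features are kept, so the next step sees both.
        return np.hstack((pred, proba, X))


# Synthetic placeholder data: 6 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
stacked = PassThroughStacker(GaussianNB()).fit(X, y).transform(X)
print(X.shape, "->", stacked.shape)  # (200, 6) -> (200, 9)
```

The three extra columns are the predicted label and the two class probabilities. A plain sklearn StackingClassifier with default settings would hand its final estimator only the predictions, not the original six features.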
Going off memory, I believe this is what the TPOT output would be equivalent to:
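from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer
from tpot.builtins import StackingEstimator

step1 = Binarizer(threshold=0.0)
# StackingEstimator appends GaussianNB's predictions to the features it receives
base_model = StackingEstimator(GaussianNB())
meta_model = MLPClassifier(random_state=1,
                           learning_rate_init=0.001,
                           alpha=0.001)
# Pipeline takes a list of (name, step) pairs; the last step is the final estimator
ensemble = Pipeline(steps=[('step1', step1),
                           ('base_model', base_model),
                           ('meta_model', meta_model)])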
The Binarizer transforms the data first, so the flow is: data -> Binarizer -> transformed data -> GaussianNB -> transformed data + predictions -> MLPClassifier
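If you would rather stay entirely inside scikit-learn, a rough approximation of that flow is a Binarizer followed by a StackingClassifier with passthrough=True. Treat this as a sketch, not an exact reproduction: StackingClassifier trains its final estimator on cross-validated predictions, whereas TPOT's StackingEstimator (as I understand it) fits the base model once and uses its in-sample predictions, so the scores can still differ somewhat. The data here is synthetic, and cv=5 with accuracy scoring is my assumption about how your TPOT score was computed (I believe those are TPOT's defaults).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# Synthetic placeholder data; substitute your own X and y.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

stack = StackingClassifier(
    # Only the base model goes in `estimators`; the MLP is the final estimator.
    estimators=[("gnb", GaussianNB())],
    final_estimator=MLPClassifier(alpha=0.001,
                                  learning_rate_init=0.001,
                                  random_state=1),
    # Forward the (binarized) features to the MLP along with the predictions.
    passthrough=True,
)
pipe = Pipeline(steps=[("binarizer", Binarizer(threshold=0.0)),
                       ("stack", stack)])

# Assumed evaluation setup: 5-fold CV on accuracy.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Note that the MLP appears only as final_estimator here. Listing it among the base estimators as well, as in the question, trains a second MLP as a base model, which is a different ensemble from the one TPOT reported.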