FeatureUnion + StackingEstimator causes input data to be duplicated for the rest of the model, increasing computational load and complexity.
TPOT uses FeatureUnion to combine the outputs of multiple operators. However, TPOT can place two StackingEstimators within a single FeatureUnion block, which causes it to pass two identical copies of the dataset into the next operator.
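For reference, StackingEstimator's transform prepends the wrapped estimator's predictions to the input features and passes the features through unchanged, so every StackingEstimator inside a FeatureUnion re-emits the full input matrix. A simplified sketch of that behavior for regressors (TPOT's actual implementation also handles predict_proba for classifiers):

import numpy as np

# Simplified sketch: the prediction column is prepended and X is passed through
# unchanged, so N StackingEstimators in a FeatureUnion yield N copies of X.
def stacking_transform(fitted_estimator, X):
    preds = np.reshape(fitted_estimator.predict(X), (-1, 1))
    return np.hstack((preds, X))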
Context of the issue
This increases computational load and complexity, especially for large datasets, with no benefit. The duplicated, perfectly correlated features may also hurt the performance of certain models.
Process to reproduce the issue
- User creates a TPOT instance.
- User calls TPOT's fit() function with training data.
- TPOT generates a pipeline like the one described below (see the example after this list).
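For example (the parameter values here are arbitrary; any run that evolves a FeatureUnion over two StackingEstimators will exhibit the problem):

from tpot import TPOTRegressor
import numpy as np

X = np.random.rand(100, 10)
y = np.random.rand(100)

tpot = TPOTRegressor(generations=5, population_size=20, random_state=42, verbosity=2)
tpot.fit(X, y)
print(tpot.fitted_pipeline_)  # may contain a FeatureUnion over two StackingEstimators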
To demonstrate the issue, below is code using a pipeline that was found by TPOT:
from sklearn.pipeline import FeatureUnion, Pipeline
from tpot.builtins import StackingEstimator, ZeroCount
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.decomposition import PCA
import numpy as np
p = Pipeline([
    ('featureunion', FeatureUnion(transformer_list=[
        # Two StackingEstimators in one FeatureUnion: each one re-emits all of X.
        ('stackingestimator-1', StackingEstimator(
            estimator=RandomForestRegressor(max_features=0.45, min_samples_leaf=9,
                                            min_samples_split=4))),
        ('stackingestimator-2', StackingEstimator(
            estimator=ExtraTreesRegressor(max_features=0.7500000000000001,
                                          min_samples_leaf=20,
                                          min_samples_split=18)))])),
    ('stackingestimator-1', StackingEstimator(
        estimator=SGDRegressor(alpha=0.01, eta0=1.0, fit_intercept=False,
                               l1_ratio=0.0, loss='epsilon_insensitive',
                               penalty='elasticnet', power_t=1.0))),
    ('pca', PCA(iterated_power=3, svd_solver='randomized')),
    ('stackingestimator-2', StackingEstimator(
        estimator=SGDRegressor(alpha=0.001, fit_intercept=False, l1_ratio=0.0,
                               loss='epsilon_insensitive',
                               penalty='elasticnet', power_t=1.0))),
    ('zerocount', ZeroCount()),
    ('sgdregressor', SGDRegressor(alpha=0.001, fit_intercept=False, l1_ratio=0.5,
                                  learning_rate='constant', loss='huber',
                                  penalty='elasticnet', power_t=0.1)),
])
X = np.random.rand(5, 10)
y = np.random.rand(5)
p.fit(X, y)

# A single recognizable row makes the duplication easy to spot in the output.
xx = np.arange(10).reshape(1, -1)
print("Input data:", xx)
print("After FeatureUnion:", p.steps[0][1].transform(xx))
Expected result
The data should not be copied over twice. Expected layout:
[Estimator 1 predictions, Estimator 2 predictions, X]
[0.44, 0.45, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Current result
[Estimator 1 predictions, X, Estimator 2 predictions, X]
[0.44, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0.45, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Possible fix
Here is my idea off the top of my head: limit FeatureUnion to selectors, transformers, and at most one classifier or regressor, so that only one copy of the data exists. When more than one classifier or regressor is used, replace the FeatureUnion with sklearn's StackingClassifier or StackingRegressor. These similarly allow multiple models to pass along their predictions, but they forward only one copy of the dataset.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html
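A rough sketch of the idea, using the first two estimators from the pipeline above (with passthrough=True, the final estimator sees each base model's predictions plus exactly one copy of X):

from sklearn.ensemble import StackingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import SGDRegressor

# Replaces FeatureUnion([StackingEstimator(rf), StackingEstimator(et)]):
# the final estimator receives both base models' predictions plus a single
# copy of the original features instead of two.
stack = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(max_features=0.45, min_samples_leaf=9,
                                     min_samples_split=4)),
        ('et', ExtraTreesRegressor(max_features=0.75, min_samples_leaf=20,
                                   min_samples_split=18)),
    ],
    final_estimator=SGDRegressor(),
    passthrough=True,  # pass one copy of X through to the final estimator
)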
I want to add another data-replication issue.
The FunctionTransformer module can also be configured to copy its input unchanged into the next layer. I have generated another pipeline where several FeatureUnions are stacked with multiple FunctionTransformers, which again leads to multiple copies of the data.
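A minimal illustration of that pattern (constructed by hand, but equivalent to what TPOT generates):

import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# FunctionTransformer with no function is the identity transform, so a union
# of two of them emits the input twice.
fu = FeatureUnion([
    ('copy-1', FunctionTransformer()),
    ('copy-2', FunctionTransformer()),
])
X = np.arange(10).reshape(1, -1)
print(fu.fit_transform(X))  # every input column appears twice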