auto_ml FutureWarning + KeyError

Hey there,

I was giving auto_ml a shot but it chokes on a KeyError.

Platform: Win 10, x64 Python 3.6.4 auto_ml 2.9.4

Installed with pip install auto_ml. After the import I get a warning:

C:\Python36\lib\site-packages\deap\tools_hypervolume\pyhv.py:33: ImportWarning: Falling back to the python version of hypervolume module. Expect this to be very slow.
  "module. Expect this to be very slow.", ImportWarning)

When running train() I end up with this:

Calculating feature responses, for advanced analytics.
C:\Python36\lib\site-packages\sklearn\model_selection\_split.py:2026: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.
  FutureWarning)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-45-23d85ace19e4> in <module>()
----> 1 ml_predictor.train(X_train)

C:\Python36\lib\site-packages\auto_ml\predictor.py in train(***failed resolving arguments***)
    639 
    640         # This is our main logic for how we train the final model
--> 641         self.trained_final_model = self.train_ml_estimator(self.model_names, self._scorer, X_df, y)
    642 
    643         if self.ensemble_config is not None and len(self.ensemble_config) > 0:

C:\Python36\lib\site-packages\auto_ml\predictor.py in train_ml_estimator(self, estimator_names, scoring, X_df, y, feature_learning, prediction_interval)
   1202         # Use Case 1: Super straightforward: just train a single, non-optimized model
   1203         elif (feature_learning == True and self.optimize_feature_learning != True) or (len(estimator_names) == 1 and self.optimize_final_model != True):
-> 1204             trained_final_model = self.fit_single_pipeline(X_df, y, estimator_names[0], feature_learning=feature_learning, prediction_interval=False)
   1205 
   1206         # Use Case 2: Compare a bunch of models, but don't optimize any of them

C:\Python36\lib\site-packages\auto_ml\predictor.py in fit_single_pipeline(self, X_df, y, model_name, feature_learning, prediction_interval)
    854         # That saves a considerable amount of time
    855         if feature_learning == False:
--> 856             self.print_results(model_name, ppl, X_df, y)
    857 
    858         return ppl

C:\Python36\lib\site-packages\auto_ml\predictor.py in print_results(self, model_name, model, X, y)
   1026                 else:
   1027                     feature_responses = self.create_feature_responses(model, X, y, top_features)
-> 1028                 self._join_and_print_analytics_results(feature_responses, sorted_model_results, sort_field='Importance')
   1029             except AttributeError as e:
   1030                 if model_name == 'XGBRegressor':

C:\Python36\lib\site-packages\auto_ml\predictor.py in _join_and_print_analytics_results(self, df_feature_responses, df_features, sort_field)
   1487 
   1488             # Sort by coefficients or feature importances
-> 1489             df_results = df_results[['Feature Name', sort_field, 'Delta', 'FR_Decrementing', 'FR_Incrementing', 'FRD_abs', 'FRI_abs', 'FRD_MAD', 'FRI_MAD']]
   1490         else:
   1491             df_results = df_features

C:\Python36\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2131         if isinstance(key, (Series, np.ndarray, Index, list)):
   2132             # either boolean or fancy integer index
-> 2133             return self._getitem_array(key)
   2134         elif isinstance(key, DataFrame):
   2135             return self._getitem_frame(key)

C:\Python36\lib\site-packages\pandas\core\frame.py in _getitem_array(self, key)
   2175             return self._take(indexer, axis=0, convert=False)
   2176         else:
-> 2177             indexer = self.loc._convert_to_indexer(key, axis=1)
   2178             return self._take(indexer, axis=1, convert=True)
   2179 

C:\Python36\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
   1267                 if mask.any():
   1268                     raise KeyError('{mask} not in index'
-> 1269                                    .format(mask=objarr[mask]))
   1270 
   1271                 return _values_from_object(indexer)

KeyError: "['Delta' 'FR_Decrementing' 'FR_Incrementing' 'FRD_abs' 'FRI_abs' 'FRD_MAD'\n 'FRI_MAD'] not in index"

Where my code is:

from auto_ml import Predictor
import pandas as pd
from sklearn.model_selection import train_test_split

file = "./data/verified.normalized_full.csv"
X = pd.read_csv(file, header=None)
X.columns = ['id', 'title', 'content', 'label']

X = X.drop(['id'], axis=1)
y = X['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

column_descriptions = {
    'title': 'nlp',
    'content': 'nlp',
    'label': 'output'
}

ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=column_descriptions)
ml_predictor.train(X_train)
ml_predictor.score(X_test, X_test.label)

I'm sorry if this is just a usage error. I'm still going through the docs, but as they say "run first", that's what I did ;)

Jan 02 '18 17:01 black-snow

crap, sorry i didn't get to this 'til now.

you're totally right to run code first! this is just a blatant bug on my part that i haven't been able to reproduce. thank you very much for including the full traceback and the script you used to train the data- that's all super helpful.

it shows that it's probably an error with NLP data. i'm shipping a workaround tonight. will work on an actual fix soon.
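
for anyone hitting the same traceback: the last frames boil down to selecting columns that are not in the DataFrame. below is a minimal sketch of the kind of guard that avoids the KeyError, assuming the feature-response columns are simply missing from the joined frame. it is not the actual auto_ml patch; `select_available_columns` and the toy feature names are hypothetical, and only the column names come from the traceback.

```python
import pandas as pd

# Columns that _join_and_print_analytics_results tries to select,
# copied from the traceback (sort_field='Importance').
EXPECTED_COLUMNS = [
    'Feature Name', 'Importance', 'Delta', 'FR_Decrementing',
    'FR_Incrementing', 'FRD_abs', 'FRI_abs', 'FRD_MAD', 'FRI_MAD',
]


def select_available_columns(df, wanted):
    """Hypothetical helper: keep only the wanted columns that exist, in order."""
    present = [col for col in wanted if col in df.columns]
    return df[present]


# Toy frame standing in for the joined results when no feature responses
# were computed (made-up feature names and importances).
df_results = pd.DataFrame({
    'Feature Name': ['title_tfidf', 'content_tfidf'],
    'Importance': [0.7, 0.3],
})

# Selecting the full list directly raises KeyError, as in the report.
try:
    df_results[EXPECTED_COLUMNS]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# The guarded selection falls back to the columns that are present.
subset = select_available_columns(df_results, EXPECTED_COLUMNS)
print(list(subset.columns))
```

a guard like this only hides the missing analytics columns; it does not explain why they were never computed for nlp columns.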

sorry again for the slow response here. i'm extra bummed because you have one of the use cases that i explicitly designed this package for, and it's really cool to me to see how straightforward the code is that you wrote to train an nlp predictor.

let me know if you have any other feedback! or if you ended up using a different package, i'd love to hear that too- there's a lot these automated ml solutions can learn from each other.

Feb 09 '18 00:02 ClimbsRocks

Hey there, thanks for the reply.

I've created my own scripts for normalization, tf/tf-idf uni-/bigram chi2 selection etc. and finally used Keras with Tensorflow. However, I'm still interested in this package, even if it's just to see which architecture and parameters it chooses.
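
For readers unfamiliar with that kind of pipeline, here is a minimal scikit-learn sketch of tf-idf uni-/bigram features followed by chi2 selection. It is not the actual scripts mentioned above: the toy texts, the labels, and `k=5` are made up for illustration, and the Keras model is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, invented for this example.
texts = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "project status and agenda",
]
labels = [1, 0, 1, 0]

# Unigram + bigram tf-idf features, then keep the 5 features with the
# highest chi2 score against the labels.
features = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SelectKBest(chi2, k=5),
)

X_selected = features.fit_transform(texts, labels)
print(X_selected.shape)
```

The selected matrix could then be fed to any classifier, a Keras model included.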

Feb 09 '18 14:02 black-snow