FLAML [FLAML Crash] [Classification] 0 feature is supplied. Are you using raw Booster interface?

Hey, thanks for the great system.

I am experiencing a crash with the following multi-class classification dataset from Kaggle: spooky-author-identification. I get the following error when I try to fit FLAML:

  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1524, in fit
    self._search()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 2009, in _search
    self._search_sequential()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1825, in _search_sequential
    use_ray=False,
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/tune/tune.py", line 382, in run
    result = training_function(trial_to_run.config)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 240, in _compute_with_config_base
    self.fit_kwargs,
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 328, in compute_estimator
    fit_kwargs=fit_kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 267, in evaluate_model_CV
    log_training_metric=log_training_metric, fit_kwargs=fit_kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 196, in get_test_loss
    estimator.fit(X_train, y_train, budget, **fit_kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 515, in fit
    return super().fit(X_train, y_train, budget, **kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 318, in fit
    self._t1 = self._fit(X_train, y_train, **kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 99, in _fit
    model.fit(X_train, y_train, **kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/core.py", line 422, in inner_f
    return f(**kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/sklearn.py", line 915, in fit
    callbacks=callbacks)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/training.py", line 235, in train
    early_stopping_rounds=early_stopping_rounds)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/training.py", line 102, in _train_internal
    bst.update(dtrain, i, obj)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/core.py", line 1282, in update
    dtrain.handle))
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/core.py", line 189, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [04:04:16] ../src/learner.cc:567: Check failed: mparam_.num_feature != 0 (0 vs. 0) : 0 feature is supplied.  Are you using raw Booster interface?
Stack trace:
  [bt] (0) /home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x8d264) [0x7fadf8272264]
  [bt] (1) /home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x1ae8d2) [0x7fadf83938d2]
  [bt] (2) /home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x1bc0ac) [0x7fadf83a10ac]
  [bt] (3) /home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(+0x1a29cb) [0x7fadf83879cb]
  [bt] (4) /home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x58) [0x7fadf82650c8]
  [bt] (5) /home/mossad/anaconda3/envs/kgpip/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fae5a76a9dd]
  [bt] (6) /home/mossad/anaconda3/envs/kgpip/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7fae5a76a067]
  [bt] (7) /home/mossad/anaconda3/envs/kgpip/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2e7) [0x7fae5a7823a7]
  [bt] (8) /home/mossad/anaconda3/envs/kgpip/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12e14) [0x7fae5a782e14]

Here is my script:

import pandas as pd
from flaml import AutoML
from sklearn.model_selection import train_test_split

df = pd.read_csv('spooky.csv')
X, y = df.drop('author', axis=1), df['author']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
automl_model = AutoML()
automl_model.fit(X_train, y_train, task='classification',
                 time_budget=300,
                 retrain_full='budget',
                 verbose=0, metric='macro_f1')

Your feedback is appreciated.

May 13 '22 08:05 mossadhelali

Could you share a link to download the .csv file? Thanks.

May 13 '22 14:05 sonichi

Thanks @sonichi for your reply. Please find the .csv file of spooky-author-identification.

May 15 '22 03:05 mossadhelali

Thanks @mossadhelali. The reason is that the raw text feature column is removed in preprocessing, so no features are left for XGBoost. FLAML currently has no automatic feature engineering for pure text data with task='classification'. We do support task='seq-classification' for such use cases. This is a big area for improvement, and we'd appreciate contributions if you're interested. cc @liususan091219
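
To make the failure mode concrete, here is a minimal sketch of a pre-flight check that flags columns likely to be discarded before training. The post above only states that the raw text column is removed. The rule used here (a string column is flagged when it is constant or unique on every non-null row) is an assumption for illustration and is not FLAML's actual preprocessing code. The helper name `columns_likely_dropped` and the toy data are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def columns_likely_dropped(X: pd.DataFrame) -> list:
    """Flag string columns that are constant or unique on every non-null row.

    Assumed heuristic for illustration only, not FLAML's own logic.
    """
    flagged = []
    for name in X.columns:
        col = X[name]
        if is_string_dtype(col) or is_object_dtype(col):
            non_null = col.dropna()
            n_unique = non_null.nunique()
            # Free text and ID-like columns have one distinct value per row.
            if n_unique <= 1 or n_unique == len(non_null):
                flagged.append(name)
    return flagged


# Toy frame shaped like the features in the report: an ID and a raw text column.
X = pd.DataFrame({
    "id": ["id1", "id2", "id3"],
    "text": ["first sentence", "second sentence", "third sentence"],
})
flagged = columns_likely_dropped(X)
remaining = [c for c in X.columns if c not in flagged]
print("flagged:", flagged, "remaining:", remaining)
```

If `remaining` comes back empty, a tabular learner has nothing to train on, which is consistent with the "0 feature is supplied" error in the traceback.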

May 15 '22 18:05 sonichi

@mossadhelali thanks for your question. Indeed, you should be using the NLP module with the task = 'seq-classification' option. Here is the sample code for using FLAML to classify the spooky-author-identification dataset:

from datasets import load_dataset
from flaml import AutoML
import pandas
import ray
import scipy.special

ray.init(num_cpus=4, num_gpus=4, ignore_reinit_error=True)

train_valid_dataset = load_dataset('csv', data_files={'train': your_train_csv_file_path})["train"].to_pandas().sample(frac=1, random_state=42)
test_dataset = load_dataset('csv', data_files={'test': your_test_csv_file})["test"].to_pandas()

custom_sent_keys = ["text"]
label_key = "author"

X_train, y_train = train_valid_dataset[custom_sent_keys], train_valid_dataset[label_key]
X_test = test_dataset[custom_sent_keys]

automl_settings = {
        "gpu_per_trial": 1,
        "time_budget": 2400,
        "task": "seq-classification",
        "metric": "accuracy",
        "log_file_name": "seqclass.log",
        "use_ray": {"local_dir": "data/"},
        "n_concurrent_trials": 4
    }

automl_settings["fit_kwargs_by_estimator"] = {
        "transformer": {
            "model_path": model_path,  # placeholder: define model_path (model name or local path) before running
            "output_dir": "test/data/output/",
            "ckpt_per_epoch": 1,
            "fp16": True,
        }
    }

automl = AutoML()

automl.fit(
        X_train=X_train,
        y_train=y_train,
        **automl_settings
    )
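# axis=1 normalizes each row (one sample) separately; the default axis=None would normalize over the whole array
predicted_proba = scipy.special.softmax(automl.predict_proba(X_test), axis=1)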
predicted_id_proba = pandas.DataFrame(test_dataset["id"]).join(pandas.DataFrame(predicted_proba))
predicted_id_proba.to_csv("output.csv", index=False)

I ran the above code on my local machine with 4x NVIDIA V100 GPUs and got a 0.389 multi-class logarithmic loss on the test dataset. Please let us know if this solution works for you.
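
For readers who want to check a score like the one above on their own held-out split, here is a minimal, dependency-free sketch of a row-wise softmax followed by multi-class logarithmic loss. It is an illustration under stated assumptions, not the competition's official scorer: the function names, the clipping constant `eps`, and the toy labels "A", "B", "C" are all hypothetical.

```python
import math


def softmax(scores):
    """Convert one row of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_log_loss(true_labels, prob_rows, classes, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    index = {label: i for i, label in enumerate(classes)}
    total = 0.0
    for label, row in zip(true_labels, prob_rows):
        # Clip so that a probability of exactly 0 or 1 does not produce log(0).
        p = min(max(row[index[label]], eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_labels)


# Toy example with three hypothetical classes and two samples.
classes = ["A", "B", "C"]
probs = [softmax([2.0, 0.5, 0.1]), softmax([0.2, 0.3, 3.0])]
loss = multiclass_log_loss(["A", "C"], probs, classes)
print(f"log loss: {loss:.4f}")
```

A uniform prediction over three classes scores ln(3), roughly 1.0986, which gives a baseline for judging a reported value.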

Jun 03 '22 19:06 liususan091219