
why it’s not possible to use n_jobs = n, like in scikit-learn

Open lukaspistelak opened this issue 1 year ago • 7 comments

Hello, I would like to ask why it’s not possible to use n_jobs = n, like in scikit-learn. I have to select (3-5) features from 700 features, and it takes 2 hours. :/ So, some research is quite hard and slow. :+1:


```python
tr = RecursiveFeatureAddition(
    estimator=lgb_model, cv=cv, scoring='average_precision', threshold=0.002
)

Xt = tr.fit_transform(X, y)
```

Thanks

lukaspistelak avatar Dec 19 '24 18:12 lukaspistelak

Hi @lukaspistelak

To select from 700 features, this transformer trains 700 models multiplied by the number of cross-validation folds. So with cv set to 5, it trains 700 x 5 = 3,500 models. That is probably why it takes so long. LightGBM models can also be slow to train, depending on the number of trees. If the LightGBM estimator accepts n_jobs, you should set it there.

It's hard to say a priori whether 2 hours is long or short, because it depends on the LightGBM configuration, the size of your data, and your available computing resources. If you send more details about how you set up the entire search, I might be able to offer some tips.
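A back-of-the-envelope model count makes the cost concrete (a rough sketch; the per-fit time of 2 seconds is an assumption, and Feature-engine's exact bookkeeping may differ slightly):

```python
# Rough estimate of how many models RecursiveFeatureAddition trains:
# roughly one cross-validated fit per candidate feature.
n_features = 700
cv_folds = 5

models_trained = n_features * cv_folds
print(models_trained)  # 3500

# At an assumed ~2 seconds per LightGBM fit, that is already close to 2 hours:
seconds_per_fit = 2
print(round(models_trained * seconds_per_fit / 3600, 1))  # ~1.9 hours
```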

Cheers

solegalli avatar Dec 23 '24 09:12 solegalli

Thanks for your response and help! 😊

I tried to add the n_jobs parameter, but it didn’t help. 😕

Here are the LightGBM model parameters I’m using:

```python
params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',

    'max_depth': 5,        # smaller trees, less complexity

    'lambda_l1': 0.1,      # L1 regularization
    'lambda_l2': 0.1,      # L2 regularization

    # 'learning_rate': 0.1,  # lower learning rate for more gradual training
    'verbose': -1,         # suppress output
    'n_jobs': 3,
}

num_round = 5
```
1. cv is not 5, but 45.
2. The data size is circa 3k rows and 700 columns.
3. The features are generated by the same method (a transformer) with different parameters, so I need to select the features with the best parameters.
4. Highly correlated features can still be selected; correlation does not mean they carry no useful information.

lukaspistelak avatar Dec 23 '24 13:12 lukaspistelak

Why do you use 45 as cv? That makes the selector train 700 x 45 models, which is what's making it take so long. I normally use 3 or 5.

Feature-engine relies heavily on sklearn, so we leverage the n_jobs parameter implemented in most sklearn classes. We don't add parallelization on top of the parallelization already contained in sklearn, because most of our routines are not so computationally heavy.
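In practice that means parallelism is configured on the estimator (and, where sklearn exposes it, on the CV routine), not on the selector. A minimal sketch using a scikit-learn estimator as a stand-in (LGBMClassifier takes n_jobs in the same place):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Parallelism is set on the estimator itself ...
clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)

# ... and, for plain sklearn utilities, on the CV routine as well.
scores = cross_val_score(clf, X, y, cv=3, scoring='average_precision', n_jobs=-1)
print(scores.shape)  # (3,)
```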

solegalli avatar Dec 25 '24 19:12 solegalli

> Why do you use 45 as cv? That makes the selector train 700 x 45 models, which is what's making it take so long. I normally use 3 or 5.
>
> Feature-engine relies heavily on sklearn, so we leverage the n_jobs parameter implemented in most sklearn classes. We don't add parallelization on top of the parallelization already contained in sklearn, because most of our routines are not so computationally heavy.

```python
cv = CombPurgedKFoldCVLocal(n_splits=10, n_test_splits=3, X_index=X.index)
```

which gives 45 combinations.

lukaspistelak avatar Dec 28 '24 14:12 lukaspistelak

Interesting! Thank you. I haven't heard of that cross-validation framework before.

RecursiveFeatureAddition will test all features, in your case 700. So if you only need to select 3, that is a lot of testing for no reason. It will also not select exactly 3, but however many features satisfy the threshold condition. We could add functionality to make it stop after a given number of features has been found, in the next round of Feature-engine updates.

A similar alternative search is the SFS from MLXtend, with the search set to forward. You can make that transformer stop after it finds a certain number of features, so if you stop at 5, it should, in theory, take less time, although the search procedure is not identical to RecursiveFeatureAddition (only roughly similar).
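For illustration, scikit-learn ships a comparable forward SequentialFeatureSelector that also stops after a fixed number of features (MLXtend's SFS has an analogous k_features parameter); a minimal sketch with a stand-in estimator:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# Forward selection that stops once 5 features have been picked,
# instead of threshold-testing every candidate like RecursiveFeatureAddition.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction='forward',
    scoring='average_precision',
    cv=3,
)
sfs.fit(X, y)
print(sfs.get_support().sum())  # 5
```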

Another alternative is to set up a simpler LightGBM. If you check the theory on successive halving in sklearn, you'll see that simpler models are often enough to identify what works best. So you could train a LightGBM with fewer estimators and shallower depth, reduce the feature space from 700 to 20, and then increase the complexity of the model to finalize the set of features, if that makes sense.
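A rough sketch of that two-stage idea (using RandomForest as a stand-in for LightGBM; the cheap first pass uses SelectFromModel rather than a recursive search, purely to keep the example small):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

# Stage 1: a cheap, shallow model prunes the feature space aggressively.
# threshold=-np.inf forces selection of exactly max_features top features.
cheap = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
stage1 = SelectFromModel(cheap, max_features=20, threshold=-np.inf)
X_small = stage1.fit_transform(X, y)
print(X_small.shape)  # (300, 20)

# Stage 2: a more complex model finalizes the selection on the reduced space.
strong = RandomForestClassifier(n_estimators=200, random_state=0)
strong.fit(X_small, y)
```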

I hope this helps!

solegalli avatar Dec 30 '24 20:12 solegalli

Thanks! It helped. I have a question: is it possible to use a validation dataset as well?


```python
from feature_engine.selection import RecursiveFeatureAddition

tr = RecursiveFeatureAddition(
    estimator=lgb_model, cv=cv, scoring='average_precision', threshold=0.01
)

Xt = tr.fit_transform(X_train, y_train, eval_set=[(X_test, y_test)],
                      eval_metric='average_precision')

Xt.columns
```

lukaspistelak avatar Jan 17 '25 18:01 lukaspistelak

Thanks for the suggestion.

If sklearn implements an evaluation set like the one you show, we'll do it as well. Otherwise, we'll keep our API compatible with them.

I know this is something xgboost, and probably LightGBM, support, but I'm not sure it has been made part of sklearn at the moment.

solegalli avatar Jan 21 '25 17:01 solegalli