
Allow discretisers to output bins as user-defined strings instead of integers or boundaries.

Open PeterPirog opened this issue 3 years ago • 5 comments

I suggest some new functionality in class

class feature_engine.discretisation.EqualFrequencyDiscretiser(variables=None, q=10, return_object=False, return_boundaries=False)

Currently q is an integer, but it could also be defined as a list of strings, for example: q=['very_low','low','medium','high','very_high']

The number of bins would equal len(q), and the result would be of type object instead of integer, with values taken from the list:

  • 'very_low' (type object) instead of 0 (type int)
  • 'low' (type object) instead of 1 (type int)
  • 'medium' (type object) instead of 2 (type int)
  • 'high' (type object) instead of 3 (type int)
  • 'very_high' (type object) instead of 4 (type int)
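
The requested output can be illustrated with pandas alone. This is a minimal sketch that uses pandas.qcut as a stand-in for the discretiser; it is not feature_engine's API, and the sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample data, for illustration only.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
names = ["very_low", "low", "medium", "high", "very_high"]

# One equal-frequency bin per name, so the number of bins is len(names).
# The categorical result is cast to object to match the requested output type.
binned = pd.qcut(values, q=len(names), labels=names).astype(object)

print(binned.tolist())
```

Each value is replaced by the name of the equal-frequency interval it falls into, rather than by the integer 0 to 4.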

PeterPirog avatar Feb 22 '22 07:02 PeterPirog

Hi @PeterPirog

Thank you for the suggestion.

If I understand this correctly, independently of how we implement it, what you would like to see is the option to name the intervals with bespoke strings. These strings would be entered by the user.

Is this correct?

The logic to return integers is that we already have a value that can be used by a machine learning model.

Users found that the integers didn't give them enough information about the actual size of the interval, so they requested the option to rename the intervals with the interval boundaries.

Renaming with arbitrary strings is more complex. Would the user want to give the same names to the intervals of all the variables they discretize, assuming that they discretize 5 variables at a time? Or would the user have to call the discretizer 5 times to pass the different interval names? In that case, the power of the discretizer is a bit lost.

What would be potential scenarios in which we would like to have strings instead of integers or interval boundaries? Could you give us some examples?

Thanks

solegalli avatar Feb 22 '22 18:02 solegalli

@solegalli My goal is to visualize similarity and correlation between categorical and numerical values.

Step 1: convert numerical values to bins (categorical values)
Step 2: create text values from column name + category name
Step 3: merge all feature columns into a single text feature
Step 4: tokenize the text and create word embeddings as vectors (the embeddings are trained in the model)
Step 5: compress the embedding vectors with PCA from n dimensions to 3 dimensions and visualize them at https://projector.tensorflow.org/
Step 6: show dependencies between categorical and numerical values on the same 3D plot

Sample code: https://github.com/PeterPirog/github_stack_scripts/blob/main/01_featureengine_equalfrequency.ipynb
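
Steps 1 to 3 can be sketched with pandas alone. This is only an illustration under assumptions: the column names, the data, and the use of pandas.qcut in place of the discretiser are all hypothetical and are not taken from the linked notebook.

```python
import pandas as pd

# Hypothetical data: one numerical and one categorical feature.
df = pd.DataFrame({
    "age": [22, 35, 47, 58, 63, 71],
    "colour": ["red", "blue", "red", "green", "blue", "green"],
})
names = ["low", "medium", "high"]

# Step 1: bin the numerical column into named equal-frequency intervals.
binned = df.copy()
binned["age"] = pd.qcut(df["age"], q=len(names), labels=names).astype(str)

# Step 2: prefix every value with its column name, e.g. "age_low".
tokens = pd.DataFrame(
    {col: col + "_" + binned[col].astype(str) for col in binned.columns}
)

# Step 3: merge all feature columns into a single text feature per row.
text = tokens.apply(" ".join, axis=1)

print(text.tolist())
```

The resulting strings could then be passed to a tokenizer and an embedding layer for Steps 4 to 6, which are outside the scope of this sketch.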

PeterPirog avatar Feb 22 '22 20:02 PeterPirog

Thank you so much for the detailed explanation @PeterPirog

I've not come across this type of analysis before.

I reckon you need the specific label names for the analysis, but from a practical perspective, would this not be possible with the current integer encoded interval names?

Before committing to change the API, I would like to understand a bit better to what extent this would be of use to the wider user base. We could wait and see if this issue gets some support from other users.

solegalli avatar Feb 24 '22 01:02 solegalli

@solegalli Of course, there is no problem with converting the integers to string labels manually. I don't think there is any reason to make big changes to the API; I have an idea for how to keep it compatible.
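
The manual conversion mentioned here could look roughly like the following. It is a sketch under assumptions: the integer codes are produced with pandas.qcut(labels=False) as a stand-in for the discretiser's integer output, and the data and label names are made up.

```python
import pandas as pd

# Made-up sample data, for illustration only.
values = pd.Series([5, 1, 9, 3, 7, 2, 8, 4, 10, 6])

# Stand-in for the discretiser's integer output: bin codes 0..4.
codes = pd.qcut(values, q=5, labels=False)

# Map each integer code to a user-chosen string label.
names = {0: "very_low", 1: "low", 2: "medium", 3: "high", 4: "very_high"}
labelled = codes.map(names)

print(labelled.tolist())
```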

feature_engine.discretisation.EqualFrequencyDiscretiser(variables=None, q=10, return_object=False, return_boundaries=False)

q: int, list, dict, default=10. Desired number of equal frequency intervals / bins.

Example: q=['Range0_name','Range1_name', ....., 'RangeN_name']

  • all variables have the same range names
q={'Feature1':['Range0_name','Range1_name', ....., 'RangeN_name'],
'Feature2':['Range0_name','Range1_name', ....., 'RangeM_name'],
'Feature3':['Range0_name','Range1_name', ....., 'RangeK_name'],
'other':5}

- for Feature1, use the range names ['Range0_name','Range1_name', ....., 'RangeN_name']
- for Feature2, use the range names ['Range0_name','Range1_name', ....., 'RangeM_name']
- for Feature3, use the range names ['Range0_name','Range1_name', ....., 'RangeK_name']
- for other features, use 5 bins with integer output

PeterPirog avatar Feb 24 '22 12:02 PeterPirog

If we were to make this change, I would prefer to leave the functionality of q as it is now, and add an additional parameter, called labels, that defaults to None, where the user could pass a list or dictionary. We would have to do this for the equal frequency, equal width and arbitrary discretizers.
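
To make the idea concrete, here is a rough sketch of how a labels parameter might behave. Everything in it is an assumption: discretise_with_labels is a hypothetical helper built on pandas.qcut, not part of feature_engine, and the rule that each label list must contain exactly q names is one possible design choice, not something decided in this thread.

```python
import pandas as pd


def discretise_with_labels(df, variables, q=10, labels=None):
    """Hypothetical helper: equal-frequency binning with optional labels.

    labels=None -> integer bin codes (q keeps its current meaning)
    labels=list -> the same names are used for every variable
    labels=dict -> names per variable; variables not in the dict
                   fall back to integer bin codes
    """
    out = df.copy()
    for var in variables:
        var_labels = labels.get(var) if isinstance(labels, dict) else labels
        if var_labels is None:
            # Default behaviour: integer codes 0..q-1.
            out[var] = pd.qcut(df[var], q=q, labels=False)
        else:
            # Assumed rule: one label per bin, so len(labels) must equal q.
            if len(var_labels) != q:
                raise ValueError(f"{var}: expected {q} labels, got {len(var_labels)}")
            out[var] = pd.qcut(df[var], q=q, labels=var_labels).astype(object)
    return out


# Made-up data: "a" gets string labels, "b" keeps integer codes.
df = pd.DataFrame({"a": range(1, 11), "b": range(10, 101, 10)})
names = ["very_low", "low", "medium", "high", "very_high"]
result = discretise_with_labels(df, ["a", "b"], q=5, labels={"a": names})
print(result)
```

With a dictionary, only the listed variables are renamed, so a single call can still discretize several variables at once.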

But, as I said, I would like to wait a few months and see if this issue has some support from the community before committing to change the API.

I hope this makes sense.

Thank you for the suggestion and details on the use case and implementation.

Sole

solegalli avatar Feb 24 '22 18:02 solegalli