embed step_woe_bin() for binning numeric and factor predictors

Feature

Thanks for your work on this package.

It would be great if a recipe step were added that can bin numeric and factor features using weight of evidence (WoE) against a binary outcome. There are functions that already do this, such as woebin() from {scorecard} or woe.binning() from {woeBinning}. This recipe step would do two things:

1. Bin the numeric or factor features (this can lump some factor levels together)
2. Replace the bin values / factor levels with their woe values (like what step_woe() currently does)

Example with woebin() from {scorecard}:

library(scorecard)
library(rsample)

data("germancredit")
data_split <- initial_split(germancredit, strata = creditability)

germancredit_train <- training(data_split)
germancredit_test <- testing(data_split)

bins <- woebin(germancredit_train, "creditability")
#> ℹ Creating woe binning ...
#> ✔ Binning on 750 rows and 21 columns in 00:00:02

bins$duration.in.month
#>             variable       bin count count_distr   neg   pos   posprob        woe      bin_iv  total_iv breaks is_special_values
#>               <char>    <char> <int>       <num> <int> <int>     <num>      <num>       <num>     <num> <char>            <lgcl>
#> 1: duration.in.month  [-Inf,8)    68  0.09066667    60     8 0.1176471 -1.1676052 0.091925740 0.2587426      8             FALSE
#> 2: duration.in.month    [8,14)   205  0.27333333   151    54 0.2634146 -0.1809979 0.008618949 0.2587426     14             FALSE
#> 3: duration.in.month   [14,16)    53  0.07066667    45     8 0.1509434 -0.8799231 0.044135825 0.2587426     16             FALSE
#> 4: duration.in.month   [16,34)   291  0.38800000   197    94 0.3230241  0.1073889 0.004568290 0.2587426     34             FALSE
#> 5: duration.in.month   [34,44)    76  0.10133333    46    30 0.3947368  0.4198538 0.019193319 0.2587426     44             FALSE
#> 6: duration.in.month [44, Inf)    57  0.07600000    26    31 0.5438596  1.0231885 0.090300448 0.2587426    Inf             FALSE

bins$purpose
#>    variable                                                              bin count count_distr   neg   pos   posprob        woe     bin_iv  total_iv                                                           breaks is_special_values
#>      <char>                                                           <char> <int>       <num> <int> <int>     <num>      <num>      <num>     <num>                                                           <char>            <lgcl>
#> 1:  purpose                                          retraining%,%car (used)    83  0.11066667    70    13 0.1566265 -0.8362480 0.06318318 0.1960758                                          retraining%,%car (used)             FALSE
#> 2:  purpose                                       radio/television%,%repairs   220  0.29333333   172    48 0.2181818 -0.4289956 0.04902807 0.1960758                                       radio/television%,%repairs             FALSE
#> 3:  purpose furniture/equipment%,%business%,%domestic appliances%,%car (new)   395  0.52666667   257   138 0.3493671  0.2254755 0.02791601 0.1960758 furniture/equipment%,%business%,%domestic appliances%,%car (new)             FALSE
#> 4:  purpose                                               education%,%others    52  0.06933333    26    26 0.5000000  0.8472979 0.05594856 0.1960758                                               education%,%others             FALSE

germancredit_test_woe <- woebin_ply(germancredit_test, bins = bins)
#> ℹ Converting into woe values ...
#> ✔ Woe transformating on 250 rows and 20 columns in 00:00:00

head(germancredit_test_woe)
#>    creditability status.of.existing.checking.account_woe duration.in.month_woe credit.history_woe purpose_woe credit.amount_woe savings.account.and.bonds_woe present.employment.since_woe
#>           <fctr>                                   <num>                 <num>              <num>       <num>             <num>                         <num>                        <num>
#> 1:          good                               0.7901394           -0.83910109        -0.73005174  -0.5518446        0.01369884                    -0.7833423                  -0.34989526
#> 2:           bad                               0.7901394            0.06578153        -0.05715841   0.3677248        0.31508105                     0.2344150                   0.06559728
#> 3:          good                               0.2814901            0.80349524         0.10090617  -0.5518446        0.82320031                     0.2344150                   0.06559728
#> 4:          good                              -1.2599785           -0.30766736         0.10090617  -0.5518446       -0.33683660                    -0.7833423                  -0.34989526
#> 5:           bad                               0.2814901            0.06578153        -0.73005174   0.3677248        0.31508105                     0.2344150                   0.21868920
#> 6:           bad                               0.7901394            0.06578153        -0.73005174   0.3677248        0.01369884                     0.2344150                  -0.34989526
#>    installment.rate.in.percentage.of.disposable.income_woe personal.status.and.sex_woe other.debtors.or.guarantors_woe present.residence.since_woe property_woe age.in.years_woe other.installment.plans_woe
#>                                                      <num>                       <num>                           <num>                       <num>        <num>            <num>                       <num>
#> 1:                                             0.095061763                 -0.09790421                       0.0287165                 -0.01712104  -0.56976816       -0.1941560                  -0.1688382
#> 2:                                            -0.004073325                 -0.09790421                       0.0287165                 -0.01712104   0.49062292       -0.1941560                  -0.1688382
#> 3:                                            -0.077291674                 -0.09790421                       0.0287165                  0.14090545   0.09425254       -0.9650809                  -0.1688382
#> 4:                                            -0.077291674                 -0.09790421                       0.0287165                 -0.01712104  -0.56976816       -0.1941560                  -0.1688382
#> 5:                                             0.095061763                 -0.09790421                       0.0287165                  0.14090545   0.09425254       -0.1044233                  -0.1688382
#> 6:                                             0.095061763                 -0.09790421                       0.0287165                 -0.01712104   0.09425254       -0.1941560                  -0.1688382
#>    housing_woe number.of.existing.credits.at.this.bank_woe     job_woe number.of.people.being.liable.to.provide.maintenance.for_woe telephone_woe foreign.worker_woe
#>          <num>                                       <num>       <num>                                                        <num>         <num>              <num>
#> 1:  -0.2121896                                  -0.1009105 -0.02034658                                                   0.01369884   -0.14732471                  0
#> 2:   0.4616354                                  -0.1009105 -0.02034658                                                  -0.06899287    0.09352606                  0
#> 3:   0.4944765                                   0.0534367  0.09858083                                                   0.01369884   -0.14732471                  0
#> 4:  -0.2121896                                   0.0534367 -0.00836825                                                   0.01369884    0.09352606                  0
#> 5:  -0.2121896                                  -0.1009105  0.09858083                                                   0.01369884    0.09352606                  0
#> 6:  -0.2121896                                  -0.1009105 -0.00836825                                                   0.01369884    0.09352606                  0
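
For readers who want to see where the woe and bin_iv columns come from, here is a minimal base R sketch that recomputes them from the pos and neg counts printed in the bins$duration.in.month table above. The formulas are inferred from the printed numbers (they reproduce them), so treat this as an illustration, not a statement of how {scorecard} is implemented internally.

```r
# Counts copied from the bins$duration.in.month output above
neg <- c(60, 151, 45, 197, 46, 26)
pos <- c(8, 54, 8, 94, 30, 31)

# Share of each outcome class that falls into each bin
pos_dist <- pos / sum(pos)
neg_dist <- neg / sum(neg)

# Weight of evidence per bin, and each bin's contribution to the information value
woe <- log(pos_dist / neg_dist)
bin_iv <- (pos_dist - neg_dist) * woe
total_iv <- sum(bin_iv)

round(woe, 4)      # compare with the woe column above
round(total_iv, 4) # compare with the total_iv column above
```

The first bin, for example, gives log((8/225) / (60/525)), which is about -1.1676, matching the table.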

Jan 12 '25 17:01 AndrewKostandy

Hello @AndrewKostandy 👋

Before I take a deeper look into this method, can you tell me what advantage it has over manually creating woe-compliant variables?

recipe(formula, data) |>
  step_discretize(predictors, num_breaks = 2) |>
  step_woe(predictors, outcome = vars(outcome))
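
A runnable version of this pattern might look like the sketch below. It assumes the germancredit data from {scorecard} used earlier in the thread; the choice of columns (credit.amount and purpose) and num_breaks = 4 are illustrative assumptions, not part of the original suggestion.

```r
library(recipes)
library(embed)
library(scorecard) # only used here for the germancredit data

data("germancredit")

rec <- recipe(creditability ~ credit.amount + purpose, data = germancredit) |>
  # Quantile-based bins for the numeric predictor (assumed: 4 breaks)
  step_discretize(credit.amount, num_breaks = 4) |>
  # Replace the bins and the factor levels with their woe values
  step_woe(credit.amount, purpose, outcome = vars(creditability))

baked <- bake(prep(rec), new_data = NULL)
head(baked)
```

Note that step_discretize() picks its cut points from quantiles of the predictor alone, without looking at the outcome.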

Jan 13 '25 18:01 EmilHvitfeldt

Hi @EmilHvitfeldt,

The core of the step is binning both factor and numeric variables in a way that maximizes each variable's WOE/IV-based predictive power. The description of the {woeBinning} package, for example, states:

An implementation of fine and coarse classing that merges granular classes and levels step by step. And a tree-like approach that iteratively segments the initial bins via binary splits. Both procedures merge, respectively split, bins based on similar weight of evidence (WOE) values and stop via an information value (IV) based criteria

The binning process of woe.binning() from {woeBinning} is described here:

(From https://www.rdocumentation.org/packages/woeBinning/versions/0.1.6/topics/woe.binning)

Similarly, the woebin() function from {scorecard} states:

woebin generates optimal binning for numerical, factor and categorical variables using methods including tree-like segmentation or chi-square merge...

stop_limit: Stop binning segmentation when information value gain ratio less than the 'stop_limit' if using tree method; or stop binning merge when the chi-square of each neighbor bins are larger than the threshold under significance level of 'stop_limit' and freedom degree of 1 if using chimerge method.

Currently, we would need to use step_discretize_cart() or step_discretize_xgb() for numeric predictors and then step_collapse_cart() for factor variables, but these may not optimize for woe/IV values specifically. When we plan to encode our predictors with woe, it is likely a good idea to bin those predictors with that same criterion.
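
To make the "merge bins with similar WOE, stop on an IV criterion" idea from the quoted descriptions concrete, here is a deliberately simplified base R sketch. It is not the algorithm used by woe.binning() or woebin(); the helper names iv() and merge_bins() and the max_loss threshold are hypothetical, and the sketch assumes every bin has at least one row of each outcome class.

```r
# Hypothetical helper: total information value for per-bin counts
iv <- function(pos, neg) {
  p <- pos / sum(pos)
  n <- neg / sum(neg)
  sum((p - n) * log(p / n))
}

# Hypothetical helper: repeatedly merge the two adjacent bins whose WoE values
# are closest, and stop once a merge would cost more than `max_loss` of the
# current IV (as a fraction) or only two bins remain
merge_bins <- function(pos, neg, max_loss = 0.05) {
  while (length(pos) > 2) {
    woe <- log((pos / sum(pos)) / (neg / sum(neg)))
    i <- which.min(abs(diff(woe))) # closest adjacent pair is (i, i + 1)
    new_pos <- c(pos[seq_len(i - 1)], pos[i] + pos[i + 1], pos[-seq_len(i + 1)])
    new_neg <- c(neg[seq_len(i - 1)], neg[i] + neg[i + 1], neg[-seq_len(i + 1)])
    if (1 - iv(new_pos, new_neg) / iv(pos, neg) > max_loss) break
    pos <- new_pos
    neg <- new_neg
  }
  data.frame(pos = pos, neg = neg,
             woe = log((pos / sum(pos)) / (neg / sum(neg))))
}

# Starting bins: the duration.in.month counts from the first post
merged <- merge_bins(pos = c(8, 54, 8, 94, 30, 31),
                     neg = c(60, 151, 45, 197, 46, 26))
merged
```

The point of the sketch is that both the merge decision and the stopping rule look at the outcome through WoE and IV, which quantile-based or generic tree-based binning does not do directly.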

Technically, the new step could handle only the binning (a step_bin_woe(), say), and the encoding of the bins as woe values could then be left to the existing step_woe(). So something like:

recipe(formula, data) |>
  # Proposed step: would bin both numeric and factor predictors to optimize woe/IV values
  step_bin_woe(all_predictors(), outcome = vars(outcome)) |>
  # Existing step: would replace the bins with their woe values
  step_woe(all_predictors(), outcome = vars(outcome))

Jan 13 '25 19:01 AndrewKostandy