step_woe_bin() for binning numeric and factor predictors
Feature
Thanks for your work on this package.
It would be great to have a recipe step that bins numeric and factor features using weight of evidence (WoE) against a binary outcome. Functions that already do this include woebin() from {scorecard} and woe.binning() from {woeBinning}. The step would do two things:
- Bin the numeric or factor features (can lump some factor levels together)
- Replace the bin values / factor levels with their woe values (like what step_woe() currently does)
Example with woebin() from {scorecard}:
library(scorecard)
library(rsample)
data("germancredit")
data_split <- initial_split(germancredit, strata = creditability)
germancredit_train <- training(data_split)
germancredit_test <- testing(data_split)
bins <- woebin(germancredit_train, "creditability")
#> ℹ Creating woe binning ...
#> ✔ Binning on 750 rows and 21 columns in 00:00:02
bins$duration.in.month
#> variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
#> <char> <char> <int> <num> <int> <int> <num> <num> <num> <num> <char> <lgcl>
#> 1: duration.in.month [-Inf,8) 68 0.09066667 60 8 0.1176471 -1.1676052 0.091925740 0.2587426 8 FALSE
#> 2: duration.in.month [8,14) 205 0.27333333 151 54 0.2634146 -0.1809979 0.008618949 0.2587426 14 FALSE
#> 3: duration.in.month [14,16) 53 0.07066667 45 8 0.1509434 -0.8799231 0.044135825 0.2587426 16 FALSE
#> 4: duration.in.month [16,34) 291 0.38800000 197 94 0.3230241 0.1073889 0.004568290 0.2587426 34 FALSE
#> 5: duration.in.month [34,44) 76 0.10133333 46 30 0.3947368 0.4198538 0.019193319 0.2587426 44 FALSE
#> 6: duration.in.month [44, Inf) 57 0.07600000 26 31 0.5438596 1.0231885 0.090300448 0.2587426 Inf FALSE
bins$purpose
#> variable bin count count_distr neg pos posprob woe bin_iv total_iv breaks is_special_values
#> <char> <char> <int> <num> <int> <int> <num> <num> <num> <num> <char> <lgcl>
#> 1: purpose retraining%,%car (used) 83 0.11066667 70 13 0.1566265 -0.8362480 0.06318318 0.1960758 retraining%,%car (used) FALSE
#> 2: purpose radio/television%,%repairs 220 0.29333333 172 48 0.2181818 -0.4289956 0.04902807 0.1960758 radio/television%,%repairs FALSE
#> 3: purpose furniture/equipment%,%business%,%domestic appliances%,%car (new) 395 0.52666667 257 138 0.3493671 0.2254755 0.02791601 0.1960758 furniture/equipment%,%business%,%domestic appliances%,%car (new) FALSE
#> 4: purpose education%,%others 52 0.06933333 26 26 0.5000000 0.8472979 0.05594856 0.1960758 education%,%others FALSE
germancredit_test_woe <- woebin_ply(germancredit_test, bins=bins)
#> ℹ Converting into woe values ...
#> ✔ Woe transformating on 250 rows and 20 columns in 00:00:00
head(germancredit_test_woe)
#> creditability status.of.existing.checking.account_woe duration.in.month_woe credit.history_woe purpose_woe credit.amount_woe savings.account.and.bonds_woe present.employment.since_woe
#> <fctr> <num> <num> <num> <num> <num> <num> <num>
#> 1: good 0.7901394 -0.83910109 -0.73005174 -0.5518446 0.01369884 -0.7833423 -0.34989526
#> 2: bad 0.7901394 0.06578153 -0.05715841 0.3677248 0.31508105 0.2344150 0.06559728
#> 3: good 0.2814901 0.80349524 0.10090617 -0.5518446 0.82320031 0.2344150 0.06559728
#> 4: good -1.2599785 -0.30766736 0.10090617 -0.5518446 -0.33683660 -0.7833423 -0.34989526
#> 5: bad 0.2814901 0.06578153 -0.73005174 0.3677248 0.31508105 0.2344150 0.21868920
#> 6: bad 0.7901394 0.06578153 -0.73005174 0.3677248 0.01369884 0.2344150 -0.34989526
#> installment.rate.in.percentage.of.disposable.income_woe personal.status.and.sex_woe other.debtors.or.guarantors_woe present.residence.since_woe property_woe age.in.years_woe other.installment.plans_woe
#> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0.095061763 -0.09790421 0.0287165 -0.01712104 -0.56976816 -0.1941560 -0.1688382
#> 2: -0.004073325 -0.09790421 0.0287165 -0.01712104 0.49062292 -0.1941560 -0.1688382
#> 3: -0.077291674 -0.09790421 0.0287165 0.14090545 0.09425254 -0.9650809 -0.1688382
#> 4: -0.077291674 -0.09790421 0.0287165 -0.01712104 -0.56976816 -0.1941560 -0.1688382
#> 5: 0.095061763 -0.09790421 0.0287165 0.14090545 0.09425254 -0.1044233 -0.1688382
#> 6: 0.095061763 -0.09790421 0.0287165 -0.01712104 0.09425254 -0.1941560 -0.1688382
#> housing_woe number.of.existing.credits.at.this.bank_woe job_woe number.of.people.being.liable.to.provide.maintenance.for_woe telephone_woe foreign.worker_woe
#> <num> <num> <num> <num> <num> <num>
#> 1: -0.2121896 -0.1009105 -0.02034658 0.01369884 -0.14732471 0
#> 2: 0.4616354 -0.1009105 -0.02034658 -0.06899287 0.09352606 0
#> 3: 0.4944765 0.0534367 0.09858083 0.01369884 -0.14732471 0
#> 4: -0.2121896 0.0534367 -0.00836825 0.01369884 0.09352606 0
#> 5: -0.2121896 -0.1009105 0.09858083 0.01369884 0.09352606 0
#> 6: -0.2121896 -0.1009105 -0.00836825 0.01369884 0.09352606 0
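For reference, the woe, bin_iv and total_iv columns shown for duration.in.month above can be reproduced by hand from the neg and pos counts. This is a minimal base R sketch that assumes the common definition, WoE = log(share of positives in the bin / share of negatives in the bin), with the information value summed over bins. The variable names are illustrative only and are not part of any package.

```r
# Counts copied from the duration.in.month table above (training set).
neg <- c(60, 151, 45, 197, 46, 26)
pos <- c(8, 54, 8, 94, 30, 31)

# Share of all positives / negatives that fall in each bin.
pos_dist <- pos / sum(pos)
neg_dist <- neg / sum(neg)

# Weight of evidence per bin, and each bin's contribution to the IV.
woe      <- log(pos_dist / neg_dist)
bin_iv   <- (pos_dist - neg_dist) * woe
total_iv <- sum(bin_iv)

print(round(woe, 4))
print(round(total_iv, 4))
```

With these counts the first bin works out to about -1.1676 and the total IV to about 0.2587, in line with the table.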
Hello @AndrewKostandy 👋
Before I take a deeper look into this method, can you tell me what its advantage is over manually creating WoE-compliant variables?
recipe(formula, data) |>
step_discretize(predictors, num_breaks = 2) |>
step_woe(predictors, outcome = vars(outcome))
Hi @EmilHvitfeldt,
The core of the step is binning both factor and numeric variables so that each variable's predictive power, as measured by WOE/IV, is maximized. The description of the {woeBinning} package, for example, states:
An implementation of fine and coarse classing that merges granular classes and levels step by step. And a tree-like approach that iteratively segments the initial bins via binary splits. Both procedures merge, respectively split, bins based on similar weight of evidence (WOE) values and stop via an information value (IV) based criteria
The binning process of woe.binning() from {woeBinning} is described here:
(From https://www.rdocumentation.org/packages/woeBinning/versions/0.1.6/topics/woe.binning)
Similarly, the woebin() function from {scorecard} states:
woebin generates optimal binning for numerical, factor and categorical variables using methods including tree-like segmentation or chi-square merge...
stop_limit: Stop binning segmentation when information value gain ratio less than the 'stop_limit' if using tree method; or stop binning merge when the chi-square of each neighbor bins are larger than the threshold under significance level of 'stop_limit' and freedom degree of 1 if using chimerge method.
Currently, we would need to use step_discretize_cart() or step_discretize_xgb() for numeric predictors and step_collapse_cart() for factor predictors, but these may not optimize for woe/IV values specifically. If we plan to encode our predictors with woe, it is likely a good idea to bin them with that objective in mind.
Technically, a new step_bin_woe() step would only need to do the binning; encoding the bins as woe values could then be left to step_woe(). So something like:
recipe(formula, data) |>
  # Would bin both numeric and factor predictors to optimize woe/IV values
  step_bin_woe(all_predictors(), outcome = vars(creditability)) |>
  # Would replace the bins with their woe values
  step_woe(all_predictors(), outcome = vars(creditability))
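To make the idea of WoE-driven binning concrete, here is a rough base R sketch of one possible approach for a numeric predictor: start from fine quantile bins, then repeatedly merge the adjacent pair with the most similar WoE until a merge would cost too much IV. This is not the algorithm used by woebin() or woe.binning(). The helper names (woe_iv, merge_bins_by_woe), the smoothing constant eps, and the max_iv_loss stopping rule are all assumptions made for illustration.

```r
# Hypothetical helpers, not part of {recipes}, {embed}, {scorecard} or {woeBinning}.
woe_iv <- function(pos, neg, eps = 0.5) {
  # eps is a small smoothing count (an assumption) to avoid log(0).
  p <- (pos + eps) / sum(pos + eps)
  n <- (neg + eps) / sum(neg + eps)
  woe <- log(p / n)
  list(woe = woe, iv = sum((p - n) * woe))
}

merge_bins_by_woe <- function(x, y, n_fine = 10, max_iv_loss = 0.05) {
  # Assumes x is numeric without NAs and y is 0/1 with 1 as the positive class.
  breaks <- unique(unname(quantile(x, probs = seq(0, 1, length.out = n_fine + 1))))
  inner  <- breaks[-c(1, length(breaks))]   # interior cut points
  n_bins <- length(inner) + 1
  bin    <- findInterval(x, inner) + 1      # bin index in 1..n_bins
  pos    <- tabulate(bin[y == 1], nbins = n_bins)
  neg    <- tabulate(bin[y == 0], nbins = n_bins)

  while (length(pos) > 2) {
    current <- woe_iv(pos, neg)
    # Adjacent bins i and i + 1 with the most similar WoE.
    i <- which.min(abs(diff(current$woe)))
    new_pos <- c(pos[seq_len(i - 1)], pos[i] + pos[i + 1], pos[-seq_len(i + 1)])
    new_neg <- c(neg[seq_len(i - 1)], neg[i] + neg[i + 1], neg[-seq_len(i + 1)])
    merged  <- woe_iv(new_pos, new_neg)
    # Stop if this merge would lose more than max_iv_loss of the current IV.
    if (current$iv > 0 && (current$iv - merged$iv) / current$iv > max_iv_loss) break
    pos   <- new_pos
    neg   <- new_neg
    inner <- inner[-i]                      # drop the cut between bins i and i + 1
  }

  final <- woe_iv(pos, neg)
  list(breaks = c(-Inf, inner, Inf), woe = final$woe, iv = final$iv)
}

# Usage on simulated data where the outcome rate rises with x.
set.seed(1)
x <- runif(2000)
y <- rbinom(2000, 1, plogis(-1 + 2 * x))
res <- merge_bins_by_woe(x, y)
print(res$breaks)
print(round(res$woe, 3))
```

A real step would also need to handle missing values, factor levels (merging levels rather than adjacent intervals), and minimum bin sizes, none of which this sketch attempts.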