feature_engine icon indicating copy to clipboard operation
feature_engine copied to clipboard

The `switch` argument in `DropHighPSIFeatures` should be removed

Open sancholp opened this issue 9 months ago • 5 comments

Describe the bug In DropHighPSIFeatures it is stated that the PSI metric is not symmetric, and hence a boolean argument switch is introduced. The PSI however is symmetric by definition, so this argument should be removed.

PSI(a, b) = PSI(b, a) by definition, as it is a symmetrized version of the KL divergence. See p. 85 of Kullback & Leibler 1951.

sancholp avatar Apr 30 '25 08:04 sancholp

Thanks for raising this issue adn linking the resource. I'd have to read the paper before making a decision.

@gverbock @glevv are you perhaps more familiar than me on this topic?

solegalli avatar May 01 '25 13:05 solegalli

Yes, this is correct. PSI is one of the symmetrical forms of KLD, so there is no need in switch argument.

glevv avatar May 02 '25 15:05 glevv

To compute the PSI you need to define bins (for example based on the decile of the population). In PSI(a,b), a is selected to define the bins. Using the switch argument, b will be selected. If a and b have different distributions, I expect PSI(a,b) will not be equal to PSI(b,a) because the binning is different. Give me a couple of days to think about it. Happy to remove the switch if it has no impact.

gverbock avatar May 04 '25 09:05 gverbock

I see.

In this case, we may need to re-word the docs and user guide, because we are saying that the PSI is not symmetric, and if I understand correctly, what's not symmetric is how we calculate it due to the binning.

I'll take a look at the original article when I get the change and give it another thought.

Thank you so much for your contributions!

solegalli avatar May 06 '25 01:05 solegalli

In that case the wording is misleading. The PSI metric is defined to be symmetric. In the implementation here you make the decision of binning with respect to a (and hence the symmetry breaks). I propose you remove the argument altogether as it makes it seem that the PSI is asymmetric. Instead, explain the binning process and the implications of it.

sancholp avatar May 14 '25 12:05 sancholp

But if the formula implemented by the estimator is asymmetric, because of the binning, then it makes sense to keep the switch argument imo. Perhaps the explanation of the binning formula, and the asymmetry property, should be added more clearly to the docstring.

fkiraly avatar Jul 03 '25 14:07 fkiraly