[BUG] ValuationCorrelation produces NaNs in ranking link ratios
**Describe the bug**
The most recent version of scipy (1.10) produces NaNs when ranking link ratios during initialization of a ValuationCorrelation object. The Mack valuation correlation test requires link ratios to be ranked for each pair of development periods, but when using the package, only the first column gets ranked because it is the only one fully populated.
The following output uses the Mack 97 data set and is an intermediate calculation from the `__init__()` function, stored in the `m1` variable:
Input triangle:

```
          12-24     24-36     36-48     48-60     60-72     72-84     84-96    96-108   108-120
1991   1.649840  1.319023  1.082332  1.146887  1.195140  1.112972  1.033261  1.002902  1.009217
1992  40.424528  1.259277  1.976649  1.292143  1.131839  0.993397  1.043431  1.033088       NaN
1993   2.636950  1.542816  1.163483  1.160709  1.185695  1.029216  1.026374       NaN       NaN
1994   2.043324  1.364431  1.348852  1.101524  1.113469  1.037726       NaN       NaN       NaN
1995   8.759158  1.655619  1.399912  1.170779  1.008669       NaN       NaN       NaN       NaN
1996   4.259749  1.815671  1.105367  1.225512       NaN       NaN       NaN       NaN       NaN
1997   7.217235  2.722886  1.124977       NaN       NaN       NaN       NaN       NaN       NaN
1998   5.142117  1.887433       NaN       NaN       NaN       NaN       NaN       NaN       NaN
1999   1.721992       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
```
Ranks:

```
array([[[[ 1., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 9., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 4., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 3., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 8., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 5., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 7., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 6., nan, nan, nan, nan, nan, nan, nan, nan],
         [ 2., nan, nan, nan, nan, nan, nan, nan, nan]]]])
```
I believe the culprit is a change to the signature of `scipy.stats.rankdata()`. There is a new parameter called `nan_policy` which specifies how NaNs should be handled if they appear in the input. Prior to this change there was no such parameter, so I would assume the default way of handling the ranking was to omit the NaNs.
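For illustration, here is a minimal sketch of the behavior change on a single partially filled column (assuming scipy >= 1.10; the column values are made up):

```python
import numpy as np
from scipy.stats import rankdata

# One development column with a NaN tail, as in a run-off triangle.
col = np.array([1.65, 2.64, np.nan, 2.04])

# Default nan_policy='propagate' (scipy >= 1.10): any NaN in the input
# makes every rank in the output NaN, which is why only the fully
# populated first column survives above.
print(rankdata(col))                     # -> [nan nan nan nan]

# nan_policy='omit': finite values are ranked among themselves and the
# NaN positions stay NaN.
print(rankdata(col, nan_policy='omit'))  # -> [ 1.  3. nan  2.]
```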
**To Reproduce**

```python
import chainladder as cl
import pandas as pd

df = pd.read_csv('mack_1997.csv')
mack97 = cl.Triangle(
    data=df,
    origin='Accident Year',
    development='Calendar Year',
    columns=['Case Incurred'],
    cumulative=True
)
mack97.valuation_correlation(p_critical=.1, total=False).z_critical.values
```
You will see the following warning, and upon further inspection you will find that the intermediate variable `m1` holds an incorrect triangle of link ratio ranks:

```
RuntimeWarning: All-NaN slice encountered
  r, k = function_base._ureduce(a, func=_nanmedian, axis=axis, out=out,
```
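The warning itself is a secondary symptom: once the ranks in a column are all NaN, any nan-aware reduction over that column (the traceback shows `_nanmedian`) hits an all-NaN slice. A numpy-only sketch of the same warning:

```python
import warnings
import numpy as np

# Reducing over a slice that is entirely NaN triggers the same
# "All-NaN slice encountered" RuntimeWarning seen in the traceback.
all_nan = np.array([np.nan, np.nan, np.nan])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.nanmedian(all_nan)

print(result)  # -> nan; the caught warning is a RuntimeWarning
```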
**Expected behavior**
We should see all link ratio periods ranked (the ones labeled "r" in the image from the Mack 97 paper):
Adding the argument `nan_policy='omit'` seems to solve the problem:

```python
m1 = xp.apply_along_axis(rankdata, 2, lr.values, nan_policy='omit') * (lr.values * 0 + 1)
m1
```

```
Out[30]:
array([[[[ 1.,  2.,  1.,  2.,  5.,  4.,  2.,  1.,  1.],
         [ 9.,  1.,  7.,  6.,  3.,  1.,  3.,  2., nan],
         [ 4.,  4.,  4.,  3.,  4.,  2.,  1., nan, nan],
         [ 3.,  3.,  5.,  1.,  2.,  3., nan, nan, nan],
         [ 8.,  5.,  6.,  4.,  1., nan, nan, nan, nan],
         [ 5.,  6.,  2.,  5., nan, nan, nan, nan, nan],
         [ 7.,  8.,  3., nan, nan, nan, nan, nan, nan],
         [ 6.,  7., nan, nan, nan, nan, nan, nan, nan],
         [ 2., nan, nan, nan, nan, nan, nan, nan, nan]]]])
```
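As a side note on the trailing `* (lr.values * 0 + 1)` factor: `x * 0 + 1` is 1.0 where `x` is finite and NaN where `x` is NaN, so the multiplication re-applies the triangle's NaN mask to the ranks. A small 2-D sketch of the same pattern (the real `lr.values` is 4-D and ranked along axis 2; the numbers here are made up):

```python
import numpy as np
from scipy.stats import rankdata

# Toy 2-D stand-in for the link-ratio array: rows are origin periods,
# columns are development periods, padded with NaN.
lr = np.array([
    [1.65, 1.32, 1.08],
    [2.64, 1.54, np.nan],
    [2.04, np.nan, np.nan],
])

# Rank down each column, omitting NaNs so partially filled columns
# still get ranked (scipy >= 1.10).
ranks = np.apply_along_axis(rankdata, 0, lr, nan_policy='omit')

# (lr * 0 + 1) is 1.0 where lr is finite and NaN where lr is NaN;
# multiplying restores the NaN mask (redundant under 'omit', but it is
# what kept older scipy's ranked-NaN values from leaking through).
m1 = ranks * (lr * 0 + 1)
```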
**Desktop (please complete the following information):**
- Numpy Version 1.21.5
- Pandas Version 1.5.3
- Chainladder Version 0.8.14
- Scipy Version 1.10.1
@genedan can you share the dataset, if you have it readily available?
So I figured out that the data shown in the paper is the raa dataset. The sample calculation is shown on page 46. However, I don't think the package is following the math here; I will investigate more.
@genedan, if you have anything else that might be helpful to me, such as the calculation done in Excel, please post it here. :)
Hey @kennethshsu, sorry for being so late on this. I had uploaded a csv file called 'mack97' that I put in the data folder, but I just realized it's the same as raa, so perhaps we can delete mack97.
The package should follow the calculations; if you could let me know where you found an inconsistency, perhaps I can help out. The image I pasted from page 45 of the paper is a bit confusing because it contains two triangles mashed together.
The triangle produced by the internal variable `m1` should match up with the columns with the `r_ij` headers. Two ways to produce this are to either pass the `nan_policy='omit'` argument through `xp.apply_along_axis()` (it is forwarded to `rankdata`) or to downgrade scipy below 1.10.
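A hypothetical compatibility shim covering both options (the helper name and the version check are mine, not chainladder's; it relies on `nan_policy` having been added to `rankdata` in scipy 1.10):

```python
import numpy as np
import scipy
from scipy.stats import rankdata

def rank_omit_nans(values, axis):
    """Rank along `axis`, leaving NaN cells NaN on any scipy version."""
    major, minor = (int(p) for p in scipy.__version__.split('.')[:2])
    if (major, minor) >= (1, 10):
        # nan_policy='omit' ranks the finite values and leaves NaNs NaN.
        ranks = np.apply_along_axis(rankdata, axis, values, nan_policy='omit')
    else:
        # Older scipy ranks NaNs as the largest values; they are
        # re-masked by the multiplication below.
        ranks = np.apply_along_axis(rankdata, axis, values)
    return ranks * (values * 0 + 1)
```

Either branch yields the same ranked array with the NaN mask preserved.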
Yep, the dataset is indeed the raa; we can remove mack97 later.
It's been a while, but I think there may be two problems with `.ValuationCorrelation()`.
First is the issue that you described, related to the ranking of the factors.
Second is what I think is a bug, which still exists on 0.8.18.
```python
import chainladder as cl

xyz = cl.load_sample("xyz")
xyz["Incurred"].valuation_correlation(p_critical=.1, total=False).z_critical
```

This gets you an error: `ValueError: Shape of passed values is (1, 10), indices imply (1, 9)`. I started debugging the code, and that's when I found out that I don't really think the calculation in the package follows Mack's text.
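For what it's worth, that error message is the generic pandas shape mismatch; a minimal reproduction of just the message (not of chainladder's internals) looks like:

```python
import numpy as np
import pandas as pd

# Passing a (1, 10) values array while the supplied index/columns imply
# a (1, 9) frame raises the same ValueError text.
msg = ""
try:
    pd.DataFrame(np.ones((1, 10)), index=[0], columns=range(9))
except ValueError as err:
    msg = str(err)

print(msg)  # e.g. "Shape of passed values is (1, 10), indices imply (1, 9)"
```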
Is this what you see?
Here is my dev branch.