Drop High/Drop Low are not working correctly
I get a different answer depending on whether I fit just the work comp LOB vs all LOBs and then slice work comp. These approaches should yield the same answer, but they don't:
import chainladder as cl
clrd = cl.load_sample('clrd').groupby('LOB')['CumPaidLoss'].sum()
# Passes
assert cl.Development(n_periods=5).fit(clrd).ldf_.loc['wkcomp'] == cl.Development(n_periods=5).fit(clrd.loc['wkcomp']).ldf_
# Fails
assert cl.Development(drop_low=2).fit(clrd).ldf_.loc['wkcomp'] == cl.Development(drop_low=2).fit(clrd.loc['wkcomp']).ldf_
I don't think this ever worked. The reason is that we only have one 2D (origin x development) matrix of the entries to be dropped. This matrix is shared across all triangles. When dropping using n_periods, drop_valutation, drop_origin, etc, this is fine. But dropping based on link-ratio values it is not.
Possibly related. I have a triangle with missing values and drop_high drops the missing value rather than ignoring it. My workaround is to list out how many high values to drop, but it is a bit ugly: WtdAvgxHL = cl.Development(average="volume",drop_low=[True]*41,drop_high=[25,23,22,21,20,18,17,15,13,12,10,9,8,6,5,4,3,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]).fit(Triangle).cdf
Thanks for writing in. Yes, this is totally the same issue. I will work on getting this resolved soon.
#332
It would also be great if when you dropped a valuation AND used drop_high/drop_low it would ignore the valuation you already intended to drop.
In the code, drop and drop_high/drop_low are applied in parallel. (interestingly enough, drop_above/drop_below are applied after.) Not a difficult thing to put them in any order, but I wonder if the better approach is to allow the practitioner to apply the precise order they want via a pipeline. Unfortunately it looks like Development transformers are memoryless at the moment (see example below).
genins = cl.load_sample("genins")
genins_dev_1 =cl.Development(n_periods = 2).fit_transform(genins)
genins_dev_2 =cl.Development(n_periods = 4).fit_transform(genins_dev_1)
Using a Pipeline is an interesting pattern. I hadn't thought of it before. In the spirit of keeping estimators as flexible as possible, this seems like a pretty good idea. Then any combo of simultaneous vs ordered omissions could be had.
pull request #370 created with fix to support 4D drop hi/lo.
@jbogaardt I did some quick tests. Also ran your assertions from the top (pass on both). seems to work. but it's quite a few line. Would appreciate additional review.
there is more work to be done in this neighborhood, including adding 4D support for drop above/below and the pipeline support. perhaps we can move those to new threads.
@henrydingliu, this looks great. I'm all for more lines of code if it makes maintenance easier. We should definitely add the asserts into the test suite so that the same bug isn't introduced at a later date.
pull request #375 created with fix to support 4D drop above/below
this assertion fails without this fix
import chainladder as cl
clrd = cl.load_sample('clrd').groupby('LOB')['CumPaidLoss'].sum()
assert cl.Development(drop_above=1.3).fit(clrd).ldf_.loc['wkcomp'] == cl.Development(drop_above=1.3).fit(clrd.loc['wkcomp']).ldf_
@jbogaardt thanks for reviewing and approving. do you want to talk through pipeline support architecture on this issue or open a new one?
New is probably more appropriate.
I was with some non-actuaries this weekend, and I expressed some verbal excitement when I saw this issue is being fixed. Even with their desire to join in, the excitement of a working xHiLow average was not easily transferable. Thanks for doing this!
with 4D support, do we need to modify the existing weight_ property as well?
also, should weight_ be in base instead of development?
yes we should