Duplication of rows when using _ANY_EVENT as a predicate for windows.
ACES generates duplicate rows when I define windows around the _ANY_EVENT (random events) predicate. I would expect each row to be unique. Here is some minimal code in Colab to reproduce the issue: https://colab.research.google.com/drive/1qER-KA3o6jU3i8StdsNFZrLrdaQ6FA7l?usp=sharing
The TLDR is that on this simple dataframe of three rows:
I get this output dataframe from ACES:
And this output dataframe has only 3 unique rows:
I've also seen the issue on a slightly larger test dataset of 62 rows, where ACES outputs around 152 rows but only 62 of them are unique.
This has been reported to be occurring in a more real-world case with a readmission predictor config. However, I'm not able to reproduce this issue locally. So, for now, the primary task of this issue (which is on @Oufattole) is to provide a test case that we can reproduce on up-to-date code, so we can begin to iterate on it.
@mmcdermott were you able to reproduce the issue with @Oufattole's original example locally? I can still reproduce it on my end with the latest version (main branch code). The result has 5 rows when it should have 3, per the screenshots.
Could you link the readmission predictor config you mentioned? If this real-world case no longer has this issue, maybe it is something particular about this example?
See the PR I pushed @justin13601 which adds a test based on his example that passes. Though maybe I did something wrong in my setup?
To update here -- my test case was wrong; this issue is present.
I think I know the issue though.
In our algorithm, we identify the start and end boundaries of each window in a possible realization of a config in a patient's record, and we often do joins across the recursion calls by subject_id and *_anchor_timestamp to match different windows together over the iterations. But this fails to consider that the start window of one branch might match the end window of another branch in an inappropriate manner (e.g., for a config that models admission and discharge, in one possible extraction the end of the config--the discharge--could conflict with the start of a subsequent extraction if the patient were admitted on the same day as they were discharged).
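To make the failure mode concrete, here is a rough sketch in plain Python. It is not the ACES code: the `join_on_keys` helper, the column names other than subject_id, and the data are all made up for illustration. It only shows how a join keyed on subject_id and an anchor timestamp fans out when a discharge and the next admission share a timestamp.

```python
from collections import defaultdict


def join_on_keys(left, right, keys):
    """Inner-join two lists of dict rows on the given key columns (hypothetical helper)."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in keys)].append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


# Hypothetical record: the patient is discharged and re-admitted at the same timestamp.
boundaries = [
    {"subject_id": 1, "anchor_timestamp": "2020-01-01", "event": "admission"},
    {"subject_id": 1, "anchor_timestamp": "2020-01-05", "event": "discharge"},
    {"subject_id": 1, "anchor_timestamp": "2020-01-05", "event": "admission"},
    {"subject_id": 1, "anchor_timestamp": "2020-01-09", "event": "discharge"},
]

# One candidate window per admission, anchored at the admission timestamp.
windows = [
    {"subject_id": 1, "anchor_timestamp": "2020-01-01", "window": "stay_1"},
    {"subject_id": 1, "anchor_timestamp": "2020-01-05", "window": "stay_2"},
]

keys = ("subject_id", "anchor_timestamp")
joined = join_on_keys(windows, boundaries, keys)

# Project back to the columns a caller would see; the second window now appears twice
# because its key matches both the earlier discharge and its own admission.
projected = [(r["subject_id"], r["anchor_timestamp"], r["window"]) for r in joined]
unique_rows = set(projected)

print(len(projected), len(unique_rows))
```

The join key alone cannot tell which boundary event belongs to which window, so any row whose timestamp coincides with another branch's boundary gets matched more than once. That is the same shape as the symptom in the original report, where there are more output rows than unique rows.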
@Oufattole I think we just pushed a fix to main for this. No new release yet, but our test case that we added for your use now passes, just as a heads up.