Duplication of rows when using _ANY_EVENT as a predicate for windows.

Open Oufattole opened this issue 1 year ago • 4 comments

ACES generates duplicate rows when I define windows around the _ANY_EVENT (random events) predicate. I would expect each row to be unique. Here is some minimal code in Colab to reproduce the issue: https://colab.research.google.com/drive/1qER-KA3o6jU3i8StdsNFZrLrdaQ6FA7l?usp=sharing

The TLDR is that on this simple dataframe of three rows: [image]

I get this output dataframe from ACES: [image]

And this output dataframe has only 3 unique rows: [image]

I've seen the issue on a slightly larger test dataset of 62 rows, where ACES outputs around 152 rows but only 62 of them are unique.
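For anyone checking their own output for the same symptom, here is a minimal sketch of counting exact duplicate rows. It uses plain Python on rows represented as dicts; the helper name `duplicate_summary` and the column names are hypothetical and not part of ACES, and the sample rows are made up for illustration.

```python
from collections import Counter


def duplicate_summary(rows):
    """Count total, unique, and surplus rows in a list of dict rows.

    Assumes every value is hashable and all keys are strings.
    """
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {
        "total": len(rows),
        "unique": len(counts),
        "extra": len(rows) - len(counts),
    }


# Made-up output with the same shape as the symptom described above:
# three distinct rows, two of which appear twice.
rows = [
    {"subject_id": 1, "trigger": "2024-01-01"},
    {"subject_id": 1, "trigger": "2024-01-01"},
    {"subject_id": 1, "trigger": "2024-01-02"},
    {"subject_id": 1, "trigger": "2024-01-02"},
    {"subject_id": 1, "trigger": "2024-01-03"},
]
summary = duplicate_summary(rows)
```

If `summary["extra"]` is greater than zero, the output contains exact duplicates.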

Oufattole • Jul 21 '24 22:07

This has been reported to occur in a more real-world case with a readmission predictor config. However, I'm not able to reproduce the issue locally. So, for now, the primary task on this issue (which is on @Oufattole) is to provide a test case that we can reproduce on up-to-date code, so we can begin to iterate on it.

mmcdermott • Aug 22 '24 16:08

@mmcdermott were you able to reproduce the issue with @Oufattole's original example locally? I still seem to be able to on my end with the latest version (main branch code). The result returns 5 rows when it should return 3, per the screenshots.

Could you link the readmission predictor config you mentioned? If this real-world case no longer has this issue, maybe it is something particular about this example?

justin13601 • Aug 22 '24 20:08

See the PR I pushed, @justin13601, which adds a test based on his example that passes. Though maybe I did something wrong in my setup?

mmcdermott • Aug 22 '24 21:08

To update here -- my test case was wrong, this issue is present.

I think I know the issue though.

In our algorithm, we identify the start and end boundaries of each window in a possible realization of a config in a patient's record, and we often do joins across the recursion calls by subject_id and *_anchor_timestamp to match different windows together over the iterations. But this fails to consider that the start window of one branch might correspond to the end window of another branch in an inappropriate manner (e.g., for a config that models admission and discharge, in one possible extraction the end of the config--the discharge--could conflict with the start of a subsequent extraction if the patient were admitted on the same day as they were discharged).
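To make the failure mode concrete, here is a toy sketch in plain Python. It is not the ACES implementation and not the fix: `hash_join`, the `realization_id` column, and the sample dates are all hypothetical. It only illustrates how a join on (subject_id, anchor_timestamp) multiplies rows once two realizations share the same key.

```python
from collections import defaultdict


def hash_join(left, right, keys):
    """Inner-join two lists of dict rows on the given key columns."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in keys)].append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(tuple(l_row[k] for k in keys), [])
    ]


# Hypothetical patient discharged and readmitted on the same day, simplified
# so that both realizations carry the same anchor timestamp (2024-01-05).
starts = [
    {"subject_id": 1, "anchor_timestamp": "2024-01-05",
     "realization_id": 0, "start": "2024-01-01"},
    {"subject_id": 1, "anchor_timestamp": "2024-01-05",
     "realization_id": 1, "start": "2024-01-05"},
]
ends = [
    {"subject_id": 1, "anchor_timestamp": "2024-01-05",
     "realization_id": 0, "end": "2024-01-05"},
    {"subject_id": 1, "anchor_timestamp": "2024-01-05",
     "realization_id": 1, "end": "2024-01-09"},
]

# The key does not identify a realization, so every start row pairs with
# every end row that shares the anchor.
ambiguous = hash_join(starts, ends, keys=("subject_id", "anchor_timestamp"))

# Adding a column that identifies the realization keeps the valid pairs only.
strict = hash_join(
    starts, ends, keys=("subject_id", "anchor_timestamp", "realization_id")
)

ambiguous_pairs = [(row["start"], row["end"]) for row in ambiguous]
strict_pairs = [(row["start"], row["end"]) for row in strict]
```

Here `ambiguous` has four rows, including the mismatched pair that combines the second stay's start with the first stay's end, while `strict` has the two valid rows. Whether the real fix disambiguates the join this way or some other way is not stated in this thread.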

mmcdermott • Aug 22 '24 23:08

@Oufattole I think we just pushed a fix to main for this. No new release yet, but our test case that we added for your use now passes, just as a heads up.

mmcdermott • Aug 24 '24 00:08