pmlb

Questions about parity5_plus_5

Open amueller opened this issue 2 years ago • 5 comments

Would it be possible to get a description of the parity5_plus_5 dataset? There are several things about it that confuse me. First, there are some duplicate rows, which seems odd. The rows count from 0 to 1023 in binary, but there are 1124 rows in the dataset, meaning there are 100 duplicate rows.
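The row-count arithmetic above can be checked with a small sketch. It does not load the real file: it builds a stand-in table (every 10-bit pattern once, plus 100 arbitrarily chosen repeats), so which rows repeat here is an assumption made only for illustration.

```python
from collections import Counter
from itertools import product
import random

N_BITS = 10  # ten binary features, so 2**10 = 1024 distinct rows are possible

# Stand-in for the real table (assumption): all patterns once, plus 100 repeats.
# Which rows are actually repeated in the real dataset is not known here.
all_rows = list(product([0, 1], repeat=N_BITS))
rng = random.Random(0)
rows = all_rows + rng.sample(all_rows, 100)

counts = Counter(rows)                      # occurrences of each distinct row
n_unique = len(counts)
n_surplus = len(rows) - n_unique            # rows beyond the first copy of each
repeated = [row for row, c in counts.items() if c > 1]

print(len(rows), n_unique, n_surplus)       # 1124 1024 100
```

On the real data, loaded as a pandas DataFrame, `DataFrame.duplicated()` should give the same kind of count.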

Also, I'm not sure I understand the name of the dataset. The equation for the class label seems to be

data['class'] == data[['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']].sum(axis=1) % 2

but I'm not sure what the intuition behind this is or how it relates to the name. I assume there's some simple binary formula behind this, but I don't immediately see it. Or is it just referring to the fact that the other five bits don't influence the outcome?
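As a rough check of that reading, here is a minimal sketch in plain Python rather than pandas. It assumes rows are 10-bit tuples indexed 0 to 9, so that positions 2, 3, 4, 6, 8 correspond to the Bit_2 ... Bit_8 columns in the equation; the helper names `label` and `flip` are made up for this example.

```python
from itertools import product

RELEVANT = (2, 3, 4, 6, 8)    # indices taken from the equation above
IRRELEVANT = (0, 1, 5, 7, 9)  # assumed: the remaining positions of a 10-bit row

def label(row):
    """Parity of the relevant bits: 1 if an odd number of them are set."""
    return sum(row[i] for i in RELEVANT) % 2

def flip(row, i):
    """Return a copy of row with bit i inverted."""
    return row[:i] + (1 - row[i],) + row[i + 1:]

rows = list(product([0, 1], repeat=10))

# Flipping any relevant bit always changes the label ...
assert all(label(flip(r, i)) != label(r) for r in rows for i in RELEVANT)
# ... and flipping any other bit never does.
assert all(label(flip(r, i)) == label(r) for r in rows for i in IRRELEVANT)

# The two classes are evenly split over the 1024 patterns.
print(sum(label(r) for r in rows))  # 512
```

If both assertions pass, the label depends on exactly those five bits and the other five are ignored.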

amueller avatar Oct 10 '23 18:10 amueller

@ryanurbs do you happen to know the equation for this dataset?

lacava avatar Oct 23 '23 13:10 lacava

I think the explanation is actually just that there's a subset of 5 bits whose parity is computed and the other bits are ignored. But I'm still confused by the duplication of some rows.

amueller avatar Oct 23 '23 18:10 amueller

@lacava @amueller I'm looking into getting a definitive answer to your question. We received this dataset from a colleague.

ryanurbs avatar Oct 23 '23 19:10 ryanurbs

@lacava @amueller I found a published description of the parity5+5 problem here: https://sci2s.ugr.es/keel/pdf/algorithm/congreso/liu-3.pdf

You are indeed correct that only 5 of the features are relevant (Bits 2, 3, 4, 6, 8) and the other 5 are randomly generated. The underlying predictive pattern in this dataset is that if there is an even number of zeros across those five features, then the outcome is 1, otherwise 0. I'm not sure why there are extra redundant rows in this dataset; according to the above paper there should be 1024 unique rows. I'm not certain of the exact origins of this particular dataset, so it might not be possible to track down where the extra rows came from, but you might just remove the redundant rows depending on what experiment you are looking to run. The name parity5+5 comes from the fact that this dataset is basically the original parity5 problem with 5 irrelevant features added to it.
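A short sketch of the two points above, under stated assumptions: rows are 10-bit tuples indexed 0 to 9, and the 100 surplus rows are simulated by repeating the first 100 patterns, which is not necessarily how the real file looks. It checks that the "even number of zeros" rule agrees with the `sum % 2` equation from the original question, then drops repeated rows while keeping first occurrences. All function names are hypothetical.

```python
from itertools import product

RELEVANT = (2, 3, 4, 6, 8)  # the five relevant bit positions named above

def label_by_zeros(row):
    """1 if the relevant bits contain an even number of zeros, else 0."""
    zeros = sum(1 for i in RELEVANT if row[i] == 0)
    return 1 if zeros % 2 == 0 else 0

def label_by_sum(row):
    """The formulation from the original question: sum of relevant bits mod 2."""
    return sum(row[i] for i in RELEVANT) % 2

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    return list(dict.fromkeys(rows))

all_rows = list(product([0, 1], repeat=10))

# With five relevant bits, zeros + ones = 5, so an even number of zeros
# means an odd number of ones: the two rules agree on every pattern.
assert all(label_by_zeros(r) == label_by_sum(r) for r in all_rows)

# Simulated surplus (assumption): repeat the first 100 patterns.
noisy = all_rows + all_rows[:100]
clean = drop_duplicates(noisy)
print(len(noisy), len(clean))  # 1124 1024
```

With pandas, `DataFrame.drop_duplicates()` would play the role of the `drop_duplicates` helper here.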

ryanurbs avatar Oct 23 '23 20:10 ryanurbs

@ryanurbs thank you for the explanation. Interesting to know that the published version only has 1024 rows, so this might have been some processing mix-up along the way. Feel free to close. I was asking for openml.org where we might decide to drop the duplicate rows in a new version of the dataset.

amueller avatar Oct 23 '23 21:10 amueller