Missing feature names in Wisconsin dataset

Open trangdata opened this issue 5 years ago • 5 comments

Currently, the features in the Wisconsin Prognostic Breast Cancer dataset do not have names.

The corresponding datasets (I think) on OpenML, and even Kaggle, seem to have this information. It would be helpful for these feature names to be added.

trangdata avatar Apr 10 '20 15:04 trangdata

@lacava Any idea? Should we update this dataset based on OpenML?

weixuanfu avatar Apr 10 '20 16:04 weixuanfu

Similar issue for the tic-tac-toe dataset. OpenML ref: https://www.openml.org/d/50

trangdata avatar Apr 10 '20 17:04 trangdata

> @lacava Any idea? Should we update this dataset based on OpenML?

Sure, we just need to make sure the values match.
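One way to check that two copies match before borrowing feature names is to compare the cell values while ignoring the headers. The following is a minimal sketch, not a description of how PMLB does it: the function names are hypothetical, and it assumes the unnamed copy is tab-separated, the named copy is comma-separated, and both list their columns in the same order.

```python
import csv
import io


def normalise(cell):
    """Parse numeric cells as floats so that '1' and '1.0' compare equal."""
    cell = cell.strip()
    try:
        return float(cell)
    except ValueError:
        return cell  # keep non-numeric cells (e.g. '?') as strings


def load_rows(text, delimiter):
    """Return (header, rows) from delimited text; the first line is the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = [tuple(normalise(c) for c in row) for row in reader if row]
    return header, rows


def values_match(unnamed_text, named_text, unnamed_delim="\t", named_delim=","):
    """Compare cell values only, ignoring headers and row order.

    Assumption: both files have the same column order.
    Returns (match, header_of_named_file).
    """
    _, unnamed_rows = load_rows(unnamed_text, unnamed_delim)
    named_header, named_rows = load_rows(named_text, named_delim)
    same = sorted(unnamed_rows, key=repr) == sorted(named_rows, key=repr)
    return same, named_header
```

If the values match, the header of the named copy is a candidate set of feature names. A mismatch does not prove the sources differ, since column order or value encoding may have changed along the way.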

> It would be helpful for these feature names to be added.

Agreed! If you have the bandwidth to submit a PR, please do.

lacava avatar Apr 10 '20 19:04 lacava

I think it's difficult for outsiders to help because we're not sure where the current datasets came from. I think in general it would also be helpful to add details/metadata for these datasets, e.g. source, meaning of features/classes, as asked here and wished here.

trangdata avatar Apr 10 '20 21:04 trangdata

> I think it's difficult for outsiders to help because we're not sure where the current datasets came from.

Unfortunately we are all in that situation with this project. Fortunately, the source of most of these datasets is pretty obvious. If everyone tackled a few datasets and verified their origin (e.g. through a checksum as in here) we could quickly have origin information attached to most of the datasets. The only realistic way I see it happening is if everyone does a few and submits PRs.
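A checksum comparison of this kind could look roughly like the sketch below. It is illustrative only: the function names are hypothetical, SHA-256 is an assumed choice of hash, and both files are assumed to be available locally.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(local_path, candidate_path):
    """True only if the two files are byte-for-byte identical."""
    return file_sha256(local_path) == file_sha256(candidate_path)
```

A matching digest is strong evidence that the two files are the same. A differing digest is weaker evidence, because any reformatting (delimiter, compression, column order, line endings) changes the bytes even when the underlying data is the same.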

> I think in general it would also be helpful to add details/metadata for these datasets, e.g. source, meaning of features/classes, as asked here and wished here.

Agreed; that's discussed in issue #13. Since PR #11, metadata properties for the datasets are extracted for the readme files.

lacava avatar Apr 10 '20 22:04 lacava