machine-learning
Translate Data to ARFF Format
Create a script that translates data sets to ARFF format, binning continuous attributes and handling missing values (either imputed using expectation-maximization or simply discarded).
I finished the script and put it in data/src/data/arff/. Decisions I made that are open for discussion:
- If an instance has a missing value, that instance is discarded. (If we want to impute instead, I think that has to be done after the net is created.)
- Bins are only created if a feature has 15 or more unique numeric values.
- 4 to 5 bins are created, depending on the number of unique values for the feature.
- Bins are named 'X_Y', where X and Y are the bounds of the bin's range.
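
For discussion, here is a minimal sketch of the rules listed above. It is not the script in data/src/data/arff/. The function names, the set of missing-value markers, the equal-width binning, and the cutoff of 30 unique values for choosing 5 bins over 4 are all assumptions made for illustration; the actual rule for picking the bin count is not stated here.

```python
import math

# Assumed markers for a missing value; the real script may use others.
MISSING = {"", "?", "NA"}


def is_number(s):
    """True if the string parses as a finite float."""
    try:
        return math.isfinite(float(s))
    except ValueError:
        return False


def drop_incomplete(rows):
    """Discard every instance that has at least one missing value."""
    return [r for r in rows if not any(v.strip() in MISSING for v in r)]


def bin_column(values, min_unique=15):
    """Bin a column if it has min_unique or more unique numeric values.

    Otherwise the values are returned unchanged. Bin labels are 'X_Y',
    where X and Y are the bounds of the bin.
    """
    if not all(is_number(v) for v in values):
        return list(values)
    nums = [float(v) for v in values]
    unique = set(nums)
    if len(unique) < min_unique:
        return list(values)
    # Assumed rule: 5 bins for 30+ unique values, otherwise 4.
    n_bins = 5 if len(unique) >= 30 else 4
    lo, hi = min(nums), max(nums)
    width = (hi - lo) / n_bins  # equal-width bins (an assumption)
    labels = []
    for x in nums:
        i = min(int((x - lo) / width), n_bins - 1)  # max value goes in last bin
        labels.append(f"{lo + i * width:g}_{lo + (i + 1) * width:g}")
    return labels


def to_arff(relation, names, rows):
    """Render rows (lists of strings) as ARFF text with nominal attributes."""
    rows = drop_incomplete(rows)
    columns = [bin_column([r[j] for r in rows]) for j in range(len(names))]
    lines = [f"@RELATION {relation}", ""]
    for name, col in zip(names, columns):
        domain = ",".join(f"'{v}'" for v in sorted(set(col)))
        lines.append(f"@ATTRIBUTE {name} {{{domain}}}")
    lines += ["", "@DATA"]
    for i in range(len(rows)):
        lines.append(",".join(f"'{col[i]}'" for col in columns))
    return "\n".join(lines) + "\n"
```

Calling `to_arff("demo", ["x", "cls"], rows)` returns the ARFF text as a string. The sketch quotes every value but does not escape quotes inside values, so it only suits data without embedded quote characters.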