Add 20 Newsgroups dataset to `linfa-datasets` and use it instead of `winequality` in the `linfa-bayes` example
`winequality` isn't a good dataset for `linfa-bayes`, so we need to replace it with something better in the example code. The 20 Newsgroups dataset is already used here. We just need to move it into `linfa-datasets`.
Should the actual data be added, or should the download function be moved into the `linfa-datasets` lib?
For the record, the news20 file sizes:
- 15M datasets/data/20news-bydate.tar.gz
- 35M datasets/data/20news-bydate-test
- 54M datasets/data/20news-bydate-train
Is that 15MB of data? We definitely don't want to add the actual data then. Right now linfa-datasets doesn't support downloading datasets, so we'd need to add an API for that similar to scikit-learn. Downloads should be cached after they're extracted.
Yes, sorry, that is 15MB. What would be the Rust equivalent of the scikit-learn API?
Here's my sketch: in a new module `linfa_datasets::fetch`, add a type `Newsgroups` that acts as a builder for the fetch operation. It has the following API:
- `desired_targets(self, targets: &[&str]) -> Self` works the same way it does in the example and defaults to fetching everything.
- `data_dir(self, dir: impl AsRef<Path>) -> Self` determines where the data is cached (I'm not sure whether it should default to a local path or a shared directory like in scikit-learn).
- `download(self, yes: bool) -> Self` defaults to true. When set to false, `fetch` returns an error instead of performing a download when the data is not cached.
- `fetch(self) -> Result<Dataset<..>, ...>` downloads and unpacks the archive if it's not cached, then loads the data.
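To make the proposal concrete, here is a rough, std-only sketch of what that builder could look like. Nothing in it exists in `linfa-datasets` today: the `FetchError` type, the `new` constructor, the `datasets/data` default, and the cache check against `20news-bydate-train` are all assumptions for illustration. To stay self-contained, `fetch` returns the path of the cached directory as a stand-in for the `Dataset`, and the download step is left unimplemented.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

/// Hypothetical error type for the fetch operation.
#[derive(Debug)]
pub enum FetchError {
    /// The data is not cached and downloading was disabled.
    NotCached(PathBuf),
    /// Placeholder: this sketch does not implement the download itself.
    DownloadUnimplemented,
}

impl fmt::Display for FetchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FetchError::NotCached(p) => write!(f, "no cached data at {}", p.display()),
            FetchError::DownloadUnimplemented => write!(f, "download not implemented"),
        }
    }
}

impl std::error::Error for FetchError {}

/// Builder for fetching 20 Newsgroups (hypothetical API).
#[derive(Debug, Clone)]
pub struct Newsgroups {
    /// `None` means "fetch every target".
    targets: Option<Vec<String>>,
    data_dir: PathBuf,
    download: bool,
}

impl Newsgroups {
    pub fn new() -> Self {
        Newsgroups {
            targets: None,
            // Assumed default; the thread leaves local vs. shared dir open.
            data_dir: PathBuf::from("datasets/data"),
            download: true,
        }
    }

    pub fn desired_targets(mut self, targets: &[&str]) -> Self {
        self.targets = Some(targets.iter().map(|t| t.to_string()).collect());
        self
    }

    pub fn data_dir(mut self, dir: impl AsRef<Path>) -> Self {
        self.data_dir = dir.as_ref().to_path_buf();
        self
    }

    pub fn download(mut self, yes: bool) -> Self {
        self.download = yes;
        self
    }

    /// Stand-in for `fetch(self) -> Result<Dataset<..>, ...>`: returns the
    /// cached training directory instead of a parsed dataset.
    pub fn fetch(self) -> Result<PathBuf, FetchError> {
        let train_dir = self.data_dir.join("20news-bydate-train");
        if train_dir.is_dir() {
            // Cache hit: a real implementation would parse the files here,
            // keeping only `self.targets` when it is `Some`.
            return Ok(train_dir);
        }
        if !self.download {
            return Err(FetchError::NotCached(train_dir));
        }
        // A real implementation would download and unpack the archive here.
        Err(FetchError::DownloadUnimplemented)
    }
}
```

A call would then read something like `Newsgroups::new().desired_targets(&["sci.space"]).download(false).fetch()`, which returns an error rather than touching the network when nothing is cached.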
I think the existing mnist and cifar-ten crates might also be good models for this. It looks like the Newsgroups code is pretty similar to the curl code used in those. I don't have time to look into this before the end of the week, but I'll try to spend some time on it this weekend; I've really enjoyed working on datasets before, and would definitely be interested in making them more accessible and flexible for Linfa.