Decouple tf-idf transform on training set from tf-idf of test set

Open aschmu opened this issue 9 years ago • 0 comments

This is not an issue or bug per se with the FeatureHashing package. I'm wondering whether it's possible to train a model using the tf-idf option with the split function in hashed.model.matrix, but without computing the tf-idf transform on the combined training + test datasets. In many realistic scenarios we don't know in advance which words the test set will contain, so the tf-idf transform fitted on the training set should be decoupled from the one applied to the test set. Normally, at prediction time, one would keep only the words that appeared in the training set and discard the others when constructing the tf-idf matrix, prior to applying the hashing trick.
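To make the request concrete, here is a minimal sketch of the decoupling in plain Python. It is not FeatureHashing's API: the function names (`hash_token`, `hashed_tf`, `fit_idf`, `transform`), the CRC32 hash, the bucket count, and the unsmoothed `log(N / df)` weighting are all assumptions chosen for illustration. The idf weights are fitted on the training documents only and then reused unchanged on the test documents.

```python
import math
import zlib
from collections import Counter

N_BUCKETS = 2 ** 10  # hypothetical hash size, chosen only for illustration


def hash_token(token, n_buckets=N_BUCKETS):
    """Map a token to a bucket index with a deterministic hash (CRC32)."""
    return zlib.crc32(token.encode("utf-8")) % n_buckets


def hashed_tf(doc, n_buckets=N_BUCKETS):
    """Return {bucket: term frequency} for one whitespace-delimited document."""
    tokens = doc.split()
    counts = Counter(hash_token(t, n_buckets) for t in tokens)
    total = len(tokens)
    return {b: c / total for b, c in counts.items()} if total else {}


def fit_idf(train_docs, n_buckets=N_BUCKETS):
    """Compute one idf weight per bucket from the training documents only."""
    df = Counter()
    for doc in train_docs:
        # Count each bucket at most once per document (document frequency).
        df.update({hash_token(t, n_buckets) for t in doc.split()})
    n_docs = len(train_docs)
    return {b: math.log(n_docs / d) for b, d in df.items()}


def transform(docs, idf, n_buckets=N_BUCKETS):
    """Apply a previously fitted idf; buckets never seen in training are dropped."""
    rows = []
    for doc in docs:
        tf = hashed_tf(doc, n_buckets)
        rows.append({b: w * idf[b] for b, w in tf.items() if b in idf})
    return rows


train = ["the cat sat", "the dog ran", "a cat ran"]
test = ["the zebra sat"]

idf = fit_idf(train)               # fitted once, on the training set
train_rows = transform(train, idf)
test_rows = transform(test, idf)   # the test set never influences idf
```

Note that in this sketch the filtering happens at the bucket level, so an unseen test word that collides with a training word's bucket is kept rather than discarded. Filtering on the raw words, as described above, would require storing the training vocabulary.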

aschmu • May 31 '16 00:05