raise StringLengthException if vectoriser is applied to strings that …
…are not greater than or equal in length to ngram_size
Thanks @gw00207
Looks good. Eventually @Bergvca (the maintainer) will need to approve and pull this.
One question though: what happens in those cases in which there are string values with length greater than or equal to ngram_size to begin with but are later trimmed to a length less than ngram_size after cleanup (such as punctuation removal, for example)?
One such string could be 'se,', which for ngram_size=3 and regex=',' would pass your validation test as it is now but would still raise the ValueError exception.
@ParticularMiner i have updated the test and code but obviously adding this regex replacement has a computational cost, interested what you think
@gw00207
obviously adding this regex replacement has a computational cost, interested what you think
Good point. If indeed this validation is computationally costly, then may I suggest you attempt to tackle the problem at its source instead of performing your own validation. What I mean is, you could instead reexamine the traceback log of the original ValueError you obtained to determine which line in string_grouper's code is producing the error, and then place your own error handler there which would, in turn, raise your preferred exception.
Is this alternative acceptable to you?
@ParticularMiner how is this? I can squash commits if required
@gw00207
I've tested it myself and it's perfect!
Though I was surprised by the fact that all strings need to be problematic before a ValueError occurs. Originally, I had thought that just one problematic string would raise an exception. Thanks for bringing that to my attention.
@Bergvca (the maintainer) will let you know whether you need to squash your commits after he approves your PR.
Thanks for this!
Hi @gw00207 ,
Thanks for your interest in this repo, and taking the time to help improve it!. I understand the need for the PR, and the reasoning of @ParticularMiner to get to this implementation.
However, the try/except clause now catches all value errors that the TfIdfVectorizer's "fit" function might throw. It could be that the string length (after stopword/punctuation removal) is the only one, in that case I can approve this PR. Could you check if there are other reasons that could cause a ValueError in the TfIdfVectorizer.fit() function?
Another question - what do we need the separate "Error" subclass?
Thanks!