string_grouper icon indicating copy to clipboard operation
string_grouper copied to clipboard

raise StringLengthException if vectoriser is applied to strings that …

Open gw00207 opened this issue 4 years ago • 6 comments

…are not greater than or equal in length to ngram_size

gw00207 avatar Sep 01 '21 13:09 gw00207

Thanks @gw00207

Looks good. Eventually @Bergvca (the maintainer) will need to approve and pull this.

One question though: what happens in those cases in which there are string values with length greater than or equal to ngram_size to begin with but are later trimmed to a length less than ngram_size after cleanup (such as punctuation removal, for example)?

One such string could be 'se,', which for ngram_size=3 and regex=',' would pass your validation test as it is now but would still raise the ValueError exception.

ParticularMiner avatar Sep 01 '21 14:09 ParticularMiner

@ParticularMiner i have updated the test and code but obviously adding this regex replacement has a computational cost, interested what you think

gw00207 avatar Sep 01 '21 15:09 gw00207

@gw00207

obviously adding this regex replacement has a computational cost, interested what you think

Good point. If indeed this validation is computationally costly, then may I suggest you attempt to tackle the problem at its source instead of performing your own validation. What I mean is, you could instead reexamine the traceback log of the original ValueError you obtained to determine which line in string_grouper's code is producing the error, and then place your own error handler there which would, in turn, raise your preferred exception.

Is this alternative acceptable to you?

ParticularMiner avatar Sep 01 '21 16:09 ParticularMiner

@ParticularMiner how is this? I can squash commits if required

gw00207 avatar Sep 03 '21 15:09 gw00207

@gw00207

I've tested it myself and it's perfect!

Though I was surprised by the fact that all strings need to be problematic before a ValueError occurs. Originally, I had thought that just one problematic string would raise an exception. Thanks for bringing that to my attention.

@Bergvca (the maintainer) will let you know whether you need to squash your commits after he approves your PR.

Thanks for this!

ParticularMiner avatar Sep 03 '21 16:09 ParticularMiner

Hi @gw00207 ,

Thanks for your interest in this repo, and taking the time to help improve it!. I understand the need for the PR, and the reasoning of @ParticularMiner to get to this implementation.

However, the try/except clause now catches all value errors that the TfIdfVectorizer's "fit" function might throw. It could be that the string length (after stopword/punctuation removal) is the only one, in that case I can approve this PR. Could you check if there are other reasons that could cause a ValueError in the TfIdfVectorizer.fit() function?

Another question - what do we need the separate "Error" subclass?

Thanks!

Bergvca avatar Sep 20 '21 13:09 Bergvca