python-tf-idf

Similarities between documents and query may be >1

Open hrs opened this issue 9 years ago • 2 comments

The README claims that similarities between documents and queries shouldn't be greater than 1. However:

import tfidf

# Build a three-document corpus, then score a query against it.
table = tfidf.tfidf()
table.addDocument("foo", ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"])
table.addDocument("bar", ["alpha", "bravo", "charlie", "india", "juliet", "kilo"])
table.addDocument("baz", ["kilo", "lima", "mike", "november"])
print(table.similarities(["alpha", "bravo", "charlie", "india"]))

Yields [['foo', 0.5625], ['bar', 1.0416666666666665], ['baz', 0.0]]. Whoops!

This is happening because the query isn't being normalized. The ranking of results should still be correct, but it'd be better if we normalized it so we can make guarantees about the output.
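For illustration, here is a minimal sketch of one way to get a bounded score: scale both the query vector and each document vector to unit length and take their dot product (cosine similarity). This is not the library's code. The function names (`idf_weights`, `unit_vector`, `cosine_similarities`) and the smoothed IDF formula are assumptions chosen for the example. With non-negative weights, the result always lies between 0 and 1.

```python
import math
from collections import Counter

# Documents from the report above.
docs = {
    "foo": ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"],
    "bar": ["alpha", "bravo", "charlie", "india", "juliet", "kilo"],
    "baz": ["kilo", "lima", "mike", "november"],
}


def idf_weights(documents):
    """Smoothed inverse document frequency per term (hypothetical helper)."""
    n = len(documents)
    df = Counter(term for words in documents.values() for term in set(words))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def unit_vector(words, idf):
    """TF-IDF vector scaled to Euclidean length 1; unknown terms are dropped."""
    counts = Counter(w for w in words if w in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(x * x for x in weights.values()))
    return {term: x / norm for term, x in weights.items()} if norm else {}


def cosine_similarities(documents, query):
    """Return [name, score] pairs; each score is a dot product of unit vectors."""
    idf = idf_weights(documents)
    q = unit_vector(query, idf)
    result = []
    for name, words in documents.items():
        d = unit_vector(words, idf)
        score = sum(q[term] * d[term] for term in q if term in d)
        result.append([name, score])
    return result


scores = cosine_similarities(docs, ["alpha", "bravo", "charlie", "india"])
print(scores)
```

On the example corpus this keeps the same ordering as the reported output (bar first, then foo, with baz at 0), and a query identical to a document scores 1 up to floating-point rounding.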

hrs, Mar 21 '16 19:03

I'm running into the same problem. Please fix it, thanks.

tianye2856, Feb 26 '18 08:02

What solution did you use to fix this?

shanalikhan, Apr 08 '18 18:04