Repetition of words causes detection error
When I input strings like 'hello world hello world hello world', langid can't identify it as English text.
>>> import langid
>>> langid.classify('hello world hello world hello world')
('af', 0.683057652874482)
Thanks for getting in touch! This is an interesting one!
>>> hello world
(array([1426, 1428, 2273, 3948]),)
[1 1 1 1]
('en', -23.719746112823486)
>>> hello world hello world
(array([1339, 1426, 1428, 2273, 3948]),)
[1 2 2 2 2]
('en', -62.565943241119385)
>>> hello world hello world hello world
(array([1339, 1426, 1428, 2273, 3948]),)
[2 3 3 3 3]
('af', -100.6344223022461)
>>> ld
(array([1339]),)
[1]
('en', 2.9972290992736816)
The issue is that, in the training data, the pattern "ld " must be more strongly associated with Afrikaans than with English, especially when considered together with the other patterns in "hello world".
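To illustrate why repetition alone can flip the result, here is a minimal sketch. It assumes a naive-Bayes-style linear score (a per-language prior plus a count-weighted sum of per-feature weights). The weights and priors below are invented for illustration and are not taken from langid's model; only the count vectors come from the output above, padded so that position 0 is always feature 1339.

```python
# Toy linear scorer: score(lang) = PRIOR[lang] + sum(count_i * weight_i[lang]).
# All weights and priors are hypothetical, chosen only to reproduce the
# pattern of labels seen in the debug output. They are NOT langid's values.
WEIGHTS = {
    # Feature order: 1339, 1426, 1428, 2273, 3948
    "en": [-60, -50, -50, -50, -50],
    "af": [-30, -52, -52, -52, -52],
}
PRIOR = {"en": -10, "af": -45}


def score(counts, lang):
    """Prior plus the count-weighted sum of feature weights for one language."""
    return PRIOR[lang] + sum(c * w for c, w in zip(counts, WEIGHTS[lang]))


def classify(counts):
    """Return the language with the highest score."""
    return max(WEIGHTS, key=lambda lang: score(counts, lang))


# Count vectors from the debug output (0 where feature 1339 did not fire).
EXAMPLES = {
    "hello world": [0, 1, 1, 1, 1],
    "hello world hello world": [1, 2, 2, 2, 2],
    "hello world hello world hello world": [2, 3, 3, 3, 3],
    "ld": [1, 0, 0, 0, 0],
}

for text, counts in EXAMPLES.items():
    print(repr(text), "->", classify(counts),
          {lang: score(counts, lang) for lang in WEIGHTS})
```

In this toy model the prior is added once, while each feature's contribution grows with its count. A feature that leans towards 'af' therefore gains influence with every repetition until it outweighs the fixed prior advantage of 'en', even though the same feature on its own is still classified as 'en'.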
Unfortunately, there's no easy fix for this. Is this a problem in a real use case for you?
Not yet. But my code using langid may process millions of texts, and I cannot guarantee that no extreme cases like this one will come up. That said, I have to admit such cases may never actually occur. If there's no easy fix, then not fixing it is fine. Thank you for your patience!