langid.py Repetition of words causes detection error

When I input strings like 'hello world hello world hello world', langid can't identify them as English text.

>>> import langid
>>> langid.classify('hello world hello world hello world')
('af', 0.683057652874482)

Jun 30 '16 03:06 joewong826

Thanks for getting in touch! This is an interesting one!

>>> hello world
(array([1426, 1428, 2273, 3948]),)
[1 1 1 1]
('en', -23.719746112823486)
>>> hello world hello world
(array([1339, 1426, 1428, 2273, 3948]),)
[1 2 2 2 2]
('en', -62.565943241119385)
>>> hello world hello world hello world
(array([1339, 1426, 1428, 2273, 3948]),)
[2 3 3 3 3]
('af', -100.6344223022461)
>>> ld 
(array([1339]),)
[1]
('en', 2.9972290992736816)

The issue is that, in the training data, the pattern "ld " must be more strongly associated with Afrikaans than with English, especially when considered alongside the other patterns in "hello world".
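
To illustrate why repetition can flip the result: in the output above, the feature triggered by "ld " (1339) is absent for a single "hello world" and then has counts 1 and 2 as the phrase is repeated, while the other counts grow in step. The sketch below assumes a score of the form "class bias plus count-weighted feature weights". The weights and names (`LOG_LIKELIHOOD`, `LOG_PRIOR`, `"other"`) are invented for illustration and are not langid's real model values.

```python
# Toy count-based classifier. All numbers are hypothetical and were chosen
# only to reproduce the en -> en -> af pattern seen in the thread.
LOG_LIKELIHOOD = {
    "en": {"ld ": -4.0, "other": -1.0},
    "af": {"ld ": -1.0, "other": -2.0},
}
LOG_PRIOR = {"en": -0.5, "af": -3.0}


def score(counts, lang):
    """Class bias plus the sum of feature counts times per-feature weights."""
    return LOG_PRIOR[lang] + sum(
        n * LOG_LIKELIHOOD[lang][feature] for feature, n in counts.items()
    )


def classify(counts):
    """Return the language with the highest score for these feature counts."""
    return max(LOG_PRIOR, key=lambda lang: score(counts, lang))


# Feature counts mirroring the three inputs above ("other" stands in for
# the four remaining features, lumped together for brevity).
once = {"other": 1}
twice = {"ld ": 1, "other": 2}
thrice = {"ld ": 2, "other": 3}

for counts in (once, twice, thrice):
    print(counts, classify(counts))
```

Because every repetition adds the same increment to each language's score, a feature that leans toward the wrong language eventually outweighs a fixed bias toward the right one.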

Unfortunately, there's no easy fix for this. Is this a problem in a real use case for you?

Jul 05 '16 23:07 saffsd

Not yet. But my code using langid might process millions of texts, and I cannot guarantee that there will be no extreme cases like this one. That said, I have to admit such cases may never actually occur. If there's no easy fix, then not fixing it is fine. Thank you for your patience!

Jul 08 '16 09:07 joewong826
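
One possible mitigation for the bulk-processing concern, offered only as a sketch: collapse immediately repeated word sequences before classification. In the output above the unrepeated 'hello world' is classified as 'en', so this helps for that input, but it has not been checked against langid in general. The helper `collapse_repeats` and its `max_ngram` parameter are hypothetical names, not part of langid.

```python
def collapse_repeats(text, max_ngram=5):
    """Drop back-to-back repeats of word sequences up to max_ngram words long."""
    kept = []
    for word in text.split():
        kept.append(word)
        # If the last n words duplicate the n words before them, drop the copy.
        for n in range(1, max_ngram + 1):
            if len(kept) >= 2 * n and kept[-n:] == kept[-2 * n:-n]:
                del kept[-n:]
                break
    return " ".join(kept)


print(collapse_repeats("hello world hello world hello world"))
```

The result would then be passed to `langid.classify` in place of the raw text. Non-adjacent repeats are left alone, so ordinary prose passes through unchanged.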