langid.py

The text "Hello China" is detected as 'it'

Open gaowenxin95 opened this issue 5 years ago • 6 comments

When I run print(langid.classify("Hello China")), the result is ('it', -37.309250354766846). @Paczesiowa @pquentin @martinth @jnothman @saffsd

gaowenxin95 avatar Sep 07 '20 09:09 gaowenxin95

This can happen on short texts, try a longer one

pquentin avatar Sep 07 '20 10:09 pquentin

> This can happen on short texts, try a longer one

Thanks.

I tried a sentence containing five words and it is still detected wrongly, like this: "hello China you are great"

('it', -31.29085063934326)

When the sentence contains six words, like "hello China you are my sunshine", the result is right:

('en', -49.038776874542236)

Another one, "hello China hello China hello China ", is wrong:

('it', -27.979979038238525)

I would like to know the minimum number of words a sentence needs. @pquentin @martinth @jnothman

gaowenxin95 avatar Sep 08 '20 02:09 gaowenxin95
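
One option for short inputs like the ones above, when the possible languages are known in advance, is to restrict the candidate set with `langid.set_languages`, which the langid.py README documents. The following is a minimal sketch, assuming langid is installed; the helper name `classify_restricted` is hypothetical and not part of the library.

```python
def classify_restricted(text, languages):
    """Classify `text` considering only the given ISO 639-1 language codes.

    `classify_restricted` is a hypothetical helper, not part of langid.
    """
    # Imported lazily so the helper can be defined even where langid is missing.
    import langid

    # set_languages changes module-level state: every later langid.classify
    # call in this process is limited to these candidates as well.
    langid.set_languages(languages)
    return langid.classify(text)
```

For example, `classify_restricted("Hello China", ["en", "zh"])` can only return 'en' or 'zh', so 'it' is ruled out. This does not make short texts easier to classify in general; it only removes candidates that are known to be impossible.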

I am dealing with the same issue. In my case, inputting larger pieces of text is no problem, but I want to know how much reliability improves as the amount of text increases. Moreover, does it have to be real text, or is a bunch of words from the language also fine? Lastly, I wonder what the returned negative coefficient says about the reliability of the detection. I couldn't find information about what this number actually means.

Many thanks in advance.

KoenVanDuin avatar Nov 29 '20 07:11 KoenVanDuin
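
On the meaning of the negative number: as far as the langid.py README describes it, the second value is a log-probability that is not normalized by default, so it is only useful for ranking languages against each other on the same input. The README also describes a `norm_probs=True` option that rescales it to the 0-1 range. The sketch below shows both ideas under stated assumptions: `normalize_log_scores` and `classify_with_confidence` are hypothetical helper names, and in the example scores only the 'it' value comes from this thread; the 'en' and 'de' values are made up for illustration.

```python
import math


def normalize_log_scores(log_scores):
    """Turn per-language log scores into probabilities that sum to 1 (a softmax)."""
    # Subtract the maximum first so that math.exp does not underflow for every entry.
    top = max(log_scores.values())
    weights = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


def classify_with_confidence(text):
    """Ask langid for a normalized score (requires langid to be installed)."""
    from langid.langid import LanguageIdentifier, model

    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    return identifier.classify(text)


# Hypothetical scores: only the 'it' value is taken from this thread.
example = {"it": -37.31, "en": -38.00, "de": -45.00}
probabilities = normalize_log_scores(example)
```

With these made-up inputs, 'it' wins with a probability of roughly 0.67, which illustrates why a large negative raw score says little on its own: what matters is the gap to the runner-up.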

Try my fastlid: pip install fastlid

Fast and accurate, though it depends on fasttext (Windows systems without a C compiler can use the fasttext*.whl files available at https://www.lfd.uci.edu/~gohlke/pythonlibs/).

fastlid also tries to imitate two of langid's functionalities.

ffreemt avatar Aug 05 '21 15:08 ffreemt

Having the same issue. The text "Our fifth module explains some key calculus skills" is detected as 'no' though it has 8 words. In another example, the 4-word text "Discover some angle relationships" is detected as 'sw', but when I changed it to "Discover some angle relationships between them" (6 words), it is detected as 'en' as expected. So what is the minimum number of words needed for reliable detection?

yuviabhi avatar Sep 07 '21 05:09 yuviabhi

+1

everdrone avatar Apr 17 '22 19:04 everdrone