govarnam

Thanglish Word Issue

Open josephmiller2000 opened this issue 4 years ago • 8 comments

[Screenshot: Snap_Shot_02419]

enna = என்ன, en = என், na = ன

josephmiller2000 • Aug 24 '21 16:08

I don't know Tamil, so I can't really fix this problem. It must be a problem with the transliteration scheme. Pinging @Kishore96in since he's familiar with this.

The "How To Write A Word" feature is actually reverse transliteration. It shouldn't give output for English words. That is a bug, which I've fixed in https://github.com/subins2000/varnamd/commit/7563a9e25a5c153c50d42001e710fde96bf9abcd

subins2000 • Aug 24 '21 19:08

[Screenshot: Snap_Shot_02425]

OK, I'm just listing out the issue.

josephmiller2000 • Aug 25 '21 08:08

This issue should be fixed by the changes to the scheme file in https://github.com/varnamproject/libvarnam/pull/152

This is what I get on my system (with the scheme file from that PR): [Screenshot: Screenshot_20210825_182121_crop]

Kishore96in • Aug 25 '21 12:08

Thank you for confirming it, @Kishore96in. I have merged it into GoVarnam. The changes are now live at https://varnam.subinsb.com as well.

subins2000 • Aug 25 '21 18:08

The issue still seems to be reproducible at https://varnam.subinsb.com . @subins2000 Are you not using any wordlist to train that instance for Tamil? If you are interested, I can provide the wordlist that I am using to train varnam.

The 'canonical' way to type 'என்ன' would be 'ennnna', but of course this is not intuitive.

The 'root cause' is that like many other Indian languages, Tamil has multiple sounds which would get mapped to the same English string 'na'. The workaround used in the scheme file for Tamil was to map these sounds to 'na', 'Na', 'nna', and so on (I don't know what the other languages do). In an attempt to allow more 'natural' input, I had modified the scheme file so that all these sounds also have 'na' as a 'secondary' transliteration (the ones inside the nested square brackets). Even with the changes, varnam only shows such suggestions if it is trained with a wordlist (before the changes to the scheme, varnam would not show such suggestions even after learning from a wordlist). Is there some better way to implement this?

To summarize, completely fixing this issue would require changes to the scheme file (already merged) and training with a wordlist.
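To make the behaviour described above concrete, here is a toy sketch in Python. It is not varnam's actual algorithm or scheme format: the token table, the pattern assignments, and the function names are all assumptions made up for illustration. It only shows the idea that each Tamil 'na' sound keeps its own primary pattern, also accepts 'na' as a secondary pattern, and that candidates reached through secondary patterns are suggested only once they appear in a learned wordlist.

```python
from itertools import product

# Hypothetical, heavily simplified token table (NOT the real scheme file):
# Latin pattern -> list of (Tamil output, is_primary_mapping).
TOKENS = {
    "e":   [("எ", True)],
    "n":   [("ந்", True), ("ன்", False), ("ண்", False)],
    "na":  [("ந", True), ("ன", False), ("ண", False)],
    "nn":  [("ன்", True)],
    "nna": [("ன", True)],
    "N":   [("ண்", True)],
    "Na":  [("ண", True)],
}


def segmentations(text):
    """Yield every way to split `text` into patterns present in TOKENS."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in TOKENS:
            for rest in segmentations(text[end:]):
                yield [head] + rest


def candidates(text):
    """Map each candidate word to True if some path uses only primary mappings."""
    found = {}
    for patterns in segmentations(text):
        for combo in product(*(TOKENS[p] for p in patterns)):
            word = "".join(out for out, _ in combo)
            exact = all(primary for _, primary in combo)
            found[word] = found.get(word, False) or exact
    return found


def suggest(text, learned):
    """Primary-only candidates are always shown; others only if learned.

    `learned` maps a word to its frequency in the training wordlist.
    """
    found = candidates(text)
    shown = [w for w, exact in found.items() if exact or w in learned]
    # Learned words come first, ordered by frequency; the sort is stable.
    return sorted(shown, key=lambda w: -learned.get(w, 0))


if __name__ == "__main__":
    print(suggest("enna", {}))             # untrained instance
    print(suggest("enna", {"என்ன": 120}))   # after learning a wordlist
```

In this sketch an untrained instance never offers என்ன for "enna", while a trained one ranks it first. How high it ranks in a real instance would depend on the word frequencies in the corpus used for training.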

Kishore96in • Aug 26 '21 05:08

Thank you for the explanation. It makes more sense now. It's kind of difficult for me to understand since I don't know much about the language.

> The issue still seems to be reproducible at https://varnam.subinsb.com . @subins2000 Are you not using any wordlist to train that instance for Tamil? If you are interested, I can provide the wordlist that I am using to train varnam.

On the server https://varnam.subinsb.com there were no words in the dictionary except for Malayalam. I have now imported about 1 lakh (100,000) words for Tamil. The suggestion என்ன now comes up for "enna", but only in 7th place. Do you have a good word corpus, or is this alright?

> The 'canonical' way to type 'என்ன' would be 'ennnna', but of course this is not intuitive.

How many sounds are there for na in Tamil? In Malayalam there are the following (the mapping is the same in the Malayalam varnam scheme):

  • na (single) - ന
  • nna (double na) - ന്ന
  • Na (single) - ണ
  • NNa (double Na) - ണ്ണ
  • n or n_ (chill of na) - ൻ

From looking at the Tamil scheme file, ன் is marked as a chill letter. Is that so? In the Malayalam scheme, to get a chillaksharam in the middle of a word, we add an underscore after the n (n_). This was a recent change. In Malayalam, chillaksharam usually don't occur in the middle of words, with rare exceptions. Is it the same in Tamil?

subins2000 • Aug 26 '21 18:08

> How many sounds are there for na in Tamil ?

In Tamil, for 'na', we have:

  • ந - tongue touches teeth
  • ன - tongue touches alveolar ridge
  • ண - tongue is slightly curled backwards

As far as I understand, it seems ந and ன are both denoted by ന in Malayalam. The double na-s which you mention would be written in Tamil as ன்ன and ண்ண, i.e. we don't have dedicated conjoined characters to represent those.
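A small Python sketch of the point about doubled letters, for anyone comparing the two scripts. It only uses Unicode code points: a doubled consonant is the consonant, then the script's virama (pulli in Tamil, chandrakkala in Malayalam), then the consonant again. The helper name `doubled` is made up for this example; whether the result is drawn as a conjoined glyph is left to the font and script conventions.

```python
import unicodedata

PULLI = "\u0bcd"         # Tamil virama (pulli)
CHANDRAKKALA = "\u0d4d"  # Malayalam virama (chandrakkala)

# The three Tamil 'na' letters discussed above: ந, ன, ண
TAMIL_NA = ["\u0ba8", "\u0ba9", "\u0ba3"]


def doubled(consonant, virama):
    """Return consonant + virama + consonant, e.g. ன -> ன்ன."""
    return consonant + virama + consonant


if __name__ == "__main__":
    for ch in TAMIL_NA:
        # Print the letter, its code point, and its Unicode character name.
        print(ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
    print(doubled("\u0ba9", PULLI))         # Tamil: two separate letters
    print(doubled("\u0d28", CHANDRAKKALA))  # Malayalam: same code point pattern
```

The code point sequence has the same shape in both scripts; the difference described above is in how the result is written, with Tamil keeping two visible letters.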

> From looking at the Tamil scheme file ன் is mentioned as chill letter. Is it so ?

I don't completely understand the concept of 'chill letters' in Malayalam, but it seems to be a variation of other letters that appears only at the end of words. If so, I don't think that concept exists in Tamil.

Kishore96in • Sep 05 '21 07:09