Bug when transliterating Unicode digraphs

Open TDavLinguist opened this issue 8 months ago • 2 comments

I have been working on a project to develop a Sgaw Karen [ksw] to Thai [tha] language pair, and as part of the project I wanted to develop a transliterator between the three+ Sgaw Karen orthographies. Using https://user.keio.ac.jp/~kato/SgawKarenRomei.pdf as a guide, I developed a small test .dix, shown below:

<dictionary>
 <alphabet>abcdefghijklmnopqrstuvwxyz ဖခဘ ံ ိ ၣ်</alphabet>
 <sdefs/>
 <section id="consonants" type="inconditional">
   <e><p><l>hp</l><r>ဖ</r></p></e>
   <e><p><l>hk</l><r>ခ</r></p></e>
   <e><p><l>b</l><r>ဘ</r></p></e>
 </section>
 <section id="vowels" type="inconditional">
   <e><p><l>i</l><r>ံ</r></p></e>   <!-- U+1036, bytes: e1 80 b6 -->
   <e><p><l>o</l><r>ိ</r></p></e>   <!-- U+102D, bytes: e1 80 ad -->
   <e><p><l>a</l><r></r></p></e>   <!-- empty output is okay -->
   <e><p><l>f</l><r>ၣ်</r></p></e>
 </section>
</dictionary>

It seems to compile successfully to a .bin with lt-comp:

lt-comp lr rom-test.dix rom-test.bin
consonants@inconditional 4 5
vowels@inconditional 3 5

and lt-expand shows the correct mapping:

lt-expand rom-test.dix 
hp:ဖ
hk:ခ
b:ဘ
i:ံ
o:ိ
a:
f:ၣ်

However, when testing with lt-proc -t I get incorrect output:

printf 'hpi hpi\n' | lt-proc -t ./rom-test.bin
ဖi ဖi

(Expected output: ဖံ ဖံ)
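For reference, the intended mapping can be sketched independently of lttoolbox. The Python snippet below is only an illustration of the expected result, assuming a greedy longest-match over the entries of the test .dix. The names `MAPPING` and `transliterate` are hypothetical, and this is not a description of how lt-proc works internally.

```python
# Hypothetical reference transliterator: greedy longest match over the
# entries of the test dictionary. This is NOT lttoolbox's algorithm; it
# only models the output the issue author expects.
MAPPING = {
    "hp": "ဖ",
    "hk": "ခ",
    "b": "ဘ",
    "i": "ံ",
    "o": "ိ",
    "a": "",   # empty output, as in the .dix
    "f": "ၣ်",
}


def transliterate(text: str) -> str:
    # Try longer keys first so that "hp" wins over any single letter.
    keys = sorted(MAPPING, key=len, reverse=True)
    out = []
    pos = 0
    while pos < len(text):
        for key in keys:
            if text.startswith(key, pos):
                out.append(MAPPING[key])
                pos += len(key)
                break
        else:
            # No entry matches here: pass the character through unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("hpi hpi"))
```

With this model, a digraph consonant followed directly by a vowel yields the consonant's output followed by the vowel's output, which is the behaviour the report expects from `lt-proc -t`.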

It seems that none of the vowels will render after a consonant, but a vowel by itself or in succession will render just fine:

printf 'i i\n' | lt-proc -t ./rom-test.bin
ံ

To be sure, I ran the first command through hexdump, and it confirmed that the 'i' is passing through as-is. So it seems to be a compilation problem, not a Unicode problem (or is it a compilation problem stemming from a Unicode problem?).

printf 'hpi hpi\n' | lt-proc -t ./rom-test.bin | hexdump -C
00000000  e1 80 96 69 20 e1 80 96  69 0a                    |...i ...i.|
0000000a
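The bytes in that hexdump can be decoded to make the pass-through explicit. This is a small sketch using only the Python standard library; the variable names are illustrative, and the byte string is copied from the hexdump output above.

```python
import unicodedata

# Bytes copied from the hexdump output in the report.
raw = bytes.fromhex("e1 80 96 69 20 e1 80 96 69 0a")
text = raw.decode("utf-8")

for ch in text:
    # Control characters such as the newline have no Unicode name,
    # so a placeholder is used for them.
    name = unicodedata.name(ch, "<control>")
    print(f"U+{ord(ch):04X} {name}")
```

Each `e1 80 96` sequence decodes to U+1016, the consonant produced for `hp`, and it is followed by a plain ASCII `i` (0x69) where U+1036 (`e1 80 b6`) was expected.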

**Update**
Interestingly, the vowels transliterate without issue if there is a space between them and the digraphs:

echo "hk i" | lt-proc -t rom-test.bin
ခ ံ

However, the consonant and vowel of course need to be together (ခံ), which is not an issue with non-digraph inputs:

echo "bi" | lt-proc -t rom-test.bin
ဘံ

Any help would be appreciated!

TDavLinguist, Aug 14 '25 13:08

Potential workaround:

$ cat f.dix
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz ဖခဘ ံ ိ ၣ်</alphabet>
  <sdefs/>

  <pardef n="C">
    <e><p><l>hp</l><r>ဖ</r></p></e>
    <e><p><l>hk</l><r>ခ</r></p></e>
    <e><p><l>b</l><r>ဘ</r></p></e>
  </pardef>
  <pardef n="V">
    <e><p><l>i</l><r>ံ</r></p></e>   <!-- U+1036, bytes: e1 80 b6 -->
    <e><p><l>o</l><r>ိ</r></p></e>   <!-- U+102D, bytes: e1 80 ad -->
    <e r="LR"><p><l>a</l><r></r></p></e>   <!-- empty output is okay -->
    <e><p><l>f</l><r>ၣ်</r></p></e>
  </pardef>

  <section id="main" type="inconditional">
    <e><par n="C"/><par n="V"/></e>
    <e><par n="C"/></e>
    <e><par n="V"/></e>
  </section>
</dictionary>
$ lt-comp lr f.dix f.bin
main@inconditional 6 14
$ printf 'hpi hpi\n' | lt-proc -t f.bin
ဖံ ဖံ
$ printf 'i i\n' | lt-proc -t f.bin
 ံ
$ printf 'b hp hk\n' | lt-proc -t f.bin
ဘ ဖ ခ
$ printf 'bi hpa hko bf\n' | lt-proc -t f.bin
ဘံ ဖ ခိ ဘၣ်
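To see why the workaround covers the failing case, it can help to list which strings the `main` section spans once the two pardefs are combined. The Python sketch below is a hypothetical illustration, not the output of lt-expand: the lists `C` and `V` and the function `expand_main` are made-up names, the ordering may differ from what lttoolbox produces, and the `r="LR"` restriction on the `a` entry is ignored.

```python
from itertools import product

# Entries of the two pardefs in f.dix as (left, right) pairs.
C = [("hp", "ဖ"), ("hk", "ခ"), ("b", "ဘ")]
V = [("i", "ံ"), ("o", "ိ"), ("a", ""), ("f", "ၣ်")]


def expand_main():
    # <e><par n="C"/><par n="V"/></e>: every consonant followed by every vowel
    combined = [(cl + vl, cr + vr) for (cl, cr), (vl, vr) in product(C, V)]
    # <e><par n="C"/></e> and <e><par n="V"/></e>: bare consonants and vowels
    return combined + C + V


entries = expand_main()
for left, right in entries:
    print(f"{left}:{right}")
```

In this model, every consonant plus vowel sequence such as `hpi` is a single top-level entry, so the vowel no longer has to be matched as a separate entry immediately after a digraph.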

I'm not sure what exactly the rule is here. Should a top-level entry always contain a full composed character? (I.e., when you start from a section, you can't have "dangling" or incomplete characters.)

It still feels like this shouldn't be needed, but I guess it depends on how complicated it is to fix and how much work it is for such workarounds to cover all cases.

unhammer, Aug 14 '25 13:08

Thank you @unhammer for the workaround! I can confirm it's doing what I want, for now at least. Your point that "I'm not sure what exactly the rule is here" mirrors my thoughts exactly. When I compiled the example given on the wiki page, lt-proc seemed to handle digraphs just fine! Maybe once I get more adept at it, I'll volunteer an edit for the wiki :)

TDavLinguist, Aug 14 '25 13:08