begword behaves differently than word when it comes to case insensitivity

Open egli opened this issue 8 years ago • 9 comments

The comment below describes the behavior of word. For the behavior of begword see https://github.com/liblouis/liblouis/issues/507#issuecomment-368043721.


When we define no capitalization-related opcodes, such as capsletter or begcapsword, we would expect the translation to ignore any capitalization. In other words, if we define the table below, we would expect that all the variations of the word "mit" (e.g. "mit", "Mit", "MIT", "mIt", "miT" and "MiT") would be contracted to "t".

uplow Ii 24
uplow Mm 134
uplow Tt  2345
word mit 2345

However, at the moment liblouis seems to assume that if capitalization changes mid-word, a word rule such as the one above will not apply.

$ echo "mit" | lou_translate foo.tbl
t
$ echo "Mit" | lou_translate foo.tbl
mit
$ echo "MIT" | lou_translate foo.tbl
t
$ echo "MiT" | lou_translate foo.tbl
mit
$ echo "mIt" | lou_translate foo.tbl
mit

Wouldn't it be more logical to assume that if no capitalization opcodes are defined that we could apply the contraction?

egli avatar Feb 01 '18 13:02 egli

Actually what Liblouis does makes sense I think. Maybe not in your case, but I think your case is a bit unusual.

If you don't define any capitalization opcodes, Liblouis can't indicate the capitalization correctly if it applies contractions case-insensitively, like you suggest. Let's take this example:

uplow Ii 247,24
uplow Mm 1347,134
uplow Tt 23457,2345
word mit 134-24-2345

Without the contraction, Mit translates to Mit, i.e. the capitalization info is preserved. With the contraction, Mit would translate to mit, i.e. the capitalization info would be lost. This is OK for German, but not in the general case. In the general case Liblouis has to assume that you want to preserve capitalization, and that you define contractions with capitals explicitly with rules like:

word Mit 134-24-2345

(I'm not sure this works at the moment.)

The solution I suggested before is to define a fake (virtual) cap sign and remove it in a second pass. This would actually be more in line with the German braille rule I think. German braille basically has an empty (zero-width) cap sign because you don't care about capitals in braille. For Liblouis, an empty cap sign is not the same as an undefined cap sign though, hence the trick with the virtual sign and the second pass.

If this solution is too hackish for you, maybe an alternative could be to allow you to explicitly specify that you want to ignore capitalization. Right now it is kind of implicitly specified with the omission of capitalization opcodes and with the uplow opcodes with only one dot pattern, but it's a bit hard to automatically detect this. Actually, we could in theory compare the definitions of all the capitals in the contraction (M) with their lowercase (m) and derive from that whether you want to ignore caps or not, but this sounds a bit obscure. I think I prefer explicit.
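To make the "compare the capitals with their lowercase" idea concrete, here is a minimal sketch in Python. It is not Liblouis code: `parse_uplow` and `ignores_caps` are hypothetical helpers, and the sketch assumes that an uplow rule with a single dot pattern uses that pattern for both cases and that every letter of the contraction has an uplow rule.

```python
def parse_uplow(line):
    """Parse one 'uplow Xx upper[,lower]' line (hypothetical helper)."""
    opcode, chars, dots = line.split()
    if opcode != "uplow" or len(chars) != 2:
        raise ValueError(f"not an uplow rule: {line!r}")
    upper_dots, _, lower_dots = dots.partition(",")
    # Assumption: a single dot pattern is shared by both cases.
    return chars[0], chars[1], upper_dots, lower_dots or upper_dots


def ignores_caps(uplow_lines, contraction):
    """True if every letter of the contraction has identical dots in both cases."""
    table = {}
    for line in uplow_lines:
        _upper, lower, upper_dots, lower_dots = parse_uplow(line)
        table[lower] = (upper_dots, lower_dots)
    return all(table[ch][0] == table[ch][1] for ch in contraction.lower())


german = ["uplow Ii 24", "uplow Mm 134", "uplow Tt  2345"]
general = ["uplow Ii 247,24", "uplow Mm 1347,134", "uplow Tt 23457,2345"]
print(ignores_caps(german, "mit"))   # capitals and lowercase share dots
print(ignores_caps(general, "mit"))  # capitals carry an extra dot 7
```

This also shows why the implicit approach feels obscure: the answer depends on which letters happen to occur in each contraction rather than on one explicit setting.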

bertfrees avatar Feb 12 '18 15:02 bertfrees

What would work is the following:

swapcc uptolow ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
correct %uptolow %uptolow
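As I understand it, the intended effect of these two rules is to replace each uppercase letter with its lowercase counterpart before the contraction rules are considered. A rough Python analogue of that effect (an illustration only, not how Liblouis implements swapcc or correct; `UPTOLOW` and `downcase` are names made up for this sketch):

```python
import string

# Position-by-position mapping, analogous to the two operands of "swapcc uptolow".
UPTOLOW = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def downcase(text):
    """Replace A-Z with a-z and leave every other character untouched."""
    return text.translate(UPTOLOW)


for word in ["mit", "Mit", "MIT", "MiT", "mIt"]:
    print(word, "->", downcase(word))
```

Since every variant ends up as "mit", a subsequent word mit rule would see the same input regardless of the original capitalization.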

egli avatar Feb 19 '18 11:02 egli

So basically there are three cases that need to be supported:

  • "empty" capsign (SBS use case)
  • capitalization as defined by uplow rules (current default when capsign absent)
  • capsign <dot-pattern>

Other cases could be thought of, for example the caps7 proposal from Bue (https://github.com/liblouis/liblouis/issues/252)

bertfrees avatar Feb 23 '18 11:02 bertfrees

liblouis behaves slightly differently when you use the begword instead of the word opcode. Just as a reminder, with the word opcode:

uplow Ii 24
uplow Mm 134
uplow Tt  2345
word mit 2345

you get the following translation:

$ echo "mit" | lou_translate foo.tbl
t
$ echo "Mit" | lou_translate foo.tbl
mit
$ echo "MIT" | lou_translate foo.tbl
t
$ echo "MiT" | lou_translate foo.tbl
mit
$ echo "mIt" | lou_translate foo.tbl
mit

However, when using the begword opcode the behavior is different:

uplow Ii 24
uplow Mm 134
uplow Tt  2345
begword mit 2345

$ echo "mitt" | ./tools/lou_translate foo.tbl
tt
$ echo "Mitt" | ./tools/lou_translate foo.tbl
tt
$ echo "MITT" | ./tools/lou_translate foo.tbl
tt
$ echo "mItt" | ./tools/lou_translate foo.tbl
mitt
$ echo "miTt" | ./tools/lou_translate foo.tbl
mitt

To be consistent, "Mitt" should translate to "mitt" when using the begword opcode, or "Mit" should translate to "t" when using the word opcode.
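The two runs above can be compared mechanically. The sketch below only tabulates the outputs reported in this comment; it does not call liblouis. It groups the inputs by the case pattern of their first three letters and reports where word and begword disagree. The helper names are made up for this illustration.

```python
# Outputs copied from the two lou_translate runs above.
WORD_RUN = {"mit": "t", "Mit": "mit", "MIT": "t", "MiT": "mit", "mIt": "mit"}
BEGWORD_RUN = {"mitt": "tt", "Mitt": "tt", "MITT": "tt", "mItt": "mitt", "miTt": "mitt"}


def case_shape(s):
    """Classify a string as lower, upper, capitalized or mixed case."""
    if s.islower():
        return "lower"
    if s.isupper():
        return "upper"
    if s[0].isupper() and s[1:].islower():
        return "capitalized"
    return "mixed"


def contracted_by_shape(run, contracted_output):
    """Map each case shape of the rule-sized prefix to the set of outcomes seen."""
    result = {}
    for text, output in run.items():
        shape = case_shape(text[:3])
        result.setdefault(shape, set()).add(output == contracted_output)
    return result


word = contracted_by_shape(WORD_RUN, "t")
begword = contracted_by_shape(BEGWORD_RUN, "tt")
differing = sorted(shape for shape in word if word[shape] != begword.get(shape))
print(differing)  # ['capitalized']
```

Only the capitalized form ("Mit" versus "Mitt") is treated differently by the two opcodes; lowercase, all-caps and mixed-case inputs agree.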

egli avatar Feb 23 '18 15:02 egli

I've added the label "bug" because the behavior of word and begword should indeed be consistent. The behavior of word in the initial comment is correct I think.

bertfrees avatar Jun 21 '19 21:06 bertfrees

Actually I think maybe word mit 2345 shouldn't even match "MIT", and begword mit 2345 shouldn't match "MIT" or "MITT" (if no capital marks are defined that is).

bertfrees avatar Aug 14 '19 21:08 bertfrees

We should clarify what the real requirements are here. There is an attempt to start something along that line in the case sensitivity yaml test.

egli avatar Aug 28 '20 15:08 egli

This issue came up again today.

The workaround that I mentioned in my comment above is apparently used in the German tables: https://github.com/liblouis/liblouis/blob/6dd1e5734138665347ded054dc135d9b8c869c13/tables/de-g0-core.uti#L129-L134 (Not sure if I realized that before). With this workaround, "mit", "Mit" and "MIT" will all be contracted. "MiT" and "mIt" will not be contracted. I still think this is the correct behavior of Liblouis.

Because in the meantime we deprecated uplow, I was curious about whether the behavior changed. I tested it with the following test:

table: |
  lowercase i 24
  lowercase m 134
  lowercase t  2345
  base uppercase I i
  base uppercase M m
  base uppercase T t
  word mit 2345
tests:
  - - "mit"
    - "t"
  - - "Mit"
    - "t"
  - - "MIT"
    - "t"
  - - "MiT"
    - "mit"
  - - "mIt"
    - "mit"

So apparently, even if no capital signs are defined, "Mit" is now contracted. This makes sense. (@egli This means you don't need your fake virtual cap signs anymore.)
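One way to summarize the five results of this test is the predicate below. It is only a description that fits these data points, not a claim about how Liblouis decides internally, and `would_contract` is a hypothetical name.

```python
def would_contract(text, rule="mit"):
    """Fit to the test above: contract lowercase, Capitalized and ALL-CAPS input."""
    if text.lower() != rule:
        return False
    capitalized = text[:1].isupper() and text[1:].islower()
    return text.islower() or text.isupper() or capitalized


# Expected translations copied from the YAML test above.
expected = {"mit": "t", "Mit": "t", "MIT": "t", "MiT": "mit", "mIt": "mit"}
for text, braille in expected.items():
    print(text, would_contract(text), braille == "t")
```

For each input, the predicate agrees with whether the test expects the contraction "t".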

"MiT" and "mIt" are still not contracted. Now that uplow is deprecated, my opinion about how Liblouis should behave when no capital signs are defined changed: I'm not convinced anymore that in that case Liblouis should not contract "MiT" and "mIt".

bertfrees avatar Jan 06 '22 12:01 bertfrees

"MiT" and "mIt" are still not contracted. Now that uplow is deprecated, my opinion about how Liblouis should behave when no capital signs are defined changed: I'm not convinced anymore that in that case Liblouis should not contract "MiT" and "mIt".

Hm, interesting.

The issue has become kinda moot, as we downcase the input in the German tables now. The "Fake"-use of capitalization doesn't exist anymore.

So I'm not sure what to do with this issue. Should we close it and add another one specific to "MiT" and "mIt"?

egli avatar May 19 '22 08:05 egli