Write a new Non-ASCII-to-TeX conversion file

Open ascherer opened this issue 8 years ago • 3 comments

We need a new utf8.sty file along the lines of ecma94.sty. Also a new utf8.w transliteration table similar to ecma94.w might be useful – this will require an extended @l directive for multi-byte encoded characters.

ascherer, Jul 09 '17 15:07

See this post for starters.

The macro file utf8plainmac works nicely by introducing ‘special’ characters. However, at least on a first try, (German) hyphenation doesn't work the way it does with ‘Latin-1’-encoded input. This might have something to do with the T1 font encoding.

ascherer, Jul 09 '17 17:07

And here is a nice UTF-8 table. (Look for block ‘Latin-1 Supplement’ to find #c3 84 for ‘Ä’ and block ‘Currency Symbols’ to find #e2 82 ac for ‘€’.)
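
For reference, the two byte sequences quoted above can be derived from the code points U+00C4 (‘Ä’) and U+20AC (‘€’). The following is only a minimal sketch of the standard UTF-8 encoding rule; the function name utf8_encode is made up for illustration and is not part of CWEB or cwebbin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from CWEB): encode one Unicode code point as
   UTF-8. Returns the number of bytes written to out (1..4), or 0 if cp is
   a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx followed by 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x00C4, buf) fills buf with c3 84, and utf8_encode(0x20AC, buf) fills it with e2 82 ac, matching the table entries mentioned above.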

ascherer, Jul 10 '17 14:07

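Scanning a transliteration directive @l starts in ctangle at line 1509. Instead of expecting exactly two hex digits followed by whitespace, read 2–8 hex digits (one to four bytes) with uint32_t i; sscanf(loc,"%"SCNx32,&i); (note that sscanf needs the plain pointer &i; casting it to uint32_t would be wrong, and a bare %x expects an unsigned int *).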
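
A minimal sketch of what such a scanning step could look like, assuming the key is written as an even number of hex digits immediately followed by whitespace. The helper name scan_hex_key is hypothetical and does not exist in CTANGLE; the real code works on CWEB's own loc and limit pointers, which this sketch ignores.

```c
#include <ctype.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not from CTANGLE): parse the hex key of an @l
   directive starting at loc. Accepts 2, 4, 6 or 8 hex digits followed by
   whitespace. Returns the number of digits consumed, or 0 on error. */
int scan_hex_key(const char *loc, uint32_t *key)
{
    int n = 0;

    /* Count hex digits by hand, because sscanf's %x would also accept
       leading whitespace, a sign, or a "0x" prefix. */
    while (n < 8 && isxdigit((unsigned char)loc[n]))
        n++;
    if (n < 2 || n % 2 != 0)
        return 0;                 /* need whole bytes: 2, 4, 6 or 8 digits */
    if (!isspace((unsigned char)loc[n]))
        return 0;                 /* more than 8 digits or missing separator */

    /* SCNx32 is the matching conversion for uint32_t; pass the pointer
       itself, without any cast. */
    if (sscanf(loc, "%8" SCNx32, key) != 1)
        return 0;
    return n;
}
```

With this sketch, "e282ac EURO" yields the key 0xE282AC and "84 Ae" yields 0x84, while an odd digit count such as "c38 x" is rejected.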

But don't use a full transliteration table indexed on uint32_t, because this would be huge; use a sparse matrix/map instead (possibly in C++).
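
One possible shape for such a sparse map, sketched in plain C rather than C++ to stay close to CWEB itself: a small open-addressing hash table keyed on the packed UTF-8 bytes. All names, the table size and the maximum replacement length are assumptions made up for this sketch, not taken from CTANGLE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRANSLIT_SLOTS 256   /* hypothetical capacity; must be a power of two */
#define TRANSLIT_MAXLEN 16   /* hypothetical limit, including the final '\0' */

struct translit_entry {
    uint32_t key;                /* packed UTF-8 bytes, e.g. 0xC384 for 'Ä' */
    char text[TRANSLIT_MAXLEN];  /* replacement; an empty string marks a free slot */
};

/* Static storage is zero-initialized, so every slot starts out free. */
static struct translit_entry translit_table[TRANSLIT_SLOTS];

/* Multiplicative hash; the top 8 bits select one of the 256 slots. */
static size_t translit_slot(uint32_t key)
{
    return (size_t)((key * 2654435761u) >> 24) & (TRANSLIT_SLOTS - 1);
}

/* Insert or overwrite a mapping. Returns 0 on success, -1 if the text is
   empty or too long, or if the table is full. */
int translit_put(uint32_t key, const char *text)
{
    assert(text != NULL);
    size_t len = strlen(text);
    if (len == 0 || len >= TRANSLIT_MAXLEN)
        return -1;
    size_t i = translit_slot(key);
    for (size_t probes = 0; probes < TRANSLIT_SLOTS; probes++) {
        struct translit_entry *e = &translit_table[i];
        if (e->text[0] == '\0' || e->key == key) {
            e->key = key;
            memcpy(e->text, text, len + 1);
            return 0;
        }
        i = (i + 1) & (TRANSLIT_SLOTS - 1);   /* linear probing */
    }
    return -1;
}

/* Look up a mapping. Returns the replacement text, or NULL if absent. */
const char *translit_get(uint32_t key)
{
    size_t i = translit_slot(key);
    for (size_t probes = 0; probes < TRANSLIT_SLOTS; probes++) {
        const struct translit_entry *e = &translit_table[i];
        if (e->text[0] == '\0')
            return NULL;                      /* free slot ends the probe chain */
        if (e->key == key)
            return e->text;
        i = (i + 1) & (TRANSLIT_SLOTS - 1);
    }
    return NULL;
}
```

Because the key holds the full byte sequence, translit_put(0xC3BC, "ue") and a separate entry for 0xC2BC would not clash. The table has no delete operation, which is what keeps plain linear probing correct here.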

ascherer, Jul 10 '17 18:07

With this latest commit CTANGLE has learned to handle UTF-8 characters in C identifiers, at least in part. See the TeX part of modified section 59 for details.

Here's a very short excerpt of a possible utf8.w include file:

@q<Strip leading byte 'c3'@>
@l 84 Ae
@l 96 Oe
@l 9c Ue
@l a4 ae
@l b6 oe
@l bc ue @q<collision with 'c2 bc'=='¼'@>
@l 9f ss

@q<Strip leading byte 'e2' and first continuation byte '82'@>
@l ac EURO

ascherer, Aug 21 '22 08:08