Write a new Non-ASCII-to-TeX conversion file
We need a new utf8.sty file along the lines of ecma94.sty. Also a new utf8.w transliteration table similar to ecma94.w might be useful – this will require an extended @l directive for multi-byte encoded characters.
See this post for starters.
The macro file utf8plainmac works nicely by introducing ‘special’ characters. However, at least on a first try, (German) hyphenation doesn't work the way it does with ‘Latin-1’-encoded input. This might have something to do with the T1 font encoding.
And here is a nice UTF-8 table. (Look for block ‘Latin-1 Supplement’ to find #c3 84 for ‘Ä’ and block ‘Currency Symbols’ to find #e2 82 ac for ‘€’.)
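The byte values quoted from the table can be reproduced with a small encoder. The following is only a sketch for checking values by hand; `utf8_encode` is a hypothetical helper name and is not part of CWEB.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of CWEB): encode code point `cp` as UTF-8
   into `out`. Returns the number of bytes written (1 to 4), or 0 if `cp`
   is a surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are invalid */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling `utf8_encode(0x00C4, buf)` fills `buf` with `c3 84` for ‘Ä’, and `utf8_encode(0x20AC, buf)` yields `e2 82 ac` for ‘€’.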
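Scanning a transliteration directive @l starts in ctangle at line 1509. Instead of expecting exactly two hex digits followed by whitespace, read 2–8 hex digits (one to four bytes) with sscanf(loc,"%" SCNx32,&i); where i is a uint32_t and SCNx32 comes from <inttypes.h>. (Casting the pointer to an integer, as in (uint32_t)&i, would be a bug.)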
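A minimal sketch of such a scanner, assuming the extended directive writes the whole byte sequence as one run of hex digits (e.g. `@l c384 Ae`). That syntax and the name `scan_translit_code` are assumptions for illustration; this is not the actual ctangle code.

```c
#include <ctype.h>
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical helper, not the actual ctangle code: parse an even number
   of hex digits (2 to 8, i.e. 1 to 4 whole bytes) at `loc`, which must be
   followed by whitespace. Stores the value in *code and returns the number
   of digits consumed, or 0 on malformed input. */
static int scan_translit_code(const char *loc, uint32_t *code)
{
    int digits = 0, used = 0;
    uint32_t value = 0;

    /* Count the digits first: "%x" alone would also accept leading
       whitespace, a sign, or a "0x" prefix. */
    while (isxdigit((unsigned char)loc[digits])) digits++;
    if (digits < 2 || digits > 8 || digits % 2 != 0) return 0;
    if (!isspace((unsigned char)loc[digits])) return 0;

    /* SCNx32 is the portable scanf conversion for a uint32_t in hex. */
    if (sscanf(loc, "%8" SCNx32 "%n", &value, &used) != 1 || used != digits)
        return 0;

    *code = value;
    return digits;
}
```

With this sketch, `scan_translit_code("e282ac EURO", &code)` returns 6 and sets `code` to `0xe282ac`, while an odd digit count such as `"c38 Ae"` is rejected.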
But don't use a full transliteration table indexed on uint32_t, because it would need 2^32 entries; use a sparse matrix/map instead (possibly in C++).
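One way to keep the table sparse while staying in plain C is a sorted array searched with `bsearch`. The sketch below is only an illustration: the names `translit_set` and `translit_get` and the two size limits are hypothetical, and keys are assumed to be the UTF-8 bytes packed into one integer (`c3 84` becomes `0xc384`).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TRANSLIT_MAX 256 /* hypothetical capacity of the sparse table */
#define TRANSLIT_LEN 32  /* hypothetical maximum replacement length, incl. NUL */

typedef struct {
    uint32_t code;           /* packed UTF-8 bytes, e.g. 0xc384 for 'Ä' */
    char text[TRANSLIT_LEN]; /* replacement text, e.g. "Ae" */
} translit_entry;

static translit_entry translit[TRANSLIT_MAX]; /* kept sorted by code */
static size_t translit_count = 0;

static int cmp_entry(const void *a, const void *b)
{
    uint32_t x = ((const translit_entry *)a)->code;
    uint32_t y = ((const translit_entry *)b)->code;
    return (x > y) - (x < y);
}

/* Insert or overwrite a mapping; returns 0 on success, -1 if the table
   is full or the replacement text is too long. */
static int translit_set(uint32_t code, const char *text)
{
    size_t lo = 0, hi = translit_count;

    if (strlen(text) >= TRANSLIT_LEN) return -1;
    while (lo < hi) { /* binary search for the insertion point */
        size_t mid = lo + (hi - lo) / 2;
        if (translit[mid].code < code) lo = mid + 1; else hi = mid;
    }
    if (lo < translit_count && translit[lo].code == code) {
        strcpy(translit[lo].text, text); /* overwrite existing entry */
        return 0;
    }
    if (translit_count == TRANSLIT_MAX) return -1;
    memmove(&translit[lo + 1], &translit[lo],
            (translit_count - lo) * sizeof translit[0]);
    translit[lo].code = code;
    strcpy(translit[lo].text, text);
    translit_count++;
    return 0;
}

/* Look up a mapping; returns NULL if `code` has no entry. */
static const char *translit_get(uint32_t code)
{
    translit_entry key;
    const translit_entry *hit;

    key.code = code; /* only the code field is compared */
    hit = bsearch(&key, translit, translit_count, sizeof translit[0], cmp_entry);
    return hit ? hit->text : NULL;
}
```

Because the full byte sequence is the key, `0xc3bc` (‘ü’) and `0xc2bc` (‘¼’) stay distinct entries. A hash table or a C++ `std::map` would serve equally well; the sorted array is chosen here only because it needs nothing beyond the C standard library.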
With this latest commit CTANGLE has learned to handle UTF-8 characters in C identifiers, at least in part. See the TeX part of modified section 59 for details.
Here's a very short excerpt of a possible utf8.w include file:
@q<Strip leading byte 'c3'@>
@l 84 Ae
@l 96 Oe
@l 9c Ue
@l a4 ae
@l b6 oe
@l bc ue @q<collision with 'c2 bc'=='¼'@>
@l 9f ss
@q<Strip leading byte 'e2' and first continuation byte '82'@>
@l ac EURO