Open questions for Unicode identifiers

Open sunfishcode opened this issue 4 years ago • 0 comments

#119 requires identifiers to be lower-case, stream-safe, NFC, and kebab-case, where each part delimited by '-' starts with an XID_Start scalar value whose canonical combining class is zero.
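
For concreteness, here is a rough Python sketch of what a check along these lines could look like. It is not the #119 implementation. The helper names are made up for illustration, "lower-case" is assumed to mean the string is unchanged by lowercasing, the non-initial characters of each part are assumed to be XID_Continue, and the stream-safe check is left out entirely. The results also depend on the Unicode version bundled with the Python build.

```python
import unicodedata


def is_xid_start(ch):
    # str.isidentifier() accepts XID_Start or "_" as the first character,
    # so "_" is excluded explicitly here.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch):
    # Prefix a known XID_Start character so that ch is tested as a
    # continuation character.
    return ("a" + ch).isidentifier()


def check_identifier(name):
    """Return a list of problems found; an empty list means no objection.

    Hypothetical helper: the stream-safe requirement is NOT checked here.
    """
    problems = []
    if not unicodedata.is_normalized("NFC", name):
        problems.append("not NFC")
    # Assumption: "lower-case" means lowercasing leaves the string unchanged.
    if name != name.lower():
        problems.append("not lower-case")
    for part in name.split("-"):
        if not part:
            problems.append("empty part")
            continue
        first = part[0]
        if not is_xid_start(first):
            problems.append("part does not start with XID_Start")
        elif unicodedata.combining(first) != 0:
            problems.append("part starts with a nonzero combining class")
        # Assumption: the remaining characters must be XID_Continue.
        if not all(is_xid_continue(c) for c in part[1:]):
            problems.append("part contains a non-XID_Continue character")
    return problems
```

For example, `check_identifier("foo-bar")` reports no problems, while `check_identifier("Foo")` reports that the name is not lower-case.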

Concerns that are not yet addressed include:

Whole-script confusables (e.g. U+0061 (a) vs. U+0430 (а))
Mixed-script confusables
Width-sensitivity (e.g. U+0061 (a) vs. U+FF41 (ａ))
Should scripts no longer in active use, such as Linear B, be disallowed?
Should we prohibit identifier parts from starting with a scalar value that has Grapheme_Extend = Yes, such as U+1885?
The idea is to propose these rules for interface-types itself, but do we really want component instantiation to perform NFC validation and potentially other complex Unicode checks? The concerns here are implementation simplicity, instantiation efficiency, and Unicode version sensitivity.
Should wit-bindgen's parser automatically normalize to NFC, rather than simply erroring on identifiers that aren't normalized?
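
To make two of the questions above more concrete, here is a small Python sketch of the width-sensitivity case and of the normalize-versus-reject choice. This is only an illustration, not wit-bindgen's behavior. The function names and the `auto_normalize` flag are hypothetical. Confusable detection is not shown because it needs the UTS #39 data tables, which are not in the Python standard library.

```python
import unicodedata


def has_width_variant(name):
    # East_Asian_Width "F" (fullwidth) or "H" (halfwidth) marks characters
    # such as U+FF41 that differ from another character only in width.
    return any(unicodedata.east_asian_width(c) in ("F", "H") for c in name)


def parse_identifier(name, auto_normalize):
    """Hypothetical parser hook showing the two policies under discussion."""
    if unicodedata.is_normalized("NFC", name):
        return name
    if auto_normalize:
        # Policy A: silently rewrite the identifier into NFC.
        return unicodedata.normalize("NFC", name)
    # Policy B: reject identifiers that are not already in NFC.
    raise ValueError("identifier is not in NFC: %r" % name)


# NFC leaves U+FF41 alone; only a compatibility form such as NFKC folds
# it to U+0061, so NFC validation by itself does not address width.
assert unicodedata.normalize("NFC", "\uff41") == "\uff41"
assert unicodedata.normalize("NFKC", "\uff41") == "a"
```

With `auto_normalize=True`, an input like "cafe" followed by U+0301 comes back as the precomposed form ending in U+00E9. With `auto_normalize=False` the same input raises an error.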

Dec 21 '21 20:12 sunfishcode