NFC normalization, what's it good for?
Hey @norman, do you remember the examples or cases that led you to add the automatic NFC normalization?
I'm having trouble coming up with cases where it does anything useful; since unidecoder generally operates on only the first byte, most decomposed sequences still seem to work properly without the NFC.
But there are probably some cases where byte sequences not in NFC form won't be 'ascii-ized' properly, or you wouldn't have added the feature -- I'd like to identify them and add a test or two for them, if possible!
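For instance, here's the kind of quick probe I've been running -- assuming the unidecoder gem's String#to_ascii is loaded, it just prints what happens with composed vs. decomposed input, with no NFC step in between:

require 'unidecoder'

composed   = "\u00E9"   # "é" as a single precomposed codepoint (what NFC would give you)
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT (the NFD form)

puts composed.to_ascii.inspect
puts decomposed.to_ascii.inspect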
Thanks for any tips
The original motivation came from https://github.com/rsl/stringex/issues/10.
Hm, thanks @norman.
Huh, that ticket asked for normalizing to NFKD (compatibility, decomposed). But the code actually in unidecoder seems to normalize to NFC (composed, without compatibility).
So I'm still confused; it's not clear to me that the actual code would do anything for the use case mentioned there. But I guess I can mess around with test cases and see what I can figure out.
Okay, doing a "K" (Compatibility) normalization in Unicode changes ™ to "TM", with or without unidecoder:
require 'unicode_utils'
tm = "™"
UnicodeUtils.nfkc(tm) # => "TM"
UnicodeUtils.nfkd(tm) # => "TM"
However, NFC normalization alone does not change ™ to TM:
UnicodeUtils.nfc(tm) # => "™"
UnicodeUtils.nfd(tm) # => "™"
Nor does making it NFC first somehow let Unidecoder turn ™ to "TM":
UnicodeUtils.nfc(tm).to_ascii # => ""
So I'm thinking the NFC normalization as added was a mistake; it doesn't actually do anything useful. Anything I'm missing?
You might want NFKC instead. But it's really a different issue -- Unicode K normalization does some of the things the unidecoder does already; they overlap. You'd expect the unidecoder could convert "™" to "TM" by itself already, but this is missing from its data tables for whatever reason. Hmm.
I'm still thinking NFC normalization doesn't actually do anything useful; NFKC would, although partially because of missing mappings in unidecoder (presumably inherited from the Perl).
Standard Unicode NFKC is pretty darn good at turning things like ™ to TM and ① to "1". But NFKC does not transliterate non-Latin scripts like unidecoder does (although I personally don't find that all that useful). And NFKC also does not remove accents from chars, changing é to e, or do other things like change ø to o (that's the part I find useful).
Unidecoder theoretically does all of that, including much of what NFKC already does -- except that missing mappings in unidecoder's data files (maybe for Unicode codepoints that didn't exist when the original Perl version was created?) mean that NFKC does some things that unidecoder doesn't, like TM.
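A quick illustration of that split, sticking with unicode_utils (these results follow from the standard compatibility decompositions, as far as I can tell):

require 'unicode_utils'

UnicodeUtils.nfkc("™")  # => "TM"  compatibility mapping handles this
UnicodeUtils.nfkc("①")  # => "1"   and this
UnicodeUtils.nfkc("é")  # => "é"   but no accent removal; é has no compatibility mapping
UnicodeUtils.nfkc("ø")  # => "ø"   and ø has no decomposition at all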
Alphabets are complicated.
Your logic sounds totally reasonable. I never got to the bottom of why using the normalization worked for this example, and likely was misled by the fact that it appeared to do what I wanted. At this point you probably know the codebase better than I do, given how much time has passed since I worked on this. Carte blanche to fix as you see fit!
or do other things like change ø to o (that's the part I find useful).
Yup, the fact that all Han characters are treated as Mandarin Chinese is for me a fatal flaw in this library.
Also keep in mind that "visual" decomposition and Unicode decomposition don't match up 100%. For example, ø doesn't decompose to "o" and "/"; it is, somewhat surprisingly -- at least given my native speaker bias -- a single glyph, so if you want to approximate that to ASCII you need a library like this one that maps the codepoint to an ASCII approximation.
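You can see that in the normalization data itself -- for example, with the same unicode_utils gem used above, NFD splits ü into a base letter plus a combining mark but leaves ø alone:

require 'unicode_utils'

UnicodeUtils.nfd("ü").codepoints.map { |cp| "U+%04X" % cp }  # => ["U+0075", "U+0308"]  "u" + COMBINING DIAERESIS
UnicodeUtils.nfd("ø").codepoints.map { |cp| "U+%04X" % cp }  # => ["U+00F8"]            no canonical decomposition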
The main reason I essentially abandoned this library is that I consider this type of conversion to be no more than a neat parlor trick. It breaks down quickly as soon as language sensitivity matters. For example, German orthography already has a rule that "ü" and "ue" are conceptually equivalent. But if you try to apply that rule to Spanish, you get gibberish. That's why I created the Babosa library, which has built-in approximations for many European languages that have been vetted by native speakers to make sure they make sense.
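To be clear, this isn't Babosa's actual API -- just a toy sketch of why the mapping has to be chosen per language (the tables here are made up for illustration):

# Toy per-language approximation tables -- illustrative only, not Babosa's real data.
APPROXIMATIONS = {
  de: { "ü" => "ue", "ö" => "oe", "ä" => "ae", "ß" => "ss" },
  es: { "ü" => "u",  "ñ" => "n",  "é" => "e",  "í" => "i" }
}

def approximate(string, lang)
  table = APPROXIMATIONS.fetch(lang, {})
  string.gsub(/./) { |char| table.fetch(char, char) }
end

approximate("Müller", :de)    # => "Mueller"    correct for German
approximate("pingüino", :de)  # => "pingueino"  gibberish if you apply the German rule to Spanish
approximate("pingüino", :es)  # => "pinguino"   the Spanish-appropriate approximation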
Cool. (Except as far as I can tell, the fix you put in for NFC does not work in the 'tm' use case! The ticket you reference even has you admitting that, heh. NFKC was what you wanted, not NFC.)
Note there is Unicode Technical Report 30 (UTR#30), which contains a Diacritic Folding specification that will also do things like remove diacritics and send ø to o.
However, also note that UTR#30 was withdrawn by Unicode in draft form; it never proceeded beyond draft and never will, I believe because of what you are mentioning: the difficulty of doing this in a context/locale-appropriate way.
Nonetheless, since I'm mainly targeting a US English audience myself, the 'good enough' of either unidecoder or UTR#30 Diacritic Folding is likely Good Enough. (Solr, for instance, includes UTR#30 folding in one of its built-in analyzers; people often do find it useful.)
At the moment, I'm actually leaning more toward trying to implement UTR#30 Diacritic Folding in Ruby than toward unidecoder; I think it may suit my needs better, especially when combined with NFKC. However, figuring out how to efficiently implement UTR#30 Diacritic Folding is a bit tricky for me too, heh.
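For the record, the rough shape I have in mind is something like this -- not a real UTR#30 implementation (that has its own folding data files), just NFKD plus stripping combining marks, with a small supplemental table for characters like ø that don't decompose:

require 'unicode_utils'

# Characters with no decomposition, which mark-stripping alone won't touch.
# A real UTR#30 implementation would drive this from the spec's data files.
SUPPLEMENTAL_FOLDS = { "ø" => "o", "Ø" => "O", "đ" => "d", "ł" => "l" }

def fold_diacritics(string)
  decomposed = UnicodeUtils.nfkd(string)       # compatibility + canonical decomposition
  stripped   = decomposed.gsub(/\p{Mn}/, "")   # drop combining marks (category Mn)
  stripped.gsub(/./) { |char| SUPPLEMENTAL_FOLDS.fetch(char, char) }
end

fold_diacritics("résumé™")  # => "resumeTM"
fold_diacritics("søster")   # => "soster"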
I will definitely check out your Babosa library; I didn't know about that.
The ticket you reference even has you admitting that, heh.
Ha, I think I'm going a little senile now that I've hit 40, that ticket is a very vague memory. :-)
Oh man, I'm dealing with search/retrieval issues rather than display. Theoretically there's a way to use the Unicode Collation Algorithm for search/retrieval, in a locale-specific way, that also ends up folding diacritics etc. (depending on locale). http://www.unicode.org/reports/tr10/#Searching
I can understand the text enough to see that they're saying they're giving you a way to do it, but not enough to figure out how to actually do it! Man, the more I try to figure out how to do this stuff, the further down the rabbit hole I go.
Man oh man, and in yet more possibilities: another gem that uses an approach very similar to unidecoder, but with its own transliteration tables (not part of the unidecoder lineage as far as I can tell), and with more limited scope (it doesn't try to transliterate so many other scripts, mostly just removes diacritics), is i18n.
I18n.transliterate("øé") # => "oe"
Yup, I wrote that. :)
Ha, I am following in your footsteps. Yeah, I18n.transliterate doesn't do quite what I need either.
Okay, after spending several hours reading and re-reading UTS#10, the Unicode Collation Algorithm, I am feeling increasingly optimistic that the solution to my actual needs lies in there. (I needed to normalize for purposes of searching, not for display.) I'll have to put together a bunch of test cases to verify, and probably figure out some more stuff about exactly how to use the various options/parameters/algorithms UCA gives you, to be sure. But that direction is looking good.
(And the twitter_cldr gem gives pure-Ruby access to the UCA, with standard locale tailorings. Although I think it currently needs a tiny feature or two added to it for my needs.)
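In case it's useful to anyone following along, the basic usage looks roughly like this (I'm assuming Collator.new accepts a locale symbol and that the Swedish tailoring ships with the gem; the search-side strength/sort-key options are the part I still need to work out):

require 'twitter_cldr'

root    = TwitterCldr::Collation::Collator.new        # default (root) collation
swedish = TwitterCldr::Collation::Collator.new(:sv)   # Swedish tailoring

words = ["Älg", "Art", "Zebra"]
root.sort(words)     # root collation should treat Ä as A-plus-accent, interleaving it with the A's
swedish.sort(words)  # while Swedish should treat Ä as its own letter, sorted after Z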
Phew, this stuff is complicated. But I'm continually amazed at how well done the Unicode algorithms and data tables are: complete, consistent, well thought out. And what darn good, educational reading the Unicode Technical Reports generally are. If you're into and/or work with this stuff, and you haven't spent some quality time with UAX#15 and UTS#30 yet, I highly recommend it.
Thanks for letting me bounce stuff off you, and for providing this code to look at -- very helpful.