Incorrect handling of double-width characters
Take this TSV:
date |P東 |東 score |P南 |南 score |P西 |西 score |P北 |北 score |comment
2015-04-04 john 35100 bob 32100 mary 12000 katy 20800
2015-04-04 mary 33500 bob 49500 katy 21600 john -4600
It looks aligned in ST+elastic tabstops, complete with column headers. But in any other text viewer (less, or the Markdown view above) the column headers are not aligned, because an extra space is inserted between the double-width characters 東南西北 and the tab separator that follows them.
For clarity, I'll visualize the whitespace characters involved:
date······↦ |P東····↦ |東·score↦ |P南····↦ |南·score↦ |P西····↦ |西·score↦ |P北····↦ |北·score↦ |comment
2015-04-04↦ john···↦ 35100···↦ bob····↦ 32100···↦ mary···↦ 12000···↦ katy···↦ 20800
2015-04-04↦ mary···↦ 33500···↦ bob····↦ 49500···↦ katy···↦ 21600···↦ john···↦ -4600
In a fixed-width environment like a terminal (e.g. less), the string |P東 takes 4 character places to render, even though it's a 3-character string: |, P, 東. This is exactly the width that the john and mary cells have. But (and this is the bug) john and mary are followed by 3 U+20 characters, while |P東 is followed by 4. This is what breaks alignment in monospace viewers that are not elastic-tabstop-aware.
Conceptually, this is easily fixed by using "em width" instead of the plain character count when computing the number of spaces that the plugin inserts for compatibility alignment. The em width of a character C is 1 when unicodedata.east_asian_width(C)=='Na' and 2 when unicodedata.east_asian_width(C)=='W'.
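A minimal sketch of that idea, not the plugin's actual code: display_width and pad_cell are hypothetical names, and the 3 columns of extra padding are only there to match the example above. I'm also assuming that fullwidth ('F') characters should count as 2 just like 'W', and that everything else counts as 1 (so combining characters and ambiguous-width characters are not handled).

```python
import unicodedata


def display_width(text):
    """Columns `text` occupies in a fixed-width terminal.

    Assumption: east_asian_width 'W' (wide) and 'F' (fullwidth) take
    2 columns; every other character takes 1.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def pad_cell(cell, column_width):
    """Hypothetical helper: pad `cell` with spaces up to `column_width` columns."""
    return cell + " " * max(0, column_width - display_width(cell))


cells = ["|P東", "john", "mary"]
# Illustrative choice: widest cell plus 3 columns of padding.
column_width = max(display_width(c) for c in cells) + 3

for cell in cells:
    padded = pad_cell(cell, column_width)
    # len() counts characters, display_width() counts terminal columns.
    print(repr(padded), len(padded), display_width(padded))
```

With width-based padding, |P東 gets 3 trailing spaces, the same as john and mary, so all three cells end at the same terminal column even though the padded |P東 string is one character shorter.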
Whew. I do realize that this report is futile, but still, it's here for the record.
Actually I think this might be one of the few reports that's not futile. It wouldn't be too hard to figure out the width of each character if unicodedata will tell you like that. My one concern is that in Sublime Text, double-width characters aren't quite double-width. It would probably work fine if you just had one or two of them and wide tabs, but with a large number of characters there may be an offset.
Do you feel comfortable hacking Python? Want to give it a shot? Or I can look at it sometime soon.
I think the offset is not a problem, since ST's double-width characters render slightly narrower than two places, and that should be perfectly compensated by the increased width of the following tab. It'd be a problem if wide characters took more space :)
I might give it a try, should be simple...
Right, it would only be a problem if you had quite a few wide characters in a row combined with a relatively narrow tab width: at a certain point you might lose enough space that you end up at an earlier tabstop.
Give it a shot and let me know how it goes! I'm excited to see the PR :)
Well @adzenith, turns out you were quite right: I couldn't get this to work with a tab width less than 5!
But otherwise, I'm quite satisfied with the result. Looks excellent in the terminal, what more could one want :)
PR incoming, any comments welcome.