ElasticTabstops Incorrect handling of double-width characters

Take this TSV:

date        |P東       |東 score  |P南       |南 score  |P西       |西 score  |P北       |北 score  |comment
2015-04-04  john    35100       bob     32100       mary    12000       katy    20800
2015-04-04  mary    33500       bob     49500       katy    21600       john    -4600

It looks aligned in ST+elastic tabstops, complete with column headers. But in any other text viewer (less, or the Markdown view above) the column headers are not aligned, because an extra space is inserted between the double-width characters 東南西北 and the following tab separator.

For clarity, I'll visualize the whitespace characters involved:

date······↦   |P東····↦   |東·score↦   |P南····↦   |南·score↦   |P西····↦   |西·score↦   |P北····↦   |北·score↦   |comment
2015-04-04↦   john···↦   35100···↦   bob····↦   32100···↦   mary···↦   12000···↦   katy···↦   20800
2015-04-04↦   mary···↦   33500···↦   bob····↦   49500···↦   katy···↦   21600···↦   john···↦   -4600

In a fixed-width environment like a terminal (e.g. less), the string |P東 takes 4 character places to render (even though it's a 3-character string: |, P, 東). This is exactly the width of the john and mary cells. But, and this is the bug, john and mary have 3 U+20's after them, while |P東 has 4. This is what breaks alignment in monospace viewers that are not elastic-tabstop-aware.

Conceptually, this is easily fixed by using "em width" instead of plain character count when computing the number of spaces that the plugin inserts for compatibility alignment. The em width of a character C is 1 when unicodedata.east_asian_width(C)=='Na' and 2 when unicodedata.east_asian_width(C)=='W'.
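To make the idea concrete, here is a minimal sketch of width-aware padding. It is not the plugin's actual code: the helper names `display_width` and `pad_cell` are hypothetical, and counting 'F' (fullwidth) as 2 in addition to 'W' is my assumption. Combining characters and other zero-width code points are not handled.

```python
import unicodedata


def display_width(text):
    """Count terminal cells: East Asian Wide ('W') and Fullwidth ('F')
    characters take 2 cells, everything else is assumed to take 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def pad_cell(text, target_width):
    """Append spaces so the cell occupies target_width terminal cells.
    (Hypothetical helper; text wider than the target is left unpadded.)"""
    return text + " " * max(0, target_width - display_width(text))


# len() counts code points, display_width() counts cells.
print(len("|P東"), display_width("|P東"))
print(repr(pad_cell("|P東", 7)), repr(pad_cell("john", 7)))
```

With this, |P東 and john both measure 4 cells, so both get the same 3 spaces of padding when padded to 7 cells.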

Whew. I do realize that this report is futile, but still, it's here for the record.

Oct 20 '15 13:10 ulidtko

Actually I think this might be one of the few reports that's not futile. It wouldn't be too hard to figure out the width of each character if unicodedata will tell you like that. My one concern is that in Sublime Text, double-width characters aren't quite double-width. It would probably work fine if you just had one or two of them and wide tabs, but with a large number of characters there may be an offset.

Do you feel comfortable hacking python? Want to give it a shot? Or I can look at it sometime soon.

Oct 20 '15 14:10 adzenith

I think the offset is not a problem, since ST's double-width characters render slightly narrower than two places, and that should be perfectly compensated by the increased width of the following tab. It'd be a problem if wide characters took more space :)

I might give it a try, should be simple...

Oct 21 '15 11:10 ulidtko

Right, it would only be a problem if you had quite a few characters in a row combined with a relatively narrow tab width: at a certain point you might be able to lose enough space so that you end up at an earlier tabstop.
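A rough sketch of that arithmetic, under loudly stated assumptions: tabstops sit at multiples of the tab width, and a wide glyph renders at 1.8 cells in the editor. The 1.8 figure is made up for illustration and is not a measured Sublime Text value; `next_tabstop` and `editor_tabstop` are hypothetical names.

```python
from fractions import Fraction
import math


def next_tabstop(position, tab_width):
    """Column of the first tabstop strictly after position."""
    return (math.floor(position / tab_width) + 1) * tab_width


def editor_tabstop(nominal_end, wide_chars, tab_width,
                   wide_glyph_width=Fraction(9, 5)):
    """Tabstop the tab lands on in the editor, given that each wide glyph
    renders narrower than 2 cells (1.8 is an assumed, illustrative value)."""
    shortfall = wide_chars * (2 - wide_glyph_width)
    return next_tabstop(nominal_end - shortfall, tab_width)


# Text plus padding nominally ends at column 13.
print(next_tabstop(13, 4))       # where a terminal puts the tab
print(editor_tabstop(13, 1, 4))  # one wide character
print(editor_tabstop(13, 6, 4))  # six wide characters, narrow tab
print(editor_tabstop(13, 6, 8))  # six wide characters, wide tab
```

With these assumed numbers, one wide character still reaches tabstop 16, six of them with a tab width of 4 fall back to tabstop 12, and six of them with a tab width of 8 still reach 16.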

Give it a shot and let me know how it goes! I'm excited to see the PR :)

Oct 21 '15 16:10 adzenith

Well @adzenith, turns out you were quite right: I couldn't get this to work with tab width less than 5!

But otherwise, I'm quite satisfied with the result. Looks excellent in the terminal, what more could one want :)

PR incoming, any comments welcome

Oct 23 '15 16:10 ulidtko