
Wrong (or at least non-standard) handling of umlaut/dieresis characters in editor (üöä)

Open biasedlogic opened this issue 5 months ago • 2 comments

//System info at the bottom

Issue: German umlauts ÄÖÜ/äöü (and, I suspect, other accented or combined letters) entered as literals in the editor are not stored or passed to objects in „NFC“ form, where each such character occupies exactly one position in a string. The result is a ‚str‘ object that holds one code point for the base letter and another for the combining mark, so the string is longer than the text it displays. The editor's behaviour breaks as well: it takes two right-arrow or left-arrow keystrokes to traverse a single umlaut letter, and deleting these characters sometimes behaves inconsistently.

Background of the problem: There are characters with multiple possible Unicode representations, like the German umlauts (äöü), which can be stored either as a pair of code points, a base letter (e.g. „a”) followed by a combining dieresis („ ¨ ”), or as a single precomposed code point (e.g. „ä”). The most useful way is to store them in the composed form (i.e. „NFC” form), because a) in the editor, backspacing over an ä then deletes the whole thing, which is how any editor/GUI would treat it, since it is just a single letter, and b) more importantly, string literals typed into the editor otherwise behave inconsistently with other processing environments. However, unlike all other implementations that I have tested, Pythonista for some reason breaks each German umlaut apart into two separate characters, claiming that the four-letter word „März” is, indeed, five characters long.
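To make the two representations concrete without depending on how any editor stores the literal, here is a minimal sketch that builds both forms from explicit escape sequences. The variable names are only illustrative; `unicodedata.normalize` is the standard-library call for converting between the forms.

```python
import unicodedata

# Precomposed form: "ä" is the single code point U+00E4.
composed = "M\u00e4rz"
# Decomposed form: "a" followed by U+0308 COMBINING DIAERESIS.
decomposed = "Ma\u0308rz"

print(len(composed), len(decomposed))   # 4 5
print(composed == decomposed)           # False: they render alike but differ

# NFC composes, NFD decomposes; each maps one form onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Both strings display as the same word, which is why the difference only shows up in `len()`, comparisons, or per-character checks such as `isalpha()`.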

Show and tell:

Try the following code in Pythonista:

s = "März"

print(f"The Length of string '{s}' is {len(s)}")
for c in s:
	print(f"Character '{c}' is alphanumeric?: {c.isalpha()}")

In Pythonista this results in:

The Length of string 'März' is 5
Character 'M' is alphanumeric?: True
Character 'a' is alphanumeric?: True
Character '̈' is alphanumeric?: False
Character 'r' is alphanumeric?: True
Character 'z' is alphanumeric?: True

In other environments, e.g. Colab (see https://colab.research.google.com/drive/1NPChlenbDdGk2atTRIiu89LmPeiKY-Qz?usp=sharing), the result is as expected:

The Length of string 'März' is 4
Character 'M' is alphanumeric?: True
Character 'ä' is alphanumeric?: True
Character 'r' is alphanumeric?: True
Character 'z' is alphanumeric?: True

The code can be copy-pasted between Colab and Pythonista: Pythonista will break the single Unicode letter apart, while Colab (or Python on a Windows PC, in Jupyter Notebooks, on a Linux machine or on my Android phone…) will treat it as it should be treated: as a single letter, so that the example word „März” is four characters long.
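To check which form a given string actually contains, it may help to list its code points by name. This is a small diagnostic sketch; `describe` is a hypothetical helper name, and the sample string is written with an escape sequence so that it is decomposed regardless of the editor it is pasted into.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

# Decomposed sample: "a" followed by a combining diaeresis.
for entry in describe("Ma\u0308rz"):
    print(entry)
# U+004D LATIN CAPITAL LETTER M
# U+0061 LATIN SMALL LETTER A
# U+0308 COMBINING DIAERESIS
# U+0072 LATIN SMALL LETTER R
# U+007A LATIN SMALL LETTER Z
```

Running `describe(s)` on the literal typed in the editor should show whether the ä arrived as U+00E4 or as the U+0061/U+0308 pair.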

Pythonista 3.4 (340012)

--- SYSTEM INFORMATION ---

  • Pythonista N/A (N/A), Default interpreter 3.10.4
  • iOS 18.6.2, model iPad14,10, resolution (portrait) 2048.0 x 2732.0 @ 2.0

--- SYSTEM INFORMATION ---

biasedlogic, Sep 11 '25 18:09

This is getting even ‚funnier’… If I create a file with the example word „März” in another text editor, say, in the Thonny IDE on a Windows PC, it will contain a properly encoded ä, and when run in Pythonista it will produce len("März") == 4. If, however, I select all the code in Pythonista, then copy and paste it into another script, the character encoding changes and the result is len("März") == 5.

~I honestly don't know whether the following file will keep its encoding through export/upload/attachment; I will check the round trip after posting this comment, but~ Confirmed, it uploaded correctly. It contains two seemingly identical pieces of code. The first was typed in Thonny on a PC, then shared as a .py file and imported into Pythonista. The second block was copy-pasted within Pythonista from the first block (select -> CMD-C -> CMD-V). When the script is run, the two blocks generate different outputs:

march_unmadness.py

The Length of string 'März' is 4
Character 'M' is alphanumeric?: True
Character 'ä' is alphanumeric?: True
Character 'r' is alphanumeric?: True
Character 'z' is alphanumeric?: True
The Length of string 'März' is 5
Character 'M' is alphanumeric?: True
Character 'a' is alphanumeric?: True
Character '̈' is alphanumeric?: False
Character 'r' is alphanumeric?: True
Character 'z' is alphanumeric?: True

These outputs come from two blocks that SHOULD really be identical at this point (copy/paste). And yes, running the march_unmadness.py file in Thonny gives the same result as above, because the strings ARE encoded in the file in two different ways. However, unlike in Pythonista, if you copy-paste either of these blocks in Thonny, the copy keeps behaving like its source block, so the encoding stays consistent between copy and paste. The same holds for Colab, but not for Pythonista.
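Until the editor behaviour is sorted out, one possible way to spot affected lines in a script is to test each line for NFC form. This is only a sketch of a workaround, not a fix for the underlying issue: `find_non_nfc_lines` is a hypothetical helper, the sample text stands in for a file like march_unmadness.py, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

def find_non_nfc_lines(source_text):
    """Return the 1-based numbers of lines that are not in NFC form."""
    return [
        number
        for number, line in enumerate(source_text.splitlines(), start=1)
        if not unicodedata.is_normalized("NFC", line)
    ]

# Stand-in for a script holding both variants of the same literal:
# line 1 uses the precomposed ä, line 2 the decomposed pair.
sample = 's = "M\u00e4rz"\ns = "Ma\u0308rz"\n'

print(find_non_nfc_lines(sample))   # [2]

# Normalizing the whole text composes the decomposed pair.
fixed = unicodedata.normalize("NFC", sample)
print(find_non_nfc_lines(fixed))    # []
```

The same `unicodedata.normalize("NFC", ...)` call can also be applied to individual strings at runtime if a literal cannot be trusted to be composed.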

biasedlogic, Sep 12 '25 07:09

This behaviour is also not entirely consistent across Pythonista, and it does not seem to be an issue of the iPad itself: if I pass the same typed string in via a UI text box, it is ‚properly‘ encoded and len() returns 4 for the string März. This holds even if I type it on the US keyboard by pressing [SHIFT-m] [ALT+u] [a] [r] [z] (which could, indeed, produce a 5-character string) instead of using the German keyboard's direct input [M] [ä] [r] [z], as shown in the attached image.

biasedlogic, Sep 12 '25 07:09