cpython icon indicating copy to clipboard operation
cpython copied to clipboard

Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER

Open f8a59035-e116-44bb-8c5c-ddaf44c84ed6 opened this issue 8 years ago • 14 comments

BPO 32354
Nosy @vstinner, @ezio-melotti, @ideasman42

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2017-12-18.02:00:12.723>
labels = ['expert-unicode']
title = 'Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER'
updated_at = <Date 2019-10-27.16:17:44.802>
user = 'https://github.com/ideasman42'

bugs.python.org fields:

activity = <Date 2019-10-27.16:17:44.802>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2017-12-18.02:00:12.723>
creator = 'ideasman42'
dependencies = []
files = []
hgrepos = []
issue_num = 32354
keywords = []
message_count = 4.0
messages = ['308506', '308523', '308607', '355480']
nosy_count = 3.0
nosy_names = ['vstinner', 'ezio.melotti', 'ideasman42']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue32354'
versions = ['Python 3.6']

Linked PRs

  • gh-117117

Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER are marked as deprecated in the docs.

https://docs.python.org/3/c-api/unicode.html?highlight=py_unicode_tolower#c.Py_UNICODE_TOLOWER

Someone submitted a patch to our project which uses these.

What is unclear, is if there is an intention to replace these (wrappers for '_PyUnicode_ToLowerFull' for eg).

Or if this will be removed without any alternative.

I assume the functionality will be kept somewhere since Python needs str.lower/upper.

Will there be a way to perform this in the future? Either way, could docs be updated to reflect this?

Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER *API* doesn't respect the latest Unicode standard. For example, it doesn't support this operation:

>>> "ß".upper()
'SS'

vstinner avatar Dec 18 '17 08:12 vstinner

Thanks for the info, in that case will there be a way to do this from the CPython API which can work with raw strings? (utf8, wchat_t or similar types), that doesn't require the GIL and alloc/free of PyObjects.

See also bpo-38604: Schedule Py_UNICODE API removal.

vstinner avatar Oct 27 '19 16:10 vstinner

@vstinner Can we please revisit this? Py_UNICODE_TOLOWER and Py_UNICODE_TOUPPER have been deprecated since 3.3, but have not been removed.

Also, there's now no way to upper/lower a ucs4 character using the C api, which follows the latest Unicode standard and handles all the obscure cases like capital sigma, like str.lower()/upper() do. What do you think we should do? Remove them for good or maybe add a new macro that uses the new set of functions?

lysnikolaou avatar Dec 06 '23 13:12 lysnikolaou

In C, you can create a sub-string of a single character (ex: using PyUnicode_FromOrdinal()) and then call the upper() method on it. I'm not sure that a new API is needed for that.

Would you mind to elaborate your use case?

Py_UNICODE_TOLOWER and Py_UNICODE_TOUPPER have been deprecated since 3.3, but have not been removed.

Ah right, we should start to deprecate them.

vstinner avatar Dec 06 '23 13:12 vstinner

Would you mind to elaborate your use case?

If possible, I'd like to skip creating unicode objects and work directly on unicode codepoints.

The usecase for this is that we're trying to speed up operations on numpy string arrays, so we're writing numpy ufuncs that operate directly on C buffers, as opposed to the previous approach which was to create PyObjects and call the methods on them. We've observed a very good speed-up, so we're trying to do it for every operation a numpy string supports. See numpy/numpy#24952 for how I'm currently trying to implement it.

Ah right, we should start to deprecate them.

I can start working on a PR to remove these macros if you want.

lysnikolaou avatar Dec 06 '23 14:12 lysnikolaou

@vstinner Quick ping here.

lysnikolaou avatar Feb 26 '24 11:02 lysnikolaou

Thanks for the info, in that case will there be a way to do this from the CPython API which can work with raw strings? (utf8, wchat_t or similar types), that doesn't require the GIL and alloc/free of PyObjects.

I suggest calling str.lower(), str.title() and str.upper() methods which need a Python str object, the GIL, and to manage reference count. No, there is no API which fits your use cases directly.

vstinner avatar Mar 13 '24 08:03 vstinner

It would be incredibly helpful to some of our work on NumPy, if there was a way to do tolower / toupper for single codepoints, which is exactly what the deprecated macros used to do. Is there any way we can undeprecate them?

lysnikolaou avatar Mar 13 '24 11:03 lysnikolaou

I propose the API:

Py_ssize_t PyUnicode_ToLower(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
Py_ssize_t PyUnicode_ToUpper(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
Py_ssize_t PyUnicode_ToTitle(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
  • On success, write characters into buffer and return a number >= 1.
  • If the buffer is too small, return -1.

vstinner avatar Mar 14 '24 10:03 vstinner

I wrote a script to check for the maximum output length in Python 3.13.

  • lower() expands a single character up to 2 characters.
  • upper() expands a single character up to 3 characters.
  • title() expands a single character up to 3 characters.

To be safe, I would suggest to pass a buffer of 5 characters, "just in case", to be future proof.

Script:

def dump(text):
    chars = ' '.join(f'U+{ord(ch):04x}' for ch in text)
    return f"{text!r} ({chars})"

max_chr = 0x10_ffff
maxl = 0
maxu = 0
maxt = 0
for ch in range(0, max_chr+1):
    ch = chr(ch)

    output = ch.lower()
    x = len(output)
    if x > maxl:
        maxl = x
        print(f"New max lower()={maxl}: {dump(ch)} => {dump(output)}")

    output = ch.upper()
    x = len(output)
    if x > maxu:
        maxu = x
        print(f"New max upper()={maxu}: {dump(ch)} => {dump(output)}")

    output = ch.title()
    x = len(output)
    if x > maxt:
        maxt = x
        print(f"New max title()={maxt}: {dump(ch)} => {dump(output)}")

Ouptut (sorted manually):

New max lower()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max lower()=2: 'İ' (U+0130) => 'i̇' (U+0069 U+0307)

New max upper()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max upper()=2: 'ß' (U+00df) => 'SS' (U+0053 U+0053)
New max upper()=3: 'ΐ' (U+0390) => 'Ϊ́' (U+0399 U+0308 U+0301)

New max title()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max title()=2: 'ß' (U+00df) => 'Ss' (U+0053 U+0073)
New max title()=3: 'ΐ' (U+0390) => 'Ϊ́' (U+0399 U+0308 U+0301)

vstinner avatar Mar 14 '24 11:03 vstinner

Wow, I didn't know that upper-casing or lower-casing could lead to more than one characters being returned. Either way, the API looks good to me, and would definitely be okay to use in NumPy. Would you like me to work on it?

lysnikolaou avatar Mar 14 '24 11:03 lysnikolaou

The implementation already exists:

int _PyUnicode_ToLowerFull(Py_UCS4 ch, Py_UCS4 *res)
int _PyUnicode_ToTitleFull(Py_UCS4 ch, Py_UCS4 *res)
int _PyUnicode_ToUpperFull(Py_UCS4 ch, Py_UCS4 *res)

We should write a public API on top of it which takes care of the buffer size.

Oh, case_operation() makes the expectation that a single character can only expand to 3 characters. So maybe 3 is enough.

@lysnikolaou: Sure, you can propose a PR to implement it.

I'm not sure if we should expose a "title" operation and "case folding" operation. In case of doubt, I propose to start with the minimum: upper and lower.

vstinner avatar Mar 14 '24 11:03 vstinner

The issues with #117117:

  • Buffer overflow if specify too small buffer.
  • Returning -1 on failure loses information about what size of the buffer is required.

I think that the current C API is good as is, we only should make it public. We can also extend it to support NULL as res -- the result will say us what size of the buffer is needed. But this may be overkill, because the maximal size is not so large, we always can use an output buffer of static size.

serhiy-storchaka avatar Mar 22 '24 11:03 serhiy-storchaka

We can provide a constant which is the minimum buffer size.

vstinner avatar Mar 22 '24 11:03 vstinner

I've opened #136176 to move this forward. I did not add a constant, but I can certainly add it if you think it's worth it.

lysnikolaou avatar Jul 01 '25 13:07 lysnikolaou