Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER
| BPO | 32354 |
|---|---|
| Nosy | @vstinner, @ezio-melotti, @ideasman42 |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
assignee = None
closed_at = None
created_at = <Date 2017-12-18.02:00:12.723>
labels = ['expert-unicode']
title = 'Unclear intention of deprecating Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER'
updated_at = <Date 2019-10-27.16:17:44.802>
user = 'https://github.com/ideasman42'
bugs.python.org fields:
activity = <Date 2019-10-27.16:17:44.802>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation = <Date 2017-12-18.02:00:12.723>
creator = 'ideasman42'
dependencies = []
files = []
hgrepos = []
issue_num = 32354
keywords = []
message_count = 4.0
messages = ['308506', '308523', '308607', '355480']
nosy_count = 3.0
nosy_names = ['vstinner', 'ezio.melotti', 'ideasman42']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue32354'
versions = ['Python 3.6']
Linked PRs
- gh-117117
Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER are marked as deprecated in the docs.
https://docs.python.org/3/c-api/unicode.html?highlight=py_unicode_tolower#c.Py_UNICODE_TOLOWER
Someone submitted a patch to our project which uses these.
What is unclear, is if there is an intention to replace these (wrappers for '_PyUnicode_ToLowerFull' for eg).
Or if this will be removed without any alternative.
I assume the functionality will be kept somewhere since Python needs str.lower/upper.
Will there be a way to perform this in the future? Either way, could docs be updated to reflect this?
Py_UNICODE_TOLOWER / Py_UNICODE_TOUPPER *API* doesn't respect the latest Unicode standard. For example, it doesn't support this operation:
>>> "ß".upper()
'SS'
Thanks for the info, in that case will there be a way to do this from the CPython API which can work with raw strings? (utf8, wchat_t or similar types), that doesn't require the GIL and alloc/free of PyObjects.
See also bpo-38604: Schedule Py_UNICODE API removal.
@vstinner Can we please revisit this? Py_UNICODE_TOLOWER and Py_UNICODE_TOUPPER have been deprecated since 3.3, but have not been removed.
Also, there's now no way to upper/lower a ucs4 character using the C api, which follows the latest Unicode standard and handles all the obscure cases like capital sigma, like str.lower()/upper() do. What do you think we should do? Remove them for good or maybe add a new macro that uses the new set of functions?
In C, you can create a sub-string of a single character (ex: using PyUnicode_FromOrdinal()) and then call the upper() method on it. I'm not sure that a new API is needed for that.
Would you mind to elaborate your use case?
Py_UNICODE_TOLOWER and Py_UNICODE_TOUPPER have been deprecated since 3.3, but have not been removed.
Ah right, we should start to deprecate them.
Would you mind to elaborate your use case?
If possible, I'd like to skip creating unicode objects and work directly on unicode codepoints.
The usecase for this is that we're trying to speed up operations on numpy string arrays, so we're writing numpy ufuncs that operate directly on C buffers, as opposed to the previous approach which was to create PyObjects and call the methods on them. We've observed a very good speed-up, so we're trying to do it for every operation a numpy string supports. See numpy/numpy#24952 for how I'm currently trying to implement it.
Ah right, we should start to deprecate them.
I can start working on a PR to remove these macros if you want.
@vstinner Quick ping here.
Thanks for the info, in that case will there be a way to do this from the CPython API which can work with raw strings? (utf8, wchat_t or similar types), that doesn't require the GIL and alloc/free of PyObjects.
I suggest calling str.lower(), str.title() and str.upper() methods which need a Python str object, the GIL, and to manage reference count. No, there is no API which fits your use cases directly.
It would be incredibly helpful to some of our work on NumPy, if there was a way to do tolower / toupper for single codepoints, which is exactly what the deprecated macros used to do. Is there any way we can undeprecate them?
I propose the API:
Py_ssize_t PyUnicode_ToLower(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
Py_ssize_t PyUnicode_ToUpper(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
Py_ssize_t PyUnicode_ToTitle(Py_UCS4 ch, Py_UCS4 *buffer, Py_ssize_t size);
- On success, write characters into
bufferand return a number >= 1. - If the buffer is too small, return
-1.
I wrote a script to check for the maximum output length in Python 3.13.
- lower() expands a single character up to 2 characters.
- upper() expands a single character up to 3 characters.
- title() expands a single character up to 3 characters.
To be safe, I would suggest to pass a buffer of 5 characters, "just in case", to be future proof.
Script:
def dump(text):
chars = ' '.join(f'U+{ord(ch):04x}' for ch in text)
return f"{text!r} ({chars})"
max_chr = 0x10_ffff
maxl = 0
maxu = 0
maxt = 0
for ch in range(0, max_chr+1):
ch = chr(ch)
output = ch.lower()
x = len(output)
if x > maxl:
maxl = x
print(f"New max lower()={maxl}: {dump(ch)} => {dump(output)}")
output = ch.upper()
x = len(output)
if x > maxu:
maxu = x
print(f"New max upper()={maxu}: {dump(ch)} => {dump(output)}")
output = ch.title()
x = len(output)
if x > maxt:
maxt = x
print(f"New max title()={maxt}: {dump(ch)} => {dump(output)}")
Ouptut (sorted manually):
New max lower()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max lower()=2: 'İ' (U+0130) => 'i̇' (U+0069 U+0307)
New max upper()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max upper()=2: 'ß' (U+00df) => 'SS' (U+0053 U+0053)
New max upper()=3: 'ΐ' (U+0390) => 'Ϊ́' (U+0399 U+0308 U+0301)
New max title()=1: '\x00' (U+0000) => '\x00' (U+0000)
New max title()=2: 'ß' (U+00df) => 'Ss' (U+0053 U+0073)
New max title()=3: 'ΐ' (U+0390) => 'Ϊ́' (U+0399 U+0308 U+0301)
Wow, I didn't know that upper-casing or lower-casing could lead to more than one characters being returned. Either way, the API looks good to me, and would definitely be okay to use in NumPy. Would you like me to work on it?
The implementation already exists:
int _PyUnicode_ToLowerFull(Py_UCS4 ch, Py_UCS4 *res)
int _PyUnicode_ToTitleFull(Py_UCS4 ch, Py_UCS4 *res)
int _PyUnicode_ToUpperFull(Py_UCS4 ch, Py_UCS4 *res)
We should write a public API on top of it which takes care of the buffer size.
Oh, case_operation() makes the expectation that a single character can only expand to 3 characters. So maybe 3 is enough.
@lysnikolaou: Sure, you can propose a PR to implement it.
I'm not sure if we should expose a "title" operation and "case folding" operation. In case of doubt, I propose to start with the minimum: upper and lower.
The issues with #117117:
- Buffer overflow if specify too small buffer.
- Returning -1 on failure loses information about what size of the buffer is required.
I think that the current C API is good as is, we only should make it public. We can also extend it to support NULL as res -- the result will say us what size of the buffer is needed. But this may be overkill, because the maximal size is not so large, we always can use an output buffer of static size.
We can provide a constant which is the minimum buffer size.
I've opened #136176 to move this forward. I did not add a constant, but I can certainly add it if you think it's worth it.