cpython Reduce memory usage of urllib.unquote and unquote_to_bytes

BPO 44334

Nosy @terryjreedy, @gpshead, @orsenthil, @mustafaelagamey

PRs
python/cpython#26576

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-06-07.15:07:38.990>
labels = ['extension-modules', '3.11', '3.9', '3.10', 'performance']
title = 'Use bytearray in urllib.unquote_to_bytes'
updated_at = <Date 2021-06-07.20:09:11.522>
user = 'https://github.com/mustafaelagamey'

bugs.python.org fields:

activity = <Date 2021-06-07.20:09:11.522>
actor = 'gregory.p.smith'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules']
creation = <Date 2021-06-07.15:07:38.990>
creator = 'eng.mustafaelagamey'
dependencies = []
files = []
hgrepos = []
issue_num = 44334
keywords = ['patch']
message_count = 2.0
messages = ['395280', '395281']
nosy_count = 4.0
nosy_names = ['terry.reedy', 'gregory.p.smith', 'orsenthil', 'eng.mustafaelagamey']
pr_nums = ['26576']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue44334'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

PR: gh-96763

Jun 07 '21 15:06 ad186d5a-3642-4a78-96eb-191b051d514d

'eng' claimed in the original title that "urllib.parse.parse_qsl cannot parse large data". On the original PR, they said the problem occurred with 6-7 million bytes.

The claim should be backed up by a generated example that fails with the original code and succeeds with the new code. Claims of 'faster' also need some examples.
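One way such a generated example might look is sketched below. It measures peak allocation with `tracemalloc` while unquoting an input made entirely of percent-escapes. The helper name `peak_memory` and the payload size are assumptions chosen for illustration; they do not come from the PR, and the reported numbers will vary by Python version and platform.

```python
import tracemalloc
from urllib.parse import unquote_to_bytes


def peak_memory(func, arg):
    """Call func(arg) and return (result, peak traced bytes).

    Hypothetical helper for illustration; not part of urllib.
    """
    tracemalloc.start()
    try:
        result = func(arg)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Every byte of output comes from a percent-escape, which is the
# worst case for an implementation that stores one object per piece.
payload = b"%41" * 200_000

result, peak = peak_memory(unquote_to_bytes, payload)
print(f"input: {len(payload)} bytes, output: {len(result)} bytes, peak: {peak} bytes")
```

Running the same script against builds with and without the patch would give the before/after comparison requested above.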

Nearly all original PRs must propose merging a branch created from main into main. Performance enhancements are often not backported.

Jun 07 '21 19:06 terryjreedy

fwiw this sort of thing may be reasonable to backport to 3.9 as it is more than just a performance enhancement but also a resource consumption bug and should result in no behavior change.

""" In case of form contain very large data ( in my case the string to parse was about 6000000 byte ) Old code use list of bytes during parsing consumes a lot of memory New code will use bytearry , which use less memory """ - text from the original PR
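The difference described in the PR text can be sketched as follows. This is a simplified illustration of the two accumulation strategies, not the actual standard-library code or the patch itself; the names `_HEX_TO_BYTE`, `unquote_with_list`, and `unquote_with_bytearray` are assumptions made for this example.

```python
_HEXDIG = "0123456789ABCDEFabcdef"
# Lookup table from two hex digits (as bytes) to the decoded single byte.
_HEX_TO_BYTE = {
    (a + b).encode(): bytes.fromhex(a + b) for a in _HEXDIG for b in _HEXDIG
}


def unquote_with_list(data: bytes) -> bytes:
    """Accumulate pieces in a list, then join them (list-of-bytes approach)."""
    bits = data.split(b"%")
    res = [bits[0]]
    for item in bits[1:]:
        try:
            res.append(_HEX_TO_BYTE[item[:2]])
            res.append(item[2:])
        except KeyError:
            # Not a valid escape: keep the percent sign and the text as-is.
            res.append(b"%")
            res.append(item)
    return b"".join(res)


def unquote_with_bytearray(data: bytes) -> bytes:
    """Accumulate into one growing bytearray instead of a list of objects."""
    bits = data.split(b"%")
    res = bytearray(bits[0])
    for item in bits[1:]:
        try:
            res += _HEX_TO_BYTE[item[:2]]
            res += item[2:]
        except KeyError:
            res += b"%"
            res += item
    return bytes(res)
```

Both functions return the same bytes. The list version holds a reference per piece until the final join, while the bytearray version keeps a single contiguous buffer, which is where the memory saving would be expected to come from.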

Jun 07 '21 20:06 gpshead

The PR was closed due to technicalities (pointing to the wrong branch, CLA) and the OP didn’t follow up.

Unless someone objects, I will close this issue as well.

Sep 11 '22 10:09 iritkatriel

I created a new PR and included fixing a similar legacy design issue in unquote() as well as the original report's unquote_to_bytes(). Some performance microbenchmarks need running before I'll consider moving forward with it.

If someone wanted to consider this a security issue, it could be backported. It is at most a fixed constant factor of memory consumption (roughly $len(input) * sizeof(PyObject)$) against a maximally antagonistic input, though. That doesn't smell DoS worthy.

Sep 12 '22 08:09 gpshead

Reduce memory usage of urllib.unquote and unquote_to_bytes