Regression on parsing invalid URLs

Open kamil-certat opened this issue 2 years ago • 0 comments

As a continuation of #2377, we have a regression in parsing invalid URLs. Previously, urllib was much more liberal in processing URLs; now it rejects many more cases.

We use it to sanitize URLs, and html_parser is an example of a bot that relies on the liberal behavior in its tests:

https://github.com/certtools/intelmq/blob/61c45acfb8cc60e1419abe7c57691561ef9ee072/intelmq/tests/bots/parsers/html_table/test_parser_column_split.py#L47

https://github.com/certtools/intelmq/blob/61c45acfb8cc60e1419abe7c57691561ef9ee072/intelmq/tests/bots/parsers/html_table/test_parser_column_split.py#L73-L80

In patched Python versions (e.g. 3.11.4), this URL is rejected. We need to either decide against allowing such URLs, or redesign our sanitization.
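To make the version difference easier to see, here is a minimal sketch, assuming the rejection surfaces as a `ValueError` raised by `urllib.parse.urlsplit`. The helper name `try_urlsplit` and the sample URLs are hypothetical and are not taken from the IntelMQ code base or its tests.

```python
from typing import Optional
from urllib.parse import SplitResult, urlsplit


def try_urlsplit(url: str) -> Optional[SplitResult]:
    """Return the parsed URL, or None if urllib refuses to parse it.

    Hypothetical helper for illustration only; not IntelMQ's sanitizer.
    """
    try:
        return urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError for hosts it considers invalid.
        return None


# A well-formed URL parses on every Python version.
ok = try_urlsplit("http://example.com/path")

# An unbalanced bracket in the host is rejected.
unbalanced = try_urlsplit("http://[invalid/")

# A bracketed host that is not an IP literal: the result depends on
# whether the interpreter includes the stricter host validation.
bracketed = try_urlsplit("http://[not-an-ip]example.com/")
print("bracketed host accepted:", bracketed is not None)
```

Running this on an older and on a patched interpreter shows which inputs changed behavior. A sanitizer built this way could then decide explicitly what to do with the `None` case instead of depending on urllib's leniency.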

Temporarily, the test is skipped to unblock other work.

Jun 22 '23 11:06 kamil-certat