Crashes with Danish characters
Describe the bug Google Calendar outputs a file containing the line
ATTACH;FILENAME=Fødselsdag_40.pdf:https://someurl.com
This gives the error
  File "/home/vscode/.local/lib/python3.12/site-packages/ical/parsing/parser.py", line 111, in parse_contentlines
    return [parser.parse_string(line, parse_all=True) for line in lines if line]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/pyparsing/core.py", line 1219, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: Expected ':', found 'ødselsdag' (at char 17), (line:1, col:18)
To Reproduce Create an ics file with the line above in an event and parse it.
Expected behavior 'ø' (and other non-ASCII characters) should be treated as normal word characters.
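To make the expected result concrete, here is a minimal sketch using only the standard library. It is not ical's grammar: `SAFE_CHAR`, `CONTENT_LINE`, and `parse_content_line` are hypothetical names, and the pattern is a simplification of the RFC 5545 content line that handles unquoted parameter values only.

```python
import re

# Assumption: a simplified stand-in for RFC 5545 SAFE-CHAR, i.e. any character
# except control characters (tab allowed), DQUOTE, ";", ":" and ",".
# Non-ASCII characters such as "ø" are accepted.
SAFE_CHAR = r'[^\x00-\x08\x0a-\x1f\x7f";:,]'

# Hypothetical pattern: NAME *(";" PARAM "=" VALUE) ":" VALUE
CONTENT_LINE = re.compile(
    r"(?P<name>[A-Za-z0-9-]+)"
    r"(?P<params>(?:;[A-Za-z0-9-]+=" + SAFE_CHAR + r"*)*)"
    r":(?P<value>.*)"
)


def parse_content_line(line: str) -> tuple[str, dict[str, str], str]:
    """Split one unfolded content line into (name, params, value)."""
    m = CONTENT_LINE.fullmatch(line)
    if m is None:
        raise ValueError(f"not a content line: {line!r}")
    # Each parameter looks like "KEY=VALUE"; the leading ";" yields an empty
    # first item, which is skipped.
    params = dict(p.split("=", 1) for p in m["params"].split(";") if p)
    return m["name"], params, m["value"]


line = "ATTACH;FILENAME=Fødselsdag_40.pdf:https://someurl.com"
print(line.index("ø"))           # offset of the first non-ASCII character
print(parse_content_line(line))
```

The offset printed first is the position reported in the `ParseException` above (char 17), so the parser stops exactly at the first non-ASCII character of the parameter value.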
Additional context
Thanks, I'm able to reproduce this. I believe the issue may be in this code here: https://github.com/allenporter/ical/blob/69eb3ce7aa218a0b5107406541273ef5ae94d323/ical/parsing/unicode.py#L43 which does not seem to implement the Unicode ranges correctly. I spent some time trying to make it work but didn't have any luck, so I will need to take another pass.
I had a chance to try to fix this, but the performance was bad. Need to look for an efficient way to add support for all the unicode characters in pyparsing.
Could you post a branch with the fix (even if it's slow)?
I don't think I have it handy anymore, but the fix involves changing the lines here: https://github.com/allenporter/ical/blob/1c9540812c27ee5890e549dd5e97c182ce5d54fb/ical/parsing/unicode.py#L43 since these ranges are not correct. The discussion here https://github.com/pyparsing/pyparsing/discussions/491 talks about how to match all Unicode characters.
The code is trying to implement the RFC 5545 content-line grammar (https://datatracker.ietf.org/doc/html/rfc5545#section-3.1), which takes its non-ASCII character ranges from RFC 3629 (https://datatracker.ietf.org/doc/html/rfc3629):
UTF8-2    = %xC2-DF UTF8-tail
UTF8-3    = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
            %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4    = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
            %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail = %x80-BF
However, the code in ical's unicode.py does not implement that correctly at all, since it's matching a single character, not a sequence of characters.
I decided to start over and see if Gemini can help. Here is a solution it produced. I can try it again, but my worry is that all these ranges are really slow:
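One detail that may matter here: the RFC 3629 grammar describes bytes, while a parser working on a decoded Python `str` sees code points. The sketch below, standard library only, shows the difference and expresses the same set as code point ranges. It assumes the input is already decoded; `NON_US_ASCII` and `is_non_us_ascii` are illustrative names, not part of ical or pyparsing.

```python
import re

# In a decoded str, "ø" is a single code point (U+00F8) ...
ch = "ø"
print(len(ch))              # number of code points
# ... but its UTF-8 encoding is a UTF8-2 sequence: lead byte + UTF8-tail.
print(ch.encode("utf-8"))

# UTF8-2 / UTF8-3 / UTF8-4 together encode the code points U+0080..U+10FFFF,
# excluding the surrogate block U+D800..U+DFFF (ruled out by %xED %x80-9F).
NON_US_ASCII = re.compile(r"[\u0080-\ud7ff\ue000-\U0010ffff]")


def is_non_us_ascii(char: str) -> bool:
    """Return True if one decoded character falls in the NON-US-ASCII set."""
    return NON_US_ASCII.fullmatch(char) is not None


for sample in ("ø", "€", "😀", "F"):
    print(repr(sample), is_non_us_ascii(sample))
```

Seen this way, a byte-sequence grammar applied to decoded text would not be expected to match, because the tail bytes never appear as separate characters. A single character class over code points is one possible alternative, though whether it performs well inside pyparsing is untested here.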
from pyparsing import Char, Combine
# UTF8-tail = %x80-BF
UTF8_tail = Char(range(0x80, 0xC0))
# UTF8-2 = %xC2-DF UTF8-tail
UTF8_2 = Combine(Char(range(0xC2, 0xE0)) + UTF8_tail)
# UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
# %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8_3_part1 = Combine(Char(0xE0) + Char(range(0xA0, 0xC0)) + UTF8_tail)
UTF8_3_part2 = Combine(Char(range(0xE1, 0xED)) + UTF8_tail * 2)
UTF8_3_part3 = Combine(Char(0xED) + Char(range(0x80, 0xA0)) + UTF8_tail)
UTF8_3_part4 = Combine(Char(range(0xEE, 0xF0)) + UTF8_tail * 2)
UTF8_3 = UTF8_3_part1 | UTF8_3_part2 | UTF8_3_part3 | UTF8_3_part4
# UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
# %xF4 %x80-8F 2( UTF8-tail )
UTF8_4_part1 = Combine(Char(0xF0) + Char(range(0x90, 0xC0)) + UTF8_tail * 2)
UTF8_4_part2 = Combine(Char(range(0xF1, 0xF4)) + UTF8_tail * 3)
UTF8_4_part3 = Combine(Char(0xF4) + Char(range(0x80, 0x90)) + UTF8_tail * 2)
UTF8_4 = UTF8_4_part1 | UTF8_4_part2 | UTF8_4_part3
# NON-US-ASCII = UTF8-2 / UTF8-3 / UTF8-4
NON_US_ASCII = UTF8_2 | UTF8_3 | UTF8_4
# Example usage:
test_string_utf8_2 = b'\xc2\xa9'.decode('utf-8') # copyright symbol
test_string_utf8_3 = b'\xe2\x82\xac'.decode('utf-8') # Euro symbol
test_string_utf8_4 = b'\xf0\x9f\x98\x80'.decode('utf-8') # grinning face emoji
print(f"Matching '{test_string_utf8_2}' (UTF8-2):", NON_US_ASCII.parseString(test_string_utf8_2))
print(f"Matching '{test_string_utf8_3}' (UTF8-3):", NON_US_ASCII.parseString(test_string_utf8_3))
print(f"Matching '{test_string_utf8_4}' (UTF8-4):", NON_US_ASCII.parseString(test_string_utf8_4))
test_string_invalid = b'\xc1\x00'.decode('latin-1', errors='ignore') # Invalid UTF-8
try:
    NON_US_ASCII.parseString(test_string_invalid)
except Exception as e:
    print(f"Matching invalid string '{test_string_invalid}':", e)
I spent a few hours on this today and it didn't seem to work. I worry there is a limitation in pyparsing, so my next step is to file a focused bug report with that library to get additional help.