Crashes with danish characters

Open AThomsen opened this issue 1 year ago • 6 comments

Describe the bug Google Calendar outputs a file containing the line

ATTACH;FILENAME=Fødselsdag_40.pdf:https://someurl.com

This gives the error

  File "/home/vscode/.local/lib/python3.12/site-packages/ical/parsing/parser.py", line 111, in parse_contentlines
    return [parser.parse_string(line, parse_all=True) for line in lines if line]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.12/site-packages/pyparsing/core.py", line 1219, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: Expected ':', found 'ødselsdag'  (at char 17), (line:1, col:18)

To Reproduce Create an ics file with the line above in an event and parse it.

Expected behavior 'ø' (and other non-ASCII characters) should be treated as normal word characters.

AThomsen avatar Apr 18 '25 13:04 AThomsen

Thanks, I'm able to reproduce this. I believe the issue may be in this code: https://github.com/allenporter/ical/blob/69eb3ce7aa218a0b5107406541273ef5ae94d323/ical/parsing/unicode.py#L43 which does not seem to implement the Unicode ranges correctly. I spent some time trying to make it work but didn't have any luck, so I will need to take another pass.

allenporter avatar Apr 20 '25 02:04 allenporter

I had a chance to try to fix this, but the performance was bad. I need to look for an efficient way to support all Unicode characters in pyparsing.

allenporter avatar May 12 '25 05:05 allenporter

> I had a chance to try to fix this, but the performance was bad. I need to look for an efficient way to support all Unicode characters in pyparsing.

Could you post a branch with the fix (even if it's slow)?

AThomsen avatar May 20 '25 05:05 AThomsen

I don't think I have it handy anymore, but the fix involves changing the lines here: https://github.com/allenporter/ical/blob/1c9540812c27ee5890e549dd5e97c182ce5d54fb/ical/parsing/unicode.py#L43 since these ranges are not correct. The discussion at https://github.com/pyparsing/pyparsing/discussions/491 covers how to match all Unicode characters.

The code is trying to capture the RFC 5545 spec here https://datatracker.ietf.org/doc/html/rfc5545#section-3.1 which relies on the character ranges described in RFC 3629 https://datatracker.ietf.org/doc/html/rfc3629:

   UTF8-2      = %xC2-DF UTF8-tail
   UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
   UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                 %xF4 %x80-8F 2( UTF8-tail )
   UTF8-tail   = %x80-BF

However, the code in ical's unicode.py does not implement that correctly at all, since it matches a single character, not a sequence of characters.
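
To make the single-character versus sequence distinction concrete: the RFC 3629 grammar describes byte sequences, while a decoded Python str holds one code point per character. The following is a minimal sketch that transcribes the productions above using only the standard library `re` module on bytes. It is not pyparsing and not the ical code, and the names `UTF8_TAIL`, `NON_US_ASCII_BYTES`, and `is_non_us_ascii` are illustrative.

```python
import re

# Byte-level transcription of the RFC 3629 productions quoted above.
UTF8_TAIL = rb"[\x80-\xBF]"
UTF8_2 = rb"[\xC2-\xDF]" + UTF8_TAIL
UTF8_3 = (
    rb"(?:\xE0[\xA0-\xBF]" + UTF8_TAIL
    + rb"|[\xE1-\xEC]" + UTF8_TAIL + rb"{2}"
    + rb"|\xED[\x80-\x9F]" + UTF8_TAIL
    + rb"|[\xEE-\xEF]" + UTF8_TAIL + rb"{2})"
)
UTF8_4 = (
    rb"(?:\xF0[\x90-\xBF]" + UTF8_TAIL + rb"{2}"
    + rb"|[\xF1-\xF3]" + UTF8_TAIL + rb"{3}"
    + rb"|\xF4[\x80-\x8F]" + UTF8_TAIL + rb"{2})"
)
# NON-US-ASCII = UTF8-2 / UTF8-3 / UTF8-4
NON_US_ASCII_BYTES = re.compile(
    rb"(?:" + UTF8_2 + rb"|" + UTF8_3 + rb"|" + UTF8_4 + rb")"
)


def is_non_us_ascii(ch: str) -> bool:
    """True if the UTF-8 encoding of one character matches NON-US-ASCII."""
    return NON_US_ASCII_BYTES.fullmatch(ch.encode("utf-8")) is not None


encoded = "ø".encode("utf-8")
# One str character, but a two-byte sequence once encoded.
print(len("ø"), len(encoded), encoded)
print(is_non_us_ascii("ø"), is_non_us_ascii("F"))
```

Here 'ø' is a single character in the decoded string but a lead byte plus one tail byte on the wire, which is the form the ABNF describes.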

allenporter avatar May 20 '25 14:05 allenporter

I decided to start over and see if Gemini can help. Here is a solution it produced. I can try it again, but my worry is that all these ranges are really slow:

from pyparsing import Char, Combine

# UTF8-tail   = %x80-BF
UTF8_tail = Char(range(0x80, 0xC0))

# UTF8-2      = %xC2-DF UTF8-tail
UTF8_2 = Combine(Char(range(0xC2, 0xE0)) + UTF8_tail)

# UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
#                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8_3_part1 = Combine(Char(0xE0) + Char(range(0xA0, 0xC0)) + UTF8_tail)
UTF8_3_part2 = Combine(Char(range(0xE1, 0xED)) + UTF8_tail * 2)
UTF8_3_part3 = Combine(Char(0xED) + Char(range(0x80, 0xA0)) + UTF8_tail)
UTF8_3_part4 = Combine(Char(range(0xEE, 0xF0)) + UTF8_tail * 2)
UTF8_3 = UTF8_3_part1 | UTF8_3_part2 | UTF8_3_part3 | UTF8_3_part4

# UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
#                 %xF4 %x80-8F 2( UTF8-tail )
UTF8_4_part1 = Combine(Char(0xF0) + Char(range(0x90, 0xC0)) + UTF8_tail * 2)
UTF8_4_part2 = Combine(Char(range(0xF1, 0xF4)) + UTF8_tail * 3)
UTF8_4_part3 = Combine(Char(0xF4) + Char(range(0x80, 0x90)) + UTF8_tail * 2)
UTF8_4 = UTF8_4_part1 | UTF8_4_part2 | UTF8_4_part3

# NON-US-ASCII  = UTF8-2 / UTF8-3 / UTF8-4
NON_US_ASCII = UTF8_2 | UTF8_3 | UTF8_4

# Example usage:
test_string_utf8_2 = b'\xc2\xa9'.decode('utf-8')  # copyright symbol
test_string_utf8_3 = b'\xe2\x82\xac'.decode('utf-8')  # Euro symbol
test_string_utf8_4 = b'\xf0\x9f\x98\x80'.decode('utf-8')  # grinning face emoji

print(f"Matching '{test_string_utf8_2}' (UTF8-2):", NON_US_ASCII.parseString(test_string_utf8_2))
print(f"Matching '{test_string_utf8_3}' (UTF8-3):", NON_US_ASCII.parseString(test_string_utf8_3))
print(f"Matching '{test_string_utf8_4}' (UTF8-4):", NON_US_ASCII.parseString(test_string_utf8_4))

test_string_invalid = b'\xc1\x00'.decode('latin-1', errors='ignore') # Invalid UTF-8
try:
    NON_US_ASCII.parseString(test_string_invalid)
except Exception as e:
    print(f"Matching invalid string '{test_string_invalid}':", e)
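
The ranges above are byte values, while a parser working on decoded str sees whole code points. One possible alternative, offered only as a sketch under that assumption, is to express NON-US-ASCII as a code point range (U+0080 to U+10FFFF, minus the surrogate block) instead of byte sequences. The example uses the standard library `re` module rather than pyparsing and has not been benchmarked; `build_contentline_re` is a hypothetical helper, and the SAFE-CHAR and QSAFE-CHAR ranges are taken from RFC 5545 section 3.1.

```python
import re

# NON-US-ASCII as decoded code points (assumption: input is str, not bytes).
NON_US_ASCII = r"\u0080-\uD7FF\uE000-\U0010FFFF"

# ASCII parts of SAFE-CHAR and QSAFE-CHAR from RFC 5545 section 3.1.
ASCII_SAFE = r" \t\x21\x23-\x2B\x2D-\x39\x3C-\x7E"
ASCII_QSAFE = r" \t\x21\x23-\x7E"


def build_contentline_re(allow_non_ascii: bool) -> re.Pattern:
    """Build a simplified contentline matcher: name *(";" param) ":" value."""
    extra = NON_US_ASCII if allow_non_ascii else ""
    safe = "[" + ASCII_SAFE + extra + "]"
    qsafe = "[" + ASCII_QSAFE + extra + "]"
    # param-value = paramtext / quoted-string
    param_value = '(?:"' + qsafe + '*"|' + safe + "*)"
    param = ";[A-Za-z0-9-]+=" + param_value + "(?:," + param_value + ")*"
    return re.compile(
        "(?P<name>[A-Za-z0-9-]+)(?P<params>(?:" + param + ")*):(?P<value>.*)"
    )


LINE = "ATTACH;FILENAME=Fødselsdag_40.pdf:https://someurl.com"

ascii_only = build_contentline_re(allow_non_ascii=False).fullmatch(LINE)
relaxed = build_contentline_re(allow_non_ascii=True).fullmatch(LINE)

print("ASCII-only match:", ascii_only)
if relaxed:
    print(relaxed["name"], relaxed["params"], relaxed["value"])
```

With the ASCII-only class the reported line fails to match, which mirrors the ParseException in the report; with the wider class it splits into name, parameters, and value.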

allenporter avatar May 20 '25 14:05 allenporter

I spent a few hours on this today and it didn't seem to work. I worry there is a limitation in pyparsing, so next I will need to file a focused bug report with that library for additional help.

allenporter avatar May 21 '25 05:05 allenporter