spamassassin
Improve A-Z replace_tag definitions
Problems with old definitions:
- Tries to match UTF-8 and Latin-1 characters in the same expression, e.g. <A> includes the byte sequence for "ã" in both Latin-1 (\xE3) and UTF-8 (\xC3\xA3). This seems like a good thing at first, but it can cause false positives when the text is UTF-8 and the pattern is looking for Latin-1, because a Latin-1 byte such as \xE3 is also the lead byte of unrelated multi-byte UTF-8 characters
- Contains redundant characters. e.g. \xE3 appears multiple times in <A>
- Contains unnecessary characters. e.g. \xE3 also appears in <V> and <Y>
- Patterns are case-insensitive, e.g. <I> is meant to match lowercase L as a look-alike, but because the pattern is case-insensitive it also matches uppercase L
- Some look-alike characters aren't matched, e.g. \xEA\x93\xAE = LISU LETTER A (U+A4EE)
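The first, fourth and last problems can be reproduced at the byte level. The following is only a sketch in Python, not the actual SpamAssassin rules: OLD_A, NEW_A, OLD_I and NEW_I are hypothetical, heavily shortened stand-ins for the <A> and <I> definitions, used to show how a Latin-1 byte in the pattern can hit an unrelated UTF-8 character and how case-insensitivity widens a match.

```python
import re

# Hypothetical, simplified stand-ins for the <A> tag (the real definitions are longer).
OLD_A = re.compile(rb"(?:A|\xC3\xA3|\xE3)")  # mixes UTF-8 and Latin-1 bytes for "ã"
NEW_A = re.compile(rb"(?:A|\xC3\xA3)")       # UTF-8 only

# HIRAGANA LETTER A (U+3042) is E3 81 82 in UTF-8, so it starts with byte \xE3.
utf8_text = "あ".encode("utf-8")

old_hit = OLD_A.search(utf8_text) is not None  # Latin-1 \xE3 matches the lead byte
new_hit = NEW_A.search(utf8_text) is not None  # UTF-8-only pattern does not match

# Hypothetical stand-ins for <I>: lowercase L is intended as a look-alike of I.
OLD_I = re.compile(rb"[Il1]", re.IGNORECASE)   # case-insensitive, as in the old rules
NEW_I = re.compile(rb"[Il1]")                  # case-sensitive

old_matches_upper_l = OLD_I.search(b"L") is not None
new_matches_upper_l = NEW_I.search(b"L") is not None

# UTF-8 bytes of LISU LETTER A (U+A4EE), one of the added look-alikes.
lisu_a = "\uA4EE".encode("utf-8")
```

Here the old-style pattern reports a match on Japanese text that contains no "A" look-alike at all, while the UTF-8-only pattern does not.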
Changes:
- All byte sequences are UTF-8 only (no Latin-1)
- All patterns are case-sensitive
- Removed redundant and unnecessary characters
- Added additional look-alike characters
Note: these definitions are based on work I did on Text::ASCII::Convert
You did a great job, but I don't think ISO-8859 support can be removed. The documentation says:
When using rules with extended characters / diacritics, you should always use both ISO-8859-1 / UTF-8 encodings.
Body content can be different depending on normalize_charset setting.
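The concern can be illustrated with a small Python sketch. It assumes, purely for illustration, that a rule may see the body as raw Latin-1 bytes in some configurations. Whether that actually happens depends on normalize_charset and on the message itself. NEW_A is again a hypothetical, shortened UTF-8-only stand-in for <A>.

```python
import re

# Hypothetical, shortened UTF-8-only stand-in for the new <A> definition.
NEW_A = re.compile(rb"(?:A|\xC3\xA3)")

as_utf8 = "ã".encode("utf-8")      # b'\xc3\xa3'
as_latin1 = "ã".encode("latin-1")  # b'\xe3'

matches_utf8 = NEW_A.search(as_utf8) is not None      # matched
matches_latin1 = NEW_A.search(as_latin1) is not None  # missed: no Latin-1 alternative
```

If bodies can reach the rules undecoded, a UTF-8-only definition would miss the Latin-1 form of the same character, which is the trade-off against the false positives described above.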