spamassassin Improve A-Z replace

Problems with old definitions:

Mixes UTF-8 and Latin-1 byte sequences in the same expression, e.g. <A> includes the byte sequence for "ã" in both Latin-1 (\xE3) and UTF-8 (\xC3\xA3). This looks helpful at first, but it causes false positives when the text is UTF-8 and a Latin-1 alternative matches part of an unrelated multi-byte character (\xE3 is also the lead byte of every UTF-8 sequence for U+3000 through U+3FFF).
Contains redundant characters, e.g. \xE3 appears multiple times in <A>.
Contains unnecessary characters, e.g. \xE3 also appears in <V> and <Y>.
Patterns are case-insensitive, e.g. <I> is meant to match lowercase L, but because matching is case-insensitive it also matches uppercase L.
Some look-alike characters aren't matched, e.g. \xEA\x93\xAE = LISU LETTER A (U+A4EE).
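
A minimal Python sketch of the first, fourth, and fifth problems. The patterns here are hypothetical, heavily simplified stand-ins for the tags (the real definitions are much longer), and Python's `re` is used only for illustration; SpamAssassin itself evaluates these as Perl regexes.

```python
import re

# Hypothetical stand-in for an old-style <A>: mixes the Latin-1 byte for
# "ã" (\xE3) with its UTF-8 sequence (\xC3\xA3) and is case-insensitive.
OLD_A = re.compile(rb"(?:a|\xE3|\xC3\xA3)", re.IGNORECASE)

# Hypothetical UTF-8-only, case-sensitive <A>: plain "A", "Ã" (U+00C3),
# and LISU LETTER A (U+A4EE).
NEW_A = re.compile(rb"(?:A|\xC3\x83|\xEA\x93\xAE)")

# Hypothetical <I> that lists lowercase "l" as a look-alike for "I".
OLD_I = re.compile(rb"(?:I|l)", re.IGNORECASE)
NEW_I = re.compile(rb"(?:I|l)")

# HIRAGANA LETTER A (U+3042) is \xE3\x81\x82 in UTF-8, so the Latin-1
# alternative \xE3 matches its lead byte: a false positive.
hiragana = "\u3042".encode("utf-8")
old_hits_hiragana = OLD_A.search(hiragana) is not None
new_hits_hiragana = NEW_A.search(hiragana) is not None

# Case-insensitivity makes the lowercase "l" alternative match "L" too.
old_hits_upper_l = OLD_I.search(b"L") is not None
new_hits_upper_l = NEW_I.search(b"L") is not None

# The added look-alike is matched by its UTF-8 byte sequence.
lisu_a = "\uA4EE".encode("utf-8")
new_hits_lisu = NEW_A.search(lisu_a) is not None

print(old_hits_hiragana, new_hits_hiragana)  # True False
print(old_hits_upper_l, new_hits_upper_l)    # True False
print(new_hits_lisu)                         # True
```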

Changes:

All byte sequences are UTF-8 only (no Latin-1)
All patterns are case-sensitive
Removed redundant and unnecessary characters
Added more look-alike characters

Oct 08 '25 01:10 fkoyer

Note: these definitions are based on work I did on Text::ASCII::Convert

Oct 08 '25 01:10 fkoyer

You did a great job, but I don't think ISO-8859 support can be removed. The documentation says:

When using rules with extended characters / diacritics, you should always use both ISO-8859-1 / UTF-8 encodings.
Body content can be different depending on normalize_charset setting.
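
To illustrate the concern, here is a small Python sketch of how the same character reaches a pattern as different bytes depending on the encoding. It assumes, as a simplification, that without charset normalization a Latin-1 message body is matched as raw Latin-1 bytes; the patterns are hypothetical and not taken from the actual tag definitions.

```python
import re

# The same character, encoded both ways.
latin1_body = "ã".encode("latin-1")  # b"\xe3"
utf8_body = "ã".encode("utf-8")      # b"\xc3\xa3"

# Hypothetical UTF-8-only pattern versus one that lists both encodings.
utf8_only = re.compile(rb"\xC3\xA3")
both_encodings = re.compile(rb"(?:\xC3\xA3|\xE3)")

def matches(pattern, body):
    """Return True if the byte pattern is found anywhere in the body."""
    return pattern.search(body) is not None

# The UTF-8-only pattern misses the un-normalized Latin-1 body.
print(matches(utf8_only, utf8_body))         # True
print(matches(utf8_only, latin1_body))       # False
print(matches(both_encodings, latin1_body))  # True
```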

Oct 17 '25 12:10 Fneufneu

Improve A-Z replace_tag definitions