
Improve A-Z replace_tag definitions

Open fkoyer opened this issue 3 months ago • 2 comments

Problems with old definitions:

  • Tries to match UTF-8 and Latin-1 characters in the same expression, e.g. <A> includes the byte sequence for "ã" in both Latin-1 (\xE3) and UTF-8 (\xC3\xA3). This seems like a good thing at first, but it can cause false positives when the text is UTF-8 and a Latin-1 byte in the pattern matches part of an unrelated multi-byte UTF-8 sequence
  • Contains redundant characters, e.g. \xE3 appears multiple times in <A>
  • Contains unnecessary characters, e.g. \xE3 also appears in <V> and <Y>
  • Patterns are case-insensitive, e.g. <I> attempts to match lowercase L, but because it is case-insensitive it also matches uppercase L
  • Some look-alike characters aren't matched, e.g. \xEA\x93\xAE = LISU LETTER A (U+A4EE)
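
To make the first and fourth problems concrete, here is a minimal Python sketch. The patterns `OLD_A`, `NEW_A`, and `OLD_I` are hypothetical, heavily simplified stand-ins and not the actual tag definitions; they only model how a Latin-1 byte can hit the lead byte of an unrelated UTF-8 sequence, and how case-insensitivity widens a match.

```python
import re

# Hypothetical, simplified stand-in for an old-style <A>:
# ASCII "A", Latin-1 "ã" (\xE3) and UTF-8 "ã" (\xC3\xA3) in one expression.
OLD_A = re.compile(rb"(?:A|\xE3|\xC3\xA3)")

# Hypothetical UTF-8-only variant of the same expression.
NEW_A = re.compile(rb"(?:A|\xC3\xA3)")

# HIRAGANA LETTER A (U+3042) is E3 81 82 in UTF-8, so its lead byte
# is the same byte that Latin-1 uses for "ã".
hiragana_a = "\u3042".encode("utf-8")

# The Latin-1 alternative matches the lead byte: a false positive.
old_hits_hiragana = OLD_A.search(hiragana_a) is not None
# The UTF-8-only variant does not match it.
new_hits_hiragana = NEW_A.search(hiragana_a) is not None

# Hypothetical stand-in for <I>: "I" plus its look-alike, lowercase "l".
OLD_I = re.compile(rb"[Il]", re.IGNORECASE)  # case-insensitive
NEW_I = re.compile(rb"[Il]")                 # case-sensitive

# Case-insensitive matching also accepts uppercase "L", which is not a look-alike.
old_hits_upper_l = OLD_I.search(b"L") is not None
new_hits_upper_l = NEW_I.search(b"L") is not None

print(old_hits_hiragana, new_hits_hiragana, old_hits_upper_l, new_hits_upper_l)
```

In this sketch only the old-style patterns match the unintended input.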

Changes:

  • All byte sequences are UTF-8 only (no Latin-1)
  • All patterns are case-sensitive
  • Removed redundant and unnecessary characters
  • Added additional look-alike characters
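
As a rough illustration of what "UTF-8 only" byte sequences look like, the following Python sketch builds a byte-level alternation from a list of characters. The helper `utf8_alternation` and the three-character list are assumptions made for this example; they are not the script or the character set used to produce the new definitions.

```python
import re

def utf8_alternation(chars):
    """Build a case-sensitive alternation of UTF-8 byte sequences.

    Hypothetical helper: ASCII letters and digits are kept literal,
    everything else is written as \\xNN escapes of its UTF-8 bytes.
    """
    parts = []
    for ch in chars:
        encoded = ch.encode("utf-8")
        if len(encoded) == 1 and ch.isalnum():
            parts.append(ch)
        else:
            parts.append("".join(f"\\x{b:02X}" for b in encoded))
    return "(?:" + "|".join(parts) + ")"

# Illustrative characters only: "A", "Ã" (U+00C3), LISU LETTER A (U+A4EE).
pattern = utf8_alternation(["A", "\u00C3", "\uA4EE"])
print(pattern)

# The generated text is itself a valid byte-level regular expression.
compiled = re.compile(pattern.encode("ascii"))
matches_lisu_a = compiled.search("\uA4EE".encode("utf-8")) is not None
matches_lower_a = compiled.search(b"a") is not None  # case-sensitive, so no match
```

Generating the escapes from code points, rather than typing bytes by hand, is one way to avoid the duplicated and stray bytes listed under the problems above.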

fkoyer, Oct 08 '25 01:10

Note: these definitions are based on work I did on Text::ASCII::Convert

fkoyer, Oct 08 '25 01:10

You did a great job, but I don't think ISO-8859 support can be removed. The documentation says:

When using rules with extended characters / diacritics, you should always use both ISO-8859-1 / UTF-8 encodings.
Body content can be different depending on normalize_charset setting.
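
A small Python sketch of that concern, assuming only that the same text can reach a rule as either Latin-1 or UTF-8 bytes. The patterns and the sample word are hypothetical, and this does not model how SpamAssassin itself decodes a message.

```python
import re

# Hypothetical patterns for "ã": UTF-8 only, and both encodings.
UTF8_ONLY = re.compile(rb"\xC3\xA3")
BOTH = re.compile(rb"(?:\xC3\xA3|\xE3)")

text = "n\u00E3o"  # "não"
as_utf8 = text.encode("utf-8")      # b"n\xc3\xa3o"
as_latin1 = text.encode("latin-1")  # b"n\xe3o"

# (matches UTF-8 body, matches Latin-1 body)
utf8_only_results = (
    UTF8_ONLY.search(as_utf8) is not None,
    UTF8_ONLY.search(as_latin1) is not None,
)
both_results = (
    BOTH.search(as_utf8) is not None,
    BOTH.search(as_latin1) is not None,
)
print(utf8_only_results, both_results)
```

In this sketch the UTF-8-only pattern misses the Latin-1 body, which is the trade-off against the false positives described in the issue.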

Fneufneu, Oct 17 '25 12:10