[Feature] Multibyte character matching
I'm sure this issue is well-known, but I couldn't find an existing issue articulating solutions.
The following scanner demonstrates the beginner-unfriendly behavior even when the appropriate locale is enabled.
%top{
#define _POSIX_C_SOURCE 200809L
}
%{
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
%}
%option noyywrap
%%
. {
    if(printf("%s, yyleng = %d\n", yytext, yyleng) < 0) {
        perror("Failed to print character");
        exit(EXIT_FAILURE);
    }
}
%%
int main(void) {
    if(!setlocale(LC_ALL, "")) {
        fputs("Failed to enable default locale\n", stderr);
        exit(EXIT_FAILURE);
    }
    yylex();
}
It appears that the period character . matches bytes, not characters, although there doesn't seem to be any reason, other than backwards compatibility, that flex couldn't use mblen, mbsrtowcs, and friends to do the right thing. On a system with a UTF-8 locale,
$ printf "🐻" | ./a.out
�, yyleng = 1
�, yyleng = 1
�, yyleng = 1
�, yyleng = 1
I would have expected this to print the bear character correctly and set yyleng = 4. Perhaps an %option mbs could be introduced to enable that behavior? Otherwise, it's unclear what the recommended way is to use Flex to match characters, rather than bytes, in the current locale.
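For reference, here is a minimal sketch of the grouping I have in mind, assuming the input is UTF-8 specifically rather than an arbitrary locale encoding. `utf8_seq_len` is a hypothetical helper of my own, not part of flex; a locale-general version would presumably call `mbrlen` instead. It only checks the structure of the sequence and does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Hypothetical helper, not part of flex: returns the length in bytes of
 * the UTF-8 sequence starting at s[0], or 0 if the sequence is invalid
 * or truncated. Assumes UTF-8 input; checks structure only (lead byte
 * followed by the right number of continuation bytes). */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80)
        len = 1;                    /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0)
        len = 2;                    /* 110xxxxx */
    else if ((s[0] & 0xF0) == 0xE0)
        len = 3;                    /* 1110xxxx */
    else if ((s[0] & 0xF8) == 0xF0)
        len = 4;                    /* 11110xxx */
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                   /* truncated sequence */

    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)  /* continuation bytes are 10xxxxxx */
            return 0;

    return len;
}
```

The bear is encoded as the four bytes F0 9F 90 BB, so this returns 4 for it, which is the yyleng I would expect from a character-aware `.`.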