
[Feature] Multibyte character matching

Open jscott0 opened this issue 3 years ago • 0 comments

I'm sure this issue is well-known, but I couldn't find an existing issue articulating solutions.

The following scanner demonstrates the beginner-unfriendly behavior even when the appropriate locale is enabled.

%top{
#define _POSIX_C_SOURCE 200809L
}

%{
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
%}

%option noyywrap

%%
. {
	if(printf("%s, yyleng = %d\n", yytext, yyleng) < 0) {
		perror("Failed to print character");
		exit(EXIT_FAILURE);
	}
}

%%

int main(void) {
	if(!setlocale(LC_ALL, "")) {
		fputs("Failed to enable default locale\n", stderr);
		exit(EXIT_FAILURE);
	}
	yylex();
}

It appears that the period character . matches bytes, not characters, although there doesn't seem to be any reason, other than backwards compatibility, that flex cannot use mblen, mbsrtowcs, and friends to do the right thing. On a system with a UTF-8 locale:

$ printf "🐻" | ./a.out
�, yyleng = 1
�, yyleng = 1
�, yyleng = 1
�, yyleng = 1

I would have expected this to print the bear character correctly and set yyleng = 4. Perhaps an %option mbs could be introduced to get this behavior? Otherwise, it's unclear what the recommended way is to match characters, rather than bytes, in the current locale with Flex.
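For illustration, here is one possible workaround, sketched under the assumption that the input is always UTF-8 (so it is not locale-aware the way mblen would be): classify the lead byte and check the continuation bytes by hand. The helper names utf8_seq_len and utf8_count are hypothetical, not part of flex, and the check is simplified; it does not reject overlong three- or four-byte forms or surrogates.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a flex API). Returns the length in bytes (1-4)
 * of the UTF-8 sequence starting at s, or 0 if the first byte is not a
 * valid lead byte, the sequence is truncated, or a continuation byte is
 * malformed. Simplified: overlong 3/4-byte forms and surrogates pass.
 *
 * A roughly equivalent flex pattern, assuming an 8-bit scanner, would be:
 *   [\x00-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf4][\x80-\xbf]{3}
 */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80)
        return 1;                       /* ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF)
        len = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF)
        len = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4)
        len = 4;
    else
        return 0;                       /* stray continuation or invalid lead byte */

    if (n < len)
        return 0;                       /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a 10xxxxxx continuation byte */
    }
    return len;
}

/*
 * Counts characters in a buffer of n bytes. Each byte that does not start
 * a well-formed sequence is counted as one character on its own.
 */
size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t count = 0, pos = 0;

    while (pos < n) {
        size_t len = utf8_seq_len(s + pos, n - pos);
        pos += len ? len : 1;
        count++;
    }
    return count;
}
```

With a rule built from the pattern in the comment in place of the bare . rule, the four bytes of the bear would be expected to match as a single token with yyleng = 4. This only covers UTF-8, so it is not a substitute for real multibyte support in the current locale.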

jscott0 Mar 21 '22 09:03