es-shell Locale-related variables have no effect

On POSIX systems, environment variables such as LC_COLLATE influence the behavior of the setlocale() function in C, with the LC_ALL environment variable taking precedence over any other LC_xxx environment variables. The LANG environment variable is used as a fallback when one of the specific LC_xxx environment variables is unset. See the Internationalization Variables section in SUSv3 for more information about how these environment variables affect behavior.
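To make the precedence concrete, here is a minimal sketch of the lookup order described above (LC_ALL, then the category variable, then LANG). The names pick_locale and locale_from_env are hypothetical and exist only for illustration; a real program gets this behavior for free from setlocale(category, ""), and what happens when nothing is set is implementation-defined.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A variable counts only if it is defined and non-empty. */
static int is_set(const char *v) {
	return v != NULL && strlen(v) > 0;
}

/* Hypothetical helper: pick the locale name for one category, given the
 * values of LC_ALL, the category variable (e.g. LC_COLLATE), and LANG.
 * Returns NULL when none is set; the default is then up to the system. */
const char *pick_locale(const char *lc_all, const char *lc_cat, const char *lang) {
	if (is_set(lc_all))
		return lc_all;	/* LC_ALL overrides everything */
	if (is_set(lc_cat))
		return lc_cat;	/* then the specific category */
	if (is_set(lang))
		return lang;	/* LANG is only a fallback */
	return NULL;
}

/* Same lookup, reading the real process environment. */
const char *locale_from_env(const char *category_var) {
	assert(category_var != NULL);
	return pick_locale(getenv("LC_ALL"), getenv(category_var), getenv("LANG"));
}
```

For example, locale_from_env("LC_COLLATE") names the locale that setlocale(LC_COLLATE, "") would be asked to load.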

The LC_COLLATE category matters when working with readline's filename completion and when filenames are sorted by utilities like ls. It also affects the POSIX glob() function and wildcard expansion as a result. For example, a partially patched es might do things like this:

; local (LC_COLLATE=C) { ls ; echo * }
A B a b
A B a b
; local (LC_COLLATE=en_US.UTF-8) { ls ; echo * }
a A b B    # ok - external binary
A B a b    # not ok - wildcard expansion should be affected

Similarly, functions like isprint() in Sconv() are affected by the LC_CTYPE category, and the strings accepted as "yes" or "no" by some utilities may be affected by the LC_MESSAGES category:

; locale yesstr
yes:y:YES:Y
; local (LC_MESSAGES = de_DE.UTF-8) { locale yesstr }
ja:j:JA:J:yes:y:YES:Y

This is one of those hidden corners of shell implementation that nobody thinks about until it rears its head.

Mar 24 '25 04:03 memreflect

It looks to me like the only place LC_COLLATE could have an effect in es is in qstrcmp, where the fix is to simply replace strcmp with strcoll (That's also the only place we use strcmp that isn't just testing for equality). Super easy.
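As a rough illustration of the strcmp-to-strcoll swap, here is a standalone sketch of a qsort comparator that follows LC_COLLATE. It is not the actual qstrcmp from es; collate_cmp and sort_names are hypothetical names, and the setlocale() call inside sort_names is only there to make the example self-contained (it changes the locale of the whole process).

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator over an array of char *. strcoll() orders strings
 * according to the current LC_COLLATE locale; in the "C" locale it
 * gives the same order as strcmp(). */
static int collate_cmp(const void *a, const void *b) {
	const char *s = *(const char *const *)a;
	const char *t = *(const char *const *)b;
	return strcoll(s, t);
}

/* Hypothetical helper: sort names in place using the collation order of
 * the named locale. Returns 0 if that locale is not available. */
int sort_names(const char **names, size_t n, const char *collate_locale) {
	assert(names != NULL || n == 0);
	if (setlocale(LC_COLLATE, collate_locale) == NULL)
		return 0;
	if (n > 1)
		qsort(names, n, sizeof names[0], collate_cmp);
	return 1;
}
```

With the "C" locale this produces A B a b for the four files in the example above; with a locale such as en_US.UTF-8 (if installed) the order typically interleaves upper and lower case.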

isprint() is trickier because, as you point out in #175, we have to account for wide characters, which adds a whole mess of complexity.
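To give a sense of the extra complexity, here is a sketch of a locale-aware printability check that decodes multibyte characters with mbrtowc() and tests them with iswprint(). It is not taken from Sconv(); count_unprintable is a hypothetical helper, and treating each undecodable byte as one unprintable character is an assumption made for the example.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the characters in s that the current
 * LC_CTYPE locale does not consider printable. Bytes that cannot be
 * decoded as a multibyte character are counted one at a time. */
size_t count_unprintable(const char *s) {
	mbstate_t st;
	size_t left = strlen(s), bad = 0;

	memset(&st, 0, sizeof st);
	while (left > 0) {
		wchar_t wc;
		size_t n = mbrtowc(&wc, s, left, &st);
		if (n == (size_t)-1 || n == (size_t)-2) {
			/* invalid or truncated sequence: skip one byte
			 * and start again from a clean state */
			memset(&st, 0, sizeof st);
			bad++;
			s++;
			left--;
			continue;
		}
		if (n == 0)
			n = 1;	/* decoded a NUL; should not occur within strlen(s) */
		if (!iswprint((wint_t)wc))
			bad++;
		s += n;
		left -= n;
	}
	return bad;
}
```

The result depends on the locale in effect, so the caller is expected to have called setlocale(LC_CTYPE, "") (or equivalent) beforehand; without that call a C program runs in the "C" locale.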

ISTM we could go piecemeal -- add setlocale and strcoll in one change and then fix isprint in another. (AFAICT, LC_MESSAGES doesn't actually affect anything es does, though maybe I'm missing something?)

If we went through the effort of adding at least some wchar_t support, then I'm curious what else could be done for localization. I guess there's text localization for exceptions and other error messages (frankly, not highly likely for us to pursue any time soon). Is there something that could reasonably be done to better support parsing code written in other (especially non-ASCII-compatible, like UTF-8) encodings? I'm just spitballing.

Mar 31 '25 20:03 jpco

It looks to me like the only place LC_COLLATE could have an effect in es is in qstrcmp, where the fix is to simply replace strcmp with strcoll (That's also the only place we use strcmp that isn't just testing for equality). Super easy.

And with filename completion when readline is enabled. Environment variables like LC_COLLATE can be changed at run-time, but currently that has no effect on the shell itself, because assigning to the variable doesn't cause the shell to call setlocale().

isprint() is trickier because, as you point out in #175, we have to account for wide characters, which adds a whole mess of complexity.

ISTM we could go piecemeal -- add setlocale and strcoll in one change and then fix isprint in another.

Yes, isprint() is definitely a separate, yet related issue, and tackling it separately is probably a good idea. Again, modifying LC_CTYPE at run-time can affect the output of var for example, but currently just setting LC_CTYPE in an interactive session or a script doesn't change the results.

(AFAICT, LC_MESSAGES doesn't actually affect anything es does, though maybe I'm missing something?)

I think that's correct. Readline apparently doesn't use it for messages like "Display all NN possibilities? (y or n)", thankfully. GNU Bison also seems to require some special set-up with GNU Gettext to deliver messages like "syntax error" in one's native language as well, which es doesn't offer as an option. In short, there's currently nothing in es that depends on it, and setting LC_MESSAGES is enough for an external program to take advantage of it if capable/desired.

If we went through the effort of adding at least some wchar_t support, then I'm curious what else could be done for localization. I guess there's text localization for exceptions and other error messages (frankly, not highly likely for us to pursue any time soon).

I think my brain was just stuck in thinking "Setting LC_* should be calling setlocale() if it's one of the standard locale categories," so you can forget I mentioned it. I don't consider that feature critically important either.

Is there something that could reasonably be done to better support parsing code written in other (especially non-ASCII-compatible, like UTF-8) encodings? I'm just spitballing.

The dnw array explicitly forbids variable names that start with a byte value of 0x7B or greater, meaning you cannot use $😎 for example because it starts with 0xF0. But dnw is ignored if you quote the variable name (not sure if this is a bug in both rc and es, but it works to our benefit):

; 😎 = sunny!
; echo 'Today''s forecast is:' $'😎'
Today's forecast is: sunny!

And because fn-😎 starts with f, it is already a valid variable name, meaning you can use $fn-😎 without quoting (yes, executing the 😎 function works.)

I wonder what would happen if the dnw array was dropped entirely in favor of nw, which doesn't forbid bytes beyond ASCII... Or maybe it's better to find a way to sneak is[w]alpha() in there?

Apr 01 '25 06:04 memreflect

I wonder what would happen if the dnw array was dropped entirely in favor of nw, which doesn't forbid bytes beyond ASCII...

The differences between dnw and nw in the 7-bit ASCII range make me think that would cause issues. Easier would be to just clear all the dnw bits above 127, though that certainly has some edge cases -- for example, non-ASCII spaces would start being considered part of variable names by default. But that's already a problem outside of variable names, so I don't think it would cost much in practical use.

; fn-greet = echo hi
: ~/git/es-shell; greet buddy
hi buddy
: ~/git/es-shell; greet buddy  # greet<U+2002>buddy, "en space"
greet buddy: No such file or directory

Surely nobody is trying to actually use es with non-ASCII spaces if they're handled like this...
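A small sketch of that trade-off, using a simplified stand-in rather than the real dnw table: here a name byte is an ASCII letter, digit, or underscore, and a flag decides whether bytes above 127 also count. Both ends_name and name_len are hypothetical names, and the real table in es treats more ASCII characters specially.

```c
#include <stddef.h>

/* Simplified stand-in for a dnw-style lookup: nonzero means the byte
 * terminates an unquoted variable name. */
static int ends_name(unsigned char c, int allow_high_bytes) {
	if (c >= 0x80)
		return !allow_high_bytes;
	return !((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
		|| (c >= '0' && c <= '9') || c == '_');
}

/* Number of leading bytes of s that would be read as a variable name. */
size_t name_len(const char *s, int allow_high_bytes) {
	size_t n = 0;
	while (s[n] != '\0' && !ends_name((unsigned char)s[n], allow_high_bytes))
		n++;
	return n;
}
```

With high bytes allowed, the four UTF-8 bytes of the emoji become a valid name, but so do the three bytes of U+2002: name_len("greet\xE2\x80\x82" "buddy", 1) covers the whole string, which is the edge case shown above.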

Properly handling non-ASCII whitespace (both word breaks and line breaks) sounds like a neat longer-term goal. I wonder how far that could go, though. Would it be useful to handle non-ASCII curly braces? Non-ASCII parentheses? (Actually, I do think it would be neat to be able to use a real λ for lambda expressions...) Would we try to print these parsed syntax characters back out in their original form, or as their "understood" ASCII representations? Things sure do start to get tricky.

Apr 01 '25 18:04 jpco

https://github.com/rakitzis/rc/commit/864cac55e8c0796a168762fd73450b764775f34e is relevant to this -- there needs to be some way to set the locale after running .esrc. In rc, they just put the call to setlocale() later than startup, but in es we might consider doing something with a settor function primitive like $&setlocale.

May 05 '25 14:05 jpco

I hacked up a half-working demo. I think this design, in the form of an initial.es excerpt, is nice:

#	The $&setlocale primitive is used to update the shell's internal idea
#	of the current locale.  There are several locale environment variables
#	but not all such variables affect shell execution, so we define settor
#	functions for a subset of them.

set-LANG	= $&setlocale LANG
set-LC_ALL	= $&setlocale LC_ALL
set-LC_COLLATE	= $&setlocale LC_COLLATE
set-LC_CTYPE	= $&setlocale LC_CTYPE
set-LC_MESSAGES	= $&setlocale LC_MESSAGES

Called with no arguments, $&setlocale is just setlocale(LC_ALL, "").

More of a challenge is the actual implementation. The quick-and-easy way, which also spares us from having to understand locale semantics in much detail, is to do something like

setenv("LC_WHATEVER", "whatever_value", 1);
result = setlocale(LC_ALL, "");
setenv("LC_WHATEVER", "previous_value", 1);
handle(result);

This lets setlocale handle all the details of which locale categories exist on the system, which environment variables affect what, and everything else, which is great. Unfortunately our hacked *env() functions are directly in our own way -- setlocale() doesn't seem to respect our hacked getenv() at all. (I wouldn't be surprised if no libc functions did, due to things to do with linking that I don't understand.) So here is an implementation that does seem to work, but involves some environ-hacking that is... terrible (and also disallowed by POSIX, see getenv(3p)):

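extern char **environ;

PRIM(setlocale) {
	Push sp, p;
	char *r;
	Vector *env;
	char **oe = environ;
	if (list != NULL) {
		/* disable the settor while binding the new value, to avoid recursion */
		varpush(&sp, str("set-%s", getstr(list->term)), NULL);
		varpush(&p, getstr(list->term), list->next);
	}
	/* let setlocale() see es's view of the environment, then restore */
	env = mkenv();
	environ = env->vector;
	r = setlocale(LC_ALL, "");
	environ = oe;
	if (list != NULL) {
		varpop(&p);
		varpop(&sp);
	}
	/* failure, or a no-argument call, which has no value to return */
	if (r == NULL || list == NULL)
		return NULL;
	return list->next;
}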
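For comparison, here is the quick-and-easy setenv approach written out as a standalone sketch against the real process environment, outside of es (so the hacked *env() functions are not involved). The helper name setlocale_with_env is hypothetical, and the copy of the old value is there because the pointer returned by getenv() may be invalidated by setenv().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: temporarily set NAME=VALUE in the environment,
 * re-read the locale from the environment, then restore the old value.
 * Returns 1 if setlocale() accepted the resulting settings, else 0. */
int setlocale_with_env(const char *name, const char *value) {
	const char *cur;
	char *saved = NULL;
	int ok;

	assert(name != NULL && value != NULL);
	cur = getenv(name);
	if (cur != NULL) {
		/* copy: setenv() may invalidate the getenv() pointer */
		saved = malloc(strlen(cur) + 1);
		if (saved == NULL)
			return 0;
		strcpy(saved, cur);
	}
	if (setenv(name, value, 1) != 0) {
		free(saved);
		return 0;
	}
	ok = setlocale(LC_ALL, "") != NULL;
	if (saved != NULL)
		setenv(name, saved, 1);	/* put the old value back */
	else
		unsetenv(name);		/* it was unset before */
	free(saved);
	return ok;
}
```

When setlocale() fails here, the previous locale stays in effect, which a settor function could use to reject the assignment.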

May 05 '25 16:05 jpco