Support Unicode Strings in MSL
Based on allowing Unicode strings in Modelica Language MO#3079
We should ideally also support that in MSL. As far as I can see, there are two major issues:
- [x] String handling should document that substring handling etc. only works reliably when breaking at ASCII characters (and possibly at the first UTF-8 character). To me that is primarily a documentation issue.
- [ ] File handling routines should accept file names encoded using UTF-8. (This can occur before allowing Unicode strings in models since loadResource could reference a package stored in a Unicode-named directory.)
Notes:
- The code in `ModelicaStrings.c` uses proper casts to `(unsigned char)` for `isalpha` etc., so it should work.
- Thus using `Modelica.Utilities.Strings.scanToken("\"€2ÅÄÖ\"");` does actually work already.
- If the intended use is to produce good text-tables of results and substrings are used for alignment/truncating it will fail miserably for Unicode and one would need significantly more advanced string-handling. Merely counting Unicode-characters isn't enough due to combining characters. A good work-around is to generate HTML-tables instead.
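
To make the notes above concrete, here is a minimal sketch in C. It is not code from `ModelicaStrings.c`; all function names are hypothetical, and it assumes the input is valid UTF-8. It shows the `(unsigned char)` cast before calling `isalpha`, counting code points, and finding a byte length at which a string can be cut without splitting a multi-byte sequence. It does not handle combining characters, so it is not sufficient for column alignment.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b) {
    return (b & 0xC0u) == 0x80u;
}

/* Number of code points in a valid UTF-8 string: every byte that is
   not a continuation byte starts a new code point. */
static size_t utf8_count_code_points(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (!utf8_is_continuation((unsigned char)*s)) {
            ++n;
        }
    }
    return n;
}

/* Largest byte length <= max_bytes at which s can be cut without
   splitting a multi-byte sequence (hypothetical helper). */
static size_t utf8_safe_prefix_len(const char *s, size_t max_bytes) {
    size_t len;
    assert(s != NULL);
    len = strlen(s);
    if (max_bytes >= len) {
        return len;
    }
    /* Step back while the cut would land on a continuation byte. */
    while (max_bytes > 0 && utf8_is_continuation((unsigned char)s[max_bytes])) {
        --max_bytes;
    }
    return max_bytes;
}

/* Count ASCII letters. The cast to unsigned char avoids passing a
   negative value to isalpha when char is signed; bytes >= 0x80 are
   skipped so the result does not depend on the C locale. */
static size_t count_ascii_letters(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        if (c < 0x80u && isalpha(c)) {
            ++n;
        }
    }
    return n;
}
```

For the string `€2ÅÄÖ` (10 bytes in UTF-8), `utf8_count_code_points` returns 5, and asking for a prefix of at most 5 bytes yields 4, since byte 5 lies inside `Å`.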
Can this be seen as follow-up for #3789?
I would say this issue represents the minimum, whereas #3789 extends this to other cases.
As I see it, this means that the MSL string handling functions are actually fine as they are as long as the targeted Modelica language version doesn't exceed 3.5, but that one breaks these functions by switching to a newer MLS version without making sure that the functions operate on something more sensible than bytes.
Fortunately, the string handling functions could be updated for UTF-8 already today, as they would remain valid also under the constraint that they only operate on ASCII strings. That is, they would remain compatible with the current target MLS 3.4, as well as both 3.5 and future versions. A minor concern would be that making them UTF-8 ready would encourage invalid use as long as the targeted MLS version doesn't exceed 3.5.
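
As an illustration of why a UTF-8-ready function stays valid for ASCII input, here is a rough sketch in C of a substring operation indexed by code points (1-based, inclusive, in the style of `Modelica.Utilities.Strings.substring`). The names `utf8_offset_of` and `utf8_substring` are hypothetical and not part of MSL, and the sketch assumes valid UTF-8. Because every ASCII character is exactly one byte, the result for pure ASCII strings is identical to byte-based indexing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Byte offset of the code point with 1-based index `index` in the
   valid UTF-8 string s, or strlen(s) if s has fewer code points. */
static size_t utf8_offset_of(const char *s, size_t index) {
    size_t i = 0;
    size_t seen = 0;
    assert(s != NULL && index >= 1);
    while (s[i] != '\0') {
        /* A byte that is not 10xxxxxx starts a new code point. */
        if (((unsigned char)s[i] & 0xC0u) != 0x80u) {
            ++seen;
            if (seen == index) {
                return i;
            }
        }
        ++i;
    }
    return i;
}

/* Code points start..end (1-based, inclusive) of s as a newly
   allocated string; the caller frees it. Returns an empty string
   if end < start, and NULL only if allocation fails. */
static char *utf8_substring(const char *s, size_t start, size_t end) {
    size_t from = 0;
    size_t to = 0;
    size_t n;
    char *out;
    assert(s != NULL && start >= 1);
    if (end >= start) {
        from = utf8_offset_of(s, start);
        to = utf8_offset_of(s, end + 1);
    }
    n = to - from;
    out = (char *)malloc(n + 1);
    if (out == NULL) {
        return NULL;
    }
    memcpy(out, s + from, n);
    out[n] = '\0';
    return out;
}
```

With this sketch, `utf8_substring("hello", 2, 4)` gives `ell` just as a byte-based version would, while `utf8_substring("€2ÅÄÖ", 2, 3)` gives `2Å` instead of a broken byte sequence.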
I believe this part hasn't been done yet, but I'm happy to be proven wrong:
> File handling routines should accept file names encoded using UTF-8.
Reopening.
As far as I understand, it will likely work without changes for *nix-variants.
For Windows there are two options:
- Convert UTF-8 to UCS-2 or UTF-16 (or whatever it actually is) and use the wide-character variants of the APIs.
- Or set some special flags for the application: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
With the manifest approach, it would only work on updated OSes, and only if the tool compiling the executable sets that flag. By comparison, `MultiByteToWideChar` is pretty simple to use.
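
A minimal sketch of the first option in C, assuming the file-handling routines could route their `fopen` calls through a wrapper. The name `fopen_utf8` is hypothetical and not an existing MSL function. On Windows it converts the UTF-8 path and mode with `MultiByteToWideChar` and calls `_wfopen`; elsewhere it simply forwards to `fopen`, on the assumption that the file system accepts UTF-8 byte strings.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Open a file whose name is given as a UTF-8 encoded string.
   Returns NULL on failure, like fopen. */
static FILE *fopen_utf8(const char *path, const char *mode) {
#ifdef _WIN32
    FILE *fp = NULL;
    wchar_t *wpath = NULL;
    wchar_t *wmode = NULL;
    int npath;
    int nmode;
    assert(path != NULL && mode != NULL);
    /* First pass: query the required sizes in wide characters,
       including the terminating null (input length is -1). */
    npath = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, path, -1, NULL, 0);
    nmode = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mode, -1, NULL, 0);
    if (npath <= 0 || nmode <= 0) {
        return NULL; /* invalid UTF-8 or other conversion error */
    }
    wpath = (wchar_t *)malloc((size_t)npath * sizeof(wchar_t));
    wmode = (wchar_t *)malloc((size_t)nmode * sizeof(wchar_t));
    /* Second pass: perform the actual conversion to UTF-16. */
    if (wpath != NULL && wmode != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, path, -1, wpath, npath) > 0 &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mode, -1, wmode, nmode) > 0) {
        fp = _wfopen(wpath, wmode);
    }
    free(wpath);
    free(wmode);
    return fp;
#else
    assert(path != NULL && mode != NULL);
    /* Assumption: *nix file systems take the UTF-8 bytes as they are. */
    return fopen(path, mode);
#endif
}
```

The same two-pass conversion pattern would apply to other path-taking calls (for example directory or stat functions), each of which has its own wide-character counterpart on Windows.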
@HansOlsson what do you propose as the next work plan?
@HansOlsson @MartinOtter the second part of the issue is still unaddressed; would you please look into it?
I do not have enough knowledge to have an opinion or contribute here. @HansOlsson, @sjoelund please advise on how to continue or make a pull request.