Support Unicode Strings in MSL

Open HansOlsson opened this issue 3 years ago • 9 comments

Based on allowing Unicode strings in Modelica Language MO#3079

We should ideally also support that in MSL. As far as I can see, there are two major issues:

  • [x] String handling should document that substring handling etc. only works reliably when strings are split at ASCII characters (and possibly at the first byte of a UTF-8 character). To me that is primarily a documentation issue.
  • [ ] File handling routines should accept file names encoded using UTF-8. (This can occur before allowing Unicode strings in models since loadResource could reference a package stored in a Unicode-named directory.)

Notes:

  • The code in ModelicaStrings.c uses proper casts to (unsigned char) for isalpha etc, so it should work.
  • Thus using Modelica.Utilities.Strings.scanToken("\"€2ÅÄÖ\""); does actually work already.
  • If the intended use is to produce good text-tables of results and substrings are used for alignment/truncating it will fail miserably for Unicode and one would need significantly more advanced string-handling. Merely counting Unicode-characters isn't enough due to combining characters. A good work-around is to generate HTML-tables instead.
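To illustrate the substring concern above, here is a minimal sketch in C of a byte limit that is moved back so it never lands in the middle of a UTF-8 sequence. The helper names `is_utf8_continuation` and `utf8_safe_prefix_len` are hypothetical and not part of ModelicaStrings.c, and the sketch assumes the input is valid UTF-8. It only respects code point boundaries; it does nothing about combining characters, so it would not fix the table-alignment case.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers for illustration; not part of ModelicaStrings.c. */

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
static int is_utf8_continuation(unsigned char c) {
    return (c & 0xC0u) == 0x80u;
}

/* Return the largest length n <= max_bytes such that cutting s after
 * n bytes does not split a multi-byte UTF-8 sequence.
 * Assumes s is a valid, NUL-terminated UTF-8 string. */
size_t utf8_safe_prefix_len(const char *s, size_t max_bytes) {
    size_t n = strlen(s);
    if (max_bytes < n) {
        n = max_bytes;
    }
    /* s[n] is the first byte that would be dropped. If it is a
     * continuation byte, the cut is inside a sequence, so back up. */
    while (n > 0 && is_utf8_continuation((unsigned char)s[n])) {
        n--;
    }
    return n;
}
```

For the string "€2" (bytes E2 82 AC 32), a limit of 1 or 2 bytes is reduced to 0, while a limit of 3 keeps the whole "€". For pure ASCII input the limit is returned unchanged.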

HansOlsson avatar Feb 10 '22 09:02 HansOlsson

Can this be seen as follow-up for #3789?

beutlich avatar Feb 10 '22 15:02 beutlich

I would say this issue represents the minimum, whereas #3789 extends this to other cases.

HansOlsson avatar Feb 10 '22 16:02 HansOlsson

As I see it, this means that the MSL string handling functions are actually fine as they are as long as the targeted Modelica language version doesn't exceed 3.5, but that one breaks these functions by switching to a newer MLS version without making sure that the functions operate on something more sensible than bytes.

Fortunately, the string handling functions could be updated for UTF-8 already today, as they would remain valid also under the constraint that they only operate on ASCII strings. That is, they would remain compatible with the current target MLS 3.4, as well as both 3.5 and future versions. A minor concern would be that making them UTF-8 ready would encourage invalid use as long as the targeted MLS version doesn't exceed 3.5.
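As a rough sketch of why a UTF-8 aware function stays valid for ASCII-only strings, consider a length function that counts code points instead of bytes. The name `utf8_length` is hypothetical (it is not an existing MSL function), and the sketch assumes valid UTF-8 input. It counts code points, not user-perceived characters, so combining characters are still counted separately.

```c
#include <stddef.h>

/* Hypothetical illustration; not an existing MSL function.
 * Counts Unicode code points in a valid, NUL-terminated UTF-8 string
 * by counting every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_length(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0u) != 0x80u) {
            count++;
        }
    }
    return count;
}
```

ASCII bytes are never continuation bytes, so for ASCII-only input the result is identical to `strlen`. For "ÅÄÖ" (six bytes in UTF-8) it returns 3.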

henrikt-ma avatar Feb 10 '22 22:02 henrikt-ma

I believe this part hasn't been done yet, but I'm happy to be proven wrong:

File handling routines should accept file names encoded using UTF-8.

Reopening.

maltelenz avatar Mar 03 '23 12:03 maltelenz

As far as I understand it will likely work without changes for *nix-variants.

For Windows there are two options:

  • Convert UTF-8 to UCS-2 or UTF-16 (or whatever it actually is) and use the wide-character variants of the APIs.
  • Or set some special flags for the application: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

HansOlsson avatar Mar 03 '23 14:03 HansOlsson

With the manifest approach, it would only work on updated OSes, and only if the tool compiling the executable sets that flag. MultiByteToWideChar is pretty simple to use.
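A minimal sketch of the conversion approach, assuming a wrapper around `fopen`. The name `fopen_utf8` is hypothetical and is not how the MSL C sources are currently structured. On Windows it converts the UTF-8 path and mode with `MultiByteToWideChar` and calls `_wfopen`; elsewhere it passes the bytes straight to `fopen`, on the assumption that the file system accepts UTF-8 names.

```c
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/* Hypothetical wrapper for illustration; not part of the MSL sources.
 * Opens a file whose name is given as a UTF-8 encoded string. */
FILE *fopen_utf8(const char *path, const char *mode) {
#ifdef _WIN32
    FILE *fp = NULL;
    wchar_t *wpath = NULL;
    wchar_t *wmode = NULL;
    /* First pass: ask for the required sizes in wide characters,
     * including the terminating NUL (input length -1). */
    int npath = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    path, -1, NULL, 0);
    int nmode = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    mode, -1, NULL, 0);
    if (npath <= 0 || nmode <= 0) {
        return NULL; /* invalid UTF-8 or other conversion failure */
    }
    wpath = (wchar_t *)malloc((size_t)npath * sizeof(wchar_t));
    wmode = (wchar_t *)malloc((size_t)nmode * sizeof(wchar_t));
    /* Second pass: do the actual conversion to UTF-16. */
    if (wpath != NULL && wmode != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            path, -1, wpath, npath) > 0 &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            mode, -1, wmode, nmode) > 0) {
        fp = _wfopen(wpath, wmode);
    }
    free(wpath);
    free(wmode);
    return fp;
#else
    /* Assumption: on *nix-variants the name is passed through as bytes. */
    return fopen(path, mode);
#endif
}
```

A call such as `fopen_utf8("ÅÄÖ/data.txt", "r")` would then behave the same on both platforms, provided the source string really is UTF-8 encoded. Other file routines (for example `stat` or `remove`) would need analogous wide-character wrappers on Windows.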

sjoelund avatar Mar 03 '23 14:03 sjoelund

@HansOlsson what do you propose as the next work plan?

TManikantan avatar Apr 21 '23 03:04 TManikantan

@HansOlsson @MartinOtter the second part of the issue is still unaddressed; would you please look into it?

TManikantan avatar May 25 '23 11:05 TManikantan

I don't have enough knowledge to have an opinion or contribute here. @HansOlsson, @sjoelund please advise on how to continue or make a pull request.

MartinOtter avatar Jun 15 '23 08:06 MartinOtter