Support Unicode?
So this is a longshot and might not be easy due to Lua, but, I am trying to write a program that parses text files that are made up of, among other things, Greek letters. I had to write a hack to allow myself to parse λs by reading two bytes out of a string... but... I can't write hacks for the general case.
Is it possible to make it effortless to use Unicode characters in strings? Or at least, if not effortless, easy? I honestly have no knowledge of what would need to be done to support this.
For reference, this is the project I've been working on: https://git.sci4me.com/sci4me/lambda_calculus
EDIT: Really I just need to use the UTF-8 library, so, ... my concern with that is that it's not supported on LuaJIT. (And I have no idea how to do it and it seems nontrivial to change all of my code to support it but idk)
Hello,
Unicode, or even just UTF-8, is a bit of a nightmare. Before Lua 5.3 was released I wrote my own UTF-8 solution: lua-utf8. It is pure Lua code. It mainly splits the string into a table (each item is one character) and wraps that in an object you can use like a Lua string: see https://github.com/tst2005/lua-utf8#sample-of-use. My solution is not the only one that exists, but if your main problem is just getting each UTF-8 character, it should work! Regards,
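For illustration, the "split each character into a table" idea could look roughly like the sketch below. This is not the actual lua-utf8 code; `utf8_chars` is a hypothetical name, and the function only checks the lead byte and the sequence length, not the continuation bytes. It sticks to decimal escapes and `string.byte`/`string.sub`, so it should run on Lua 5.1, LuaJIT, and Lua 5.3 alike.

```lua
-- Hypothetical helper: split a UTF-8 encoded string into a table with
-- one entry per character (code point). Not the lua-utf8 implementation.
local function utf8_chars(s)
  local chars = {}
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len
    -- The lead byte tells us how many bytes the sequence occupies.
    if b < 0x80 then
      len = 1
    elseif b >= 0xC2 and b <= 0xDF then
      len = 2
    elseif b >= 0xE0 and b <= 0xEF then
      len = 3
    elseif b >= 0xF0 and b <= 0xF4 then
      len = 4
    else
      error("invalid UTF-8 lead byte at position " .. i)
    end
    if i + len - 1 > n then
      error("truncated UTF-8 sequence at position " .. i)
    end
    chars[#chars + 1] = s:sub(i, i + len - 1)
    i = i + len
  end
  return chars
end

-- "\206\187" is the two-byte UTF-8 encoding of λ (U+03BB).
local chars = utf8_chars("\206\187x.x")
print(#chars)
```

With a table like this, a parser can compare `chars[1] == "\206\187"` instead of peeking at two bytes by hand.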
@tst2005 Thank you for that. I will try to use your code and see if I can solve my problem.
In the context of Urn, what I would love to see is the same thing that was done for the bit library, but for the utf8 library: an Urn implementation that is used if the utf8 library (from Lua 5.3) isn't available. It might be a ton to ask for, but it would make the code VM-agnostic while keeping the potential performance benefit when Lua's utf8 library is available.
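As a rough idea of what such a fallback could look like, here is a sketch in plain Lua rather than Urn. The names `decode_fallback` and `codepoints` are made up for this example and are not part of Urn or of any existing library. It uses the native `utf8.codes` when the `utf8` table exists (Lua 5.3+) and a pure-Lua decoder otherwise. The fallback uses arithmetic instead of bit operations so it does not depend on any bit library, and it does not reject overlong encodings or surrogates.

```lua
-- nil on Lua 5.1, Lua 5.2 and LuaJIT; the standard utf8 table on Lua 5.3+.
local native = rawget(_G, "utf8")

-- Hypothetical pure-Lua decoder: returns a table of code points.
local function decode_fallback(s)
  local out, i, n = {}, 1, #s
  while i <= n do
    local b = s:byte(i)
    local cp, extra
    if b < 0x80 then
      cp, extra = b, 0
    elseif b >= 0xC2 and b <= 0xDF then
      cp, extra = b - 0xC0, 1
    elseif b >= 0xE0 and b <= 0xEF then
      cp, extra = b - 0xE0, 2
    elseif b >= 0xF0 and b <= 0xF4 then
      cp, extra = b - 0xF0, 3
    else
      error("invalid UTF-8 lead byte at position " .. i)
    end
    for j = i + 1, i + extra do
      local c = s:byte(j)
      if not c or c < 0x80 or c > 0xBF then
        error("invalid UTF-8 continuation byte at position " .. j)
      end
      -- Each continuation byte contributes its low 6 bits.
      cp = cp * 64 + (c - 0x80)
    end
    out[#out + 1] = cp
    i = i + extra + 1
  end
  return out
end

-- Hypothetical wrapper: prefer the native library, else fall back.
local function codepoints(s)
  if native then
    local out = {}
    for _, cp in native.codes(s) do
      out[#out + 1] = cp
    end
    return out
  end
  return decode_fallback(s)
end

-- "\206\187" is λ (U+03BB, decimal 955).
print(codepoints("\206\187")[1])
```

Callers only ever see `codepoints`, so which path is taken stays an implementation detail of the VM the code happens to run on.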
@sci4me Just be aware that my utf8 module doesn't have the same API as the Lua 5.3 utf8 library. The Lua 5.3 utf8 library is a (very) low-level set of functions. My module is a high-level abstraction that tries to follow the string module API.
My TODO for lua-utf8 is:
- rename the module to avoid conflict
- internally use the native lua 5.3 utf8 functions if available
I'm also working on a "universal bit library", but it is much harder than utf8 because there are different implementations with different APIs. I started to compare them (which functions have the same name, a different name, or are missing: see https://github.com/tst2005/lua-mini/blob/dev/BIT.md). I also see some differences between the existing implementations:
- the supported size (32 bits vs 64 bits): Lua 5.3's bit32 is 32 bits; LuaJIT's bitop depends on the VM architecture and can be 64 bits; for Lua 5.3's native operators, I don't know
- (maybe) the behavior when the limit is reached
- other things that I don't remember
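To make the API differences above a bit more concrete, here is a minimal sketch of picking a bitwise AND from whatever backend the VM offers. `pick_band` and `band_arith` are hypothetical names invented for this example, and the probing order is just one possible choice. It tries the `bit` module (LuaJIT / Lua BitOp), then `bit32` (Lua 5.2), then the native `&` operator compiled at run time (Lua 5.3+), and finally falls back to plain arithmetic that only handles non-negative integers.

```lua
-- Hypothetical arithmetic fallback; only valid for non-negative integers.
local function band_arith(a, b)
  local result, bitval = 0, 1
  while a > 0 and b > 0 do
    if a % 2 == 1 and b % 2 == 1 then
      result = result + bitval
    end
    a = (a - a % 2) / 2
    b = (b - b % 2) / 2
    bitval = bitval * 2
  end
  return result
end

-- Hypothetical selector: returns a band function and the backend's name.
local function pick_band()
  local ok, lib = pcall(require, "bit")      -- LuaJIT or Lua BitOp
  if ok and type(lib) == "table" and lib.band then
    return lib.band, "bit"
  end
  ok, lib = pcall(require, "bit32")          -- Lua 5.2
  if ok and type(lib) == "table" and lib.band then
    return lib.band, "bit32"
  end
  -- The & operator is a syntax error before Lua 5.3, so it has to be
  -- compiled from a string instead of appearing in this file directly.
  local loader = loadstring or load
  local chunk = loader("return function(a, b) return a & b end")
  if chunk then
    return chunk(), "native"
  end
  return band_arith, "arithmetic"
end

local band, backend = pick_band()
print(backend, band(12, 10))
```

Small inputs like these give the same answer everywhere. The backends diverge at the limits mentioned above: for example, LuaJIT's `bit.band(0xFFFFFFFF, 0xFFFFFFFF)` yields a signed 32-bit result (-1), while `bit32.band` returns the unsigned value 4294967295, so a universal wrapper would also need to normalize results.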