Invalid parsing of unicode in strings
Hello,
we spotted several .mat files that use Unicode in strings, such as this one:
https://github.com/SysBioChalmers/Fruitfly-GEM/blob/main/model/Fruitfly-GEM.mat
Strings in these files contain non-ASCII characters such as α and β, which unfortunately are read back as something like:
"\x03-Est1"
"\x03-Est10"
"\x03-Est2"
"\x03-Est3"
"\x03-Est4"
"\x03-Est5"
"\x03-Est6"
(this is in the "genes" sub-array).
Is there any way to specify the decoding of strings or any other way to fix this?
Thank you!
-mk
cc: @htpusa @laurentheirendt
Note that this is purely a MAT v7 thing for certain characters like α and β. Saving with save('test.mat', 'str', '-v7.3') works fine. Saving with save('test.mat', 'str', '-v6') gives an error in MATLAB ("Found characters the default encoding is unable to represent.").
The issue is here in MAT_v5.jl:305

    if 255 < convert(UInt32, char)
        # Newer versions of MATLAB seem to write some mongrel UTF-8...
        char = String([truncate_to_uint8(chars[i] >> 8), truncate_to_uint8(chars[i])])[1]
    end

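It looks like it's dropping the low 8 bits? The code unit is split into two bytes and only the first character of the resulting string is kept, so for α (U+03B1) only the high byte 0x03 survives, which matches the "\x03" in the output above.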
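
To make the byte-level behaviour concrete, here is a minimal sketch that reproduces the "\x03" result outside of MAT.jl and compares it with decoding the same data as UTF-16. The sample code units are made up from the "α-Est1" gene name above, and the masking with 0xff is only a stand-in for the library's truncate_to_uint8; it is an assumption that this matches what the reader actually sees in the file.

```julia
# UTF-16 code units for "α-Est1", as they might appear in the file
# (illustrative data, not read from the actual .mat file).
units = UInt16[0x03b1, 0x002d, 0x0045, 0x0073, 0x0074, 0x0031]

# Mimic the quoted branch: split the code unit into a high and a low byte,
# build a String from the two bytes, and keep only its first character.
hi = UInt8(units[1] >> 8)      # 0x03
lo = UInt8(units[1] & 0xff)    # 0xb1
broken = String([hi, lo])[1]   # '\x03', the low byte is lost

# Treat the data as UTF-16 instead and convert it to a UTF-8 String.
fixed = transcode(String, units)   # "α-Est1"

println(repr(broken))
println(fixed)
```

If the characters really are stored as UTF-16 code units, something along the lines of the transcode call may be all that is needed for this branch.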
I'm not very familiar with char encodings, but the MATLAB documentation says dtype=17 contains "Unicode UTF-16 Encoded Character Data", so I think the logic here must be tweaked slightly to account for dtype=17 separately.
For a char variable, the encoding differs between v6 and v7: v6 uses uint16 and v7 uses UTF-8. For a string variable, the actual data content in cell 3 uses UTF-16 by default. I think the '1' after the uint64 tag and data length is the UTF-16 marker, with a default value of '1'.
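
One way to handle the different encodings mentioned in this thread would be to pick the decoder from the data type tag. The sketch below is only an illustration: decode_mat_chars is a hypothetical helper, not part of MAT.jl. The tag values follow the MAT-file Level 5 format documentation. Treating uint16 data as UTF-16 code units and assuming little-endian data on a little-endian host are assumptions made here.

```julia
# Data type tags as listed in the MAT-file Level 5 format documentation.
const miUINT16 = 4
const miUTF8   = 16
const miUTF16  = 17

# Hypothetical helper (not part of MAT.jl): decode the raw bytes of a
# character data element according to its data type tag.
# Assumes little-endian data read on a little-endian host.
function decode_mat_chars(dtype::Integer, bytes::Vector{UInt8})
    if dtype == miUTF8
        # The bytes are already UTF-8; copy so the caller's buffer is untouched.
        return String(copy(bytes))
    elseif dtype == miUTF16 || dtype == miUINT16
        # Assumption: uint16 character data can be read as UTF-16 code units.
        units = collect(reinterpret(UInt16, bytes))
        return transcode(String, units)
    else
        error("unsupported character data type tag: $dtype")
    end
end

println(decode_mat_chars(miUTF16, UInt8[0xb1, 0x03, 0x2d, 0x00]))  # α-
println(decode_mat_chars(miUTF8,  UInt8[0xce, 0xb1, 0x2d]))        # α-
```

Both calls decode the same two characters, once from UTF-16 and once from UTF-8 input, which is the distinction the reader would need to make per element.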