MAT.jl

Invalid parsing of unicode in strings

Open exaexa opened this issue 3 years ago • 3 comments

Hello,

We spotted several .mat files that use Unicode in strings, such as this one:

https://github.com/SysBioChalmers/Fruitfly-GEM/blob/main/model/Fruitfly-GEM.mat

Strings in these files that contain non-ASCII characters such as α and β are unfortunately read back as something like:

 "\x03-Est1"
 "\x03-Est10"
 "\x03-Est2"
 "\x03-Est3"
 "\x03-Est4"
 "\x03-Est5"
 "\x03-Est6"

(this is in the "genes" sub-array).

Is there any way to specify the decoding of strings or any other way to fix this?

Thank you!

-mk

cc: @htpusa @laurentheirendt

exaexa, Jan 26 '23 08:01

Note that this is purely a MAT v7 issue for certain characters like α and β. Saving with save('test.mat', 'str', '-v7.3') works fine. Saving with save('test.mat', 'str', '-v6') produces an error in MATLAB ("Found characters the default encoding is unable to represent.").

matthijscox, Nov 21 '25 14:11

The issue is here in MAT_v5.jl:305

if 255 < convert(UInt32, char)
    # Newer versions of MATLAB seem to write some mongrel UTF-8...
    char = String([truncate_to_uint8(chars[i] >> 8), truncate_to_uint8(chars[i])])[1]
end

It looks like it's dropping the last 8 bits?
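As a minimal sketch of what that branch does to α, the snippet below replays it on a single value. It assumes the stored 16-bit value for α is its UTF-16 code unit 0x03B1, and it defines its own stand-in for truncate_to_uint8 (assumed to keep only the low byte), since the real helper is internal to MAT.jl.

```julia
# Stand-in for MAT.jl's internal helper (assumption: it keeps only the low byte).
truncate_to_uint8(x::UInt16) = x % UInt8

# Assumed stored value for α: its UTF-16 code unit.
unit = UInt16('α')                                               # 0x03b1

# Same construction as the quoted branch: high byte, then low byte.
bytes = [truncate_to_uint8(unit >> 8), truncate_to_uint8(unit)]  # [0x03, 0xb1]

# The two bytes are reinterpreted as UTF-8 and only the first character is kept,
# so the low byte 0xb1 never makes it into the result.
char = String(copy(bytes))[1]

println(repr(char))  # '\x03'
```

Under these assumptions the result is '\x03', which matches the "\x03-Est1" strings reported above.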

I'm not very familiar with char encodings, but the MATLAB documentation says dtype=17 contains Unicode UTF-16 encoded character data, so I think the logic here must be tweaked slightly to account for dtype=17 separately.
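One possible shape for such a tweak, sketched under the assumption that a dtype=17 payload has already been read into a vector of UInt16 code units: Julia's Base.transcode can convert UTF-16 code units to a String directly. The helper name decode_utf16 is hypothetical and not part of MAT.jl, and this has not been tested against real .mat files.

```julia
# Hypothetical helper, not part of MAT.jl: decode UTF-16 code units to a String.
# Base.transcode performs the UTF-16 to UTF-8 conversion.
decode_utf16(units::AbstractVector{UInt16}) = transcode(String, collect(units))

# Code units for "α-Est1", assuming the file stores plain UTF-16.
units = UInt16[0x03b1, 0x002d, 0x0045, 0x0073, 0x0074, 0x0031]

decoded = decode_utf16(units)
println(decoded)  # α-Est1
```

Decoding the whole vector at once, rather than character by character, also keeps surrogate pairs together.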

foreverallama, Nov 29 '25 15:11

For Char-type variables, the encoding differs between V6 and V7: V6 uses uint16 and V7 uses UTF-8. For String-type variables, the actual data content in Cell 3 uses UTF-16 by default. I think the '1' after the Uint64 tag and data length is the UTF-16 marker, with a default value of '1'.

MegaShark1911, Nov 30 '25 03:11