
Unable to convert string to the requested encoding when reading sav files with long strings

Open ofajardo opened this issue 4 years ago • 6 comments

Hi,

When reading a sav file that contains a long string with an international character (756 characters to be precise; with 755 the error does not show up), ReadStat gives the error:

Unable to convert string to the requested encoding (invalid byte sequence)

Attached is an example sav file. The sav file was produced with pyreadstat.

thanks in advance!

original report: https://github.com/Roche/pyreadstat/issues/128

Note: I initially reported that the error occurred on writing, but it occurs on reading!

eg.sav.zip

Also attached is a CSV version of the file:

eg.csv

ofajardo, Apr 23 '21 10:04

Another observation: a very similar file that differs by only one character (first variable name "aaaaa3" instead of "aaaaa2") does not raise the error. Example file attached: eg3.sav.zip

ofajardo, Apr 23 '21 12:04

Are UTF-8 strings being provided to the writer?

evanmiller, Apr 23 '21 19:04

Yes

ofajardo, Apr 23 '21 19:04

As mentioned in #260, it is possible to reproduce this error without any international character (using only 'a's in this example) if the length of the string is at least 757. Another important condition for reproducing it is that the numerical values must be NaNs. See #260 for C code to reproduce the issue.

ofajardo, Dec 15 '21 16:12
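
One possible reading of the two thresholds reported in this thread (756 characters with an international character, 757 with plain 'a's) is that both correspond to the same UTF-8 byte length. The sketch below only checks that arithmetic. It assumes the international character is a two-byte one such as 'é' (the thread does not say which character was used), and it assumes the 252-byte segment size that PSPP's description of the system file format gives for very long strings. Neither assumption is confirmed in this thread, and `utf8_len`, `segments`, and `SEGMENT_BYTES` are hypothetical names used for illustration, not part of ReadStat or pyreadstat.

```python
import math

# Assumption: usable bytes per very-long-string segment, taken from PSPP's
# notes on the system file format; not confirmed anywhere in this thread.
SEGMENT_BYTES = 252


def utf8_len(s: str) -> int:
    """Length of the string in bytes once encoded as UTF-8."""
    return len(s.encode("utf-8"))


def segments(n_bytes: int) -> int:
    """Number of segments needed to hold n_bytes under the assumed size."""
    return math.ceil(n_bytes / SEGMENT_BYTES)


# Assumption: 'é' stands in for the unspecified international character.
cases = {
    "755 chars, one 'é' (reported OK)": "é" + "a" * 754,
    "756 chars, one 'é' (reported error)": "é" + "a" * 755,
    "756 chars, ASCII only": "a" * 756,
    "757 chars, ASCII only (reported error)": "a" * 757,
}

for label, s in cases.items():
    n = utf8_len(s)
    print(f"{label}: {len(s)} chars, {n} bytes, {segments(n)} segments")
```

Under those assumptions, both failing cases are 757 bytes long and would need a fourth segment, while both passing cases fit in three. This does not account for the dependence on the variable name or on NaN values reported above.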