On Windows, prompts cannot contain Chinese because of pathlib read_text
Description
On Windows, prompts cannot contain Chinese characters, because pathlib's `read_text` falls back to the system default encoding (GBK on Chinese-locale Windows) when reading the prompt files.
Proposed Changes
Pass `encoding="utf-8"` to the pathlib `read_text` calls.
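A minimal sketch of the proposed change, for illustration only. The helper name `load_prompt` and the temporary file are hypothetical and do not correspond to actual graphrag call sites; only `Path.read_text(encoding=...)` is the real pathlib API.

```python
from pathlib import Path
import tempfile


def load_prompt(path: Path) -> str:
    """Hypothetical helper: read a prompt file as UTF-8 regardless of OS locale."""
    # Without encoding=..., read_text uses the locale's preferred encoding,
    # which can be GBK on Chinese-locale Windows.
    return path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "prompt.txt"
    # Simulate a prompt file that was written as UTF-8.
    prompt_file.write_text("你好，世界", encoding="utf-8")
    text = load_prompt(prompt_file)
```

With the explicit encoding, the same file should read back identically on Windows, Linux, and macOS.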
Checklist
- [x] I have tested these changes locally.
- [x] I have reviewed the code changes.
- [x] I have updated the documentation (if necessary).
- [x] I have added appropriate unit tests (if applicable).
@microsoft-github-policy-service agree
Thanks for the contribution @liseri! Should `errors` be set here as well? And do you think it would be more appropriate/helpful here to have `errors='strict'` (which would raise an exception and exit when non-UTF-8 data is encountered) or `errors='ignore'`?
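To make the two options concrete, here is a small sketch of how they behave on a file that is not valid UTF-8. The file name and the GBK-encoded sample content are assumptions chosen for illustration; `errors` is a real parameter of `Path.read_text`.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    bad_file = Path(tmp) / "prompt.txt"
    # "中文" encoded as GBK is not a valid UTF-8 byte sequence.
    bad_file.write_bytes("中文".encode("gbk"))

    # errors="strict" (the default) raises on the first invalid byte.
    try:
        bad_file.read_text(encoding="utf-8", errors="strict")
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors="ignore" silently drops bytes that cannot be decoded,
    # so the original text is lost without any warning.
    ignored = bad_file.read_text(encoding="utf-8", errors="ignore")
```

The trade-off is that `strict` fails loudly on a badly encoded prompt, while `ignore` keeps running with silently altered text.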
This is a common bug on Windows systems in China. It has several causes, which can be summarized as follows: GBK is widely used as a general-purpose encoding, and the underlying Win32 interface in Windows defaults to an encoding taken from a system setting. That setting can be changed to another encoding, but doing so may cause some programs to crash, because not every program goes through this Win32 interface when it runs. The issue is usually resolved by having upstream frameworks specify an encoding parameter or by reading the file in binary mode directly.
In this case, binary reading should be used instead of file stream reading.
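A rough sketch of what binary reading could look like, assuming the prompt files are UTF-8. The helper name `load_prompt_bytes` and the sample file are hypothetical; `Path.read_bytes` and `bytes.decode` are standard library calls.

```python
from pathlib import Path
import tempfile


def load_prompt_bytes(path: Path) -> str:
    """Hypothetical helper: read raw bytes, then decode explicitly as UTF-8."""
    raw = path.read_bytes()  # no locale-dependent decoding happens here
    return raw.decode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "prompt.txt"
    prompt_file.write_bytes("第一行\r\n第二行".encode("utf-8"))
    decoded = load_prompt_bytes(prompt_file)
```

One difference from `read_text` worth noting: binary reading performs no newline translation, so a `\r\n` in the file stays `\r\n` in the decoded string.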
Setting the encoding to UTF-8 is sufficient. The prompt files are generated by graphrag itself in UTF-8, so making `read_text` read them as UTF-8 guarantees correct results. If the encoding is not passed to `read_text`, it may fall back to the operating system's default character encoding (such as GBK on Windows systems).
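The failure mode can be reproduced on any platform by decoding UTF-8 bytes as GBK by hand. This is only an illustration with made-up sample text; the encoding reported by `locale.getpreferredencoding` depends on the machine it runs on.

```python
import locale

# The encoding that text-mode reads fall back to when none is given.
default_encoding = locale.getpreferredencoding(False)
print("locale default encoding:", default_encoding)

# UTF-8 bytes, as a prompt file written in UTF-8 would contain.
data = "你好".encode("utf-8")

# Decoding them as GBK either raises or produces unrelated characters.
try:
    as_gbk = data.decode("gbk")
except UnicodeDecodeError:
    as_gbk = None

as_utf8 = data.decode("utf-8")
```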
@liseri what are your thoughts on @glide-the's comment? We could replace this with binary reading.
I think that solution is better. I just chose the simplest approach, as I didn't want to implement something too complicated. As long as the issue gets resolved, I'm open to any solution. I'll go ahead and close this pull request later.
@liseri why close this ticket? It is a bigger issue for Windows users...