graphrag icon indicating copy to clipboard operation
graphrag copied to clipboard

In windows prompts cannot be chinese refer to pathlib read_text

Open liseri opened this issue 1 year ago • 6 comments

Description

In windows , prompts cannot be chinese , because pathlib read_text default to use gbk to read prompts file;

Proposed Changes

pathlib read_text add encoding=utf-8 params

Checklist

  • [ *] I have tested these changes locally.
  • [ *] I have reviewed the code changes.
  • [ *] I have updated the documentation (if necessary).
  • [ *] I have added appropriate unit tests (if applicable).

liseri avatar Jul 05 '24 10:07 liseri

@microsoft-github-policy-service agree

@microsoft-github-policy-service agree

liseri avatar Jul 05 '24 10:07 liseri

Thanks for the contribution @liseri! Should errors be set here as well? And do you think it would be more appropriate/helpful here to have errors=‘strict’ (which would raise an exception and exit when non-utf8 is encountered) or errors=‘ignore’?

jgbradley1 avatar Jul 05 '24 19:07 jgbradley1

Thanks for the contribution @liseri! Should errors be set here as well? And do you think it would be more appropriate/helpful here to have errors=‘strict’ (which would raise an exception and exit) or errors=‘ignore’?

This is a common bug with the Windows system in China, caused by various factors, primarily summarized as follows: GPK, as a universal code, is widely used, but the Win32 interface at the bottom of Windows defaults to using system variables. This variable can be modified to other encodings, but it may cause some programs to crash because some programs do not use this Win32 interface when running. This issue is mostly resolved by upstream frameworks specifying an encoding parameter or by directly using binary reading.

In this case, binary reading should be used instead of file stream reading.

glide-the avatar Jul 06 '24 10:07 glide-the

Thanks for the contribution @liseri! Should errors be set here as well? And do you think it would be more appropriate/helpful here to have errors=‘strict’ (which would raise an exception and exit when non-utf8 is encountered) or errors=‘ignore’?

It is sufficient to set the utf-8 parameter ; because the prompts file is generated by graphrag itself in utf-8, ensuring that read_text reads in utf-8 will guarantee correctness; if the utf-8 parameter is not set for read_text, it may read using the operating system’s default character encoding (such as gbk encoding on Windows systems).

liseri avatar Jul 08 '24 02:07 liseri

Thanks for the contribution @liseri! Should errors be set here as well? And do you think it would be more appropriate/helpful here to have errors=‘strict’ (which would raise an exception and exit when non-utf8 is encountered) or errors=‘ignore’?

It is sufficient to set the utf-8 parameter ; because the prompts file is generated by graphrag itself in utf-8, ensuring that read_text reads in utf-8 will guarantee correctness; if the utf-8 parameter is not set for read_text, it may read using the operating system’s default character encoding (such as gbk encoding on Windows systems).

liseri avatar Jul 08 '24 02:07 liseri

@liseri what are your thoughts on @glide-the 's comment? We could replace this by binary reading

AlonsoGuevara avatar Jul 09 '24 00:07 AlonsoGuevara

@liseri what are your thoughts on @glide-the 's comment? We could replace this by binary reading

I think that solution is better; I just chose the simplest approach, as I didn’t want to implement something too complicated; as long as the issue can be resolved, I’m open to any solution; I’ll go ahead and close this pull later.

liseri avatar Jul 09 '24 04:07 liseri

@liseri why close this ticket? it is a bigger issue for windows user...

KylinMountain avatar Jul 21 '24 00:07 KylinMountain