
Cannot search Chinese content

Open soulyet opened this issue 1 year ago • 1 comments

I have imported some Chinese content into the vector db, but it cannot be searched. English content works without problems. Is Chinese content search not supported?

soulyet avatar Sep 01 '24 02:09 soulyet

hmm... I'll have to look into this. Thanks for reporting it. :)

crpietschmann avatar Sep 24 '24 13:09 crpietschmann

I added some unit tests, and they appear to pass without error. Unless you have more information about what isn't working, Chinese content search works as far as I can tell.

crpietschmann avatar Feb 22 '25 23:02 crpietschmann

Hmm... I have found that search queries against the BasicMemoryVectorDatabase aren't reliable when the documents contain Chinese text. I added some unit tests, but then commented them out because they pass or fail randomly from run to run. For now, searching Chinese text is unreliable.

It seems to me that the reason is how the BasicMemoryVectorDatabase generates the vectors for the text. It splits the text on spaces and uses the resulting words to build the vocabulary dictionary. That doesn't work for Chinese, which is written without spaces between words.
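To illustrate the point about splitting on spaces, here is a minimal sketch in Python. It is not SharpVector's code, and it is not necessarily what the eventual fix does. The function names and the choice to treat each character in the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) as its own token are assumptions made only for this example.

```python
import re

# Assumption for this sketch: only the basic CJK Unified Ideographs block
# is treated as "Chinese"; punctuation and extension blocks are ignored.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def whitespace_tokens(text: str) -> list[str]:
    """Tokenize by splitting on whitespace only."""
    return text.lower().split()


def cjk_aware_tokens(text: str) -> list[str]:
    """Split on whitespace, then emit each CJK character as its own token."""
    tokens: list[str] = []
    for chunk in text.lower().split():
        buffer = ""
        for ch in chunk:
            if _CJK.match(ch):
                # Flush any pending non-CJK run, then emit the character.
                if buffer:
                    tokens.append(buffer)
                    buffer = ""
                tokens.append(ch)
            else:
                buffer += ch
        if buffer:
            tokens.append(buffer)
    return tokens


document = "我喜欢向量数据库"  # "I like vector databases"
query = "向量"                # "vector"

# Whitespace splitting sees the whole sentence as one vocabulary entry,
# so the query shares no token with the document.
shared_whitespace = set(whitespace_tokens(document)) & set(whitespace_tokens(query))

# Character-level tokens give the query something to match against.
shared_cjk = set(cjk_aware_tokens(document)) & set(cjk_aware_tokens(query))

print(whitespace_tokens(document))
print(shared_whitespace)
print(shared_cjk)
```

With whitespace splitting, the document and the query have no vocabulary entries in common, so any similarity score built on that vocabulary has nothing to work with. Character-level tokens are a crude substitute for real Chinese word segmentation, but they are enough to show why the two tokenizers behave differently.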

Something that should work is to use the Build5Nines.SharpVector.OpenAI library to integrate with OpenAI or Azure OpenAI service and use a model like text-embedding-ada-002 to generate the vectors. This should work as expected, since that model will handle generating the embeddings for Chinese text better.

Thanks for reporting this. I'll have to table this issue for now, until a solution can be determined for the BasicMemoryVectorDatabase implementation.

crpietschmann avatar Feb 22 '25 23:02 crpietschmann

Actually, I was able to figure out a solution that seems to fix the issues with Chinese language/character support. This will be in the next release.

crpietschmann avatar Feb 23 '25 00:02 crpietschmann

This has been released in v2.0.0

crpietschmann avatar Feb 23 '25 14:02 crpietschmann