.Net: Memory Plugin (SQLite) Recall function cannot find relevant memory if search text is not English
Describe the bug Using the SQLite memory plugin, the Recall function cannot find any relevant memory when the search text is not in English, even if there is an exact match. In my case, I was searching in Thai.
To Reproduce Steps to reproduce the behavior:
- Use SQLite memory plugin.
- Store Thai language data in Metadata.
- Call Recall function to find relevant memory with Thai language text.
- You will not get any relevant memory.
Expected behavior After the search, it should return the stored data, since there is an exact match.
Platform
- OS: Windows 11
- IDE: Visual Studio Enterprise 2022 (64-bit) - Current Version 17.9.6
- Language: C#
- Source: Microsoft.SemanticKernel 1.9.0, Microsoft.SemanticKernel.Connectors.Sqlite 1.6.3-alpha, Microsoft.SemanticKernel.Plugins.Memory 1.6.3-alpha
Code
var result = await _kernel.InvokeAsync(_memoryPlugin["Recall"], new KernelArguments()
{
    [TextMemoryPlugin.InputParam] = "สำหรับรายละเอียด",
    [TextMemoryPlugin.CollectionParam] = memoryCollectionName,
    [TextMemoryPlugin.LimitParam] = numberOfMemories,
    [TextMemoryPlugin.RelevanceParam] = "0.40",
});
In my case, Bengali is also not working. Is there a fix?
@atiq-bs23, @Juhan-Hossain, I am not able to reproduce this behavior with Sqlite. I added the following test to the SqliteMemoryStoreTests unit test set and it passes. No character escaping is happening on get or search. Do you have a repro that you can share?
[Fact]
public async Task StoreAndRetrieveNonLatinScript()
{
    // Arrange
    using SqliteMemoryStore db = await SqliteMemoryStore.ConnectAsync(DatabaseFile);
    string collectionName = "test_collection" + this._collectionNum;
    this._collectionNum++;

    MemoryRecord testRecord = MemoryRecord.LocalRecord(
        id: "test-Thai",
        text: "วรรณยุกต์",
        description: "วรรณยุกต์",
        embedding: new float[] { 1, 1, 1 });
    _ = await db.UpsertAsync(collectionName, testRecord);

    testRecord = MemoryRecord.LocalRecord(
        id: "test-Bengali",
        text: "চলিতভাষা",
        description: "চলিতভাষা",
        embedding: new float[] { -1, -1, -1 });
    _ = await db.UpsertAsync(collectionName, testRecord);

    // Act
    var thaiGetActual = await db.GetAsync(collectionName, "test-Thai");
    var bengaliGetActual = await db.GetAsync(collectionName, "test-Bengali");
    var thaiSearchActual = db.GetNearestMatchesAsync(collectionName, new float[] { 1, 1, 1 }, limit: 1, minRelevanceScore: 1).ToEnumerable().ToArray();
    var bengaliSearchActual = db.GetNearestMatchesAsync(collectionName, new float[] { -1, -1, -1 }, limit: 1, minRelevanceScore: 1).ToEnumerable().ToArray();

    // Assert
    Assert.Equal("วรรณยุกต์", thaiGetActual!.Metadata.Text);
    Assert.Equal("วรรณยุกต์", thaiGetActual!.Metadata.Description);
    Assert.Equal("চলিতভাষা", bengaliGetActual!.Metadata.Text);
    Assert.Equal("চলিতভাষা", bengaliGetActual!.Metadata.Description);
    Assert.Equal("วรรณยุกต์", thaiSearchActual.First().Item1.Metadata.Text);
    Assert.Equal("วรรณยุกต์", thaiSearchActual.First().Item1.Metadata.Description);
    Assert.Equal("চলিতভাষা", bengaliSearchActual.First().Item1.Metadata.Text);
    Assert.Equal("চলিতভাষা", bengaliSearchActual.First().Item1.Metadata.Description);
}
@westey-m I apologize for the confusion caused by the title of the issue. The actual problem is that the recall function is returning encoded data. This happens because the MemoryRecord object is being serialized before it is inserted into the database, resulting in the recall function returning the encoded form of the data.
`var memoryRecord = new MemoryRecord { Text = "সিরাজগঞ্জ জেলার কোন ইতিহাস বা স্থাপনা", Description = "সিরাজগঞ্জ বাংলাদেশ এর একটি জেলা। এটি একটি গুরুত্বপূর্ণ ব্যবসায়িক কেন্দ্র এবং এটি যমুনা নদীর তীরে অবস্থিত। সিরাজগঞ্জের প্রধান আকর্ষণগুলির মধ্যে রয়েছে যমুনা সেতু, যা বাংলাদেশের অন্যতম বৃহত্তম সেতু।" };
string metadata = JsonSerializer.Serialize(memoryRecord);
`
@atiq-bs23, thanks for clarifying, that makes sense now. I think the right approach then is to allow a developer to provide an optional JsonSerializerOptions via the TextMemoryPlugin constructor, since JsonSerializer actually allows control over how this encoding happens. That way, we don't need to fix the data afterwards, it'll just be serialized as required to begin with. Making this an option also means that anyone who is relying on the encoding today isn't broken.
See this example of how JsonSerializerOptions controls this behavior:
// Arrange
MemoryRecord record = MemoryRecord.LocalRecord(
id: "test",
text: "วรรณยุกต์ চলিতভাষা",
description: "",
embedding: new float[] { 1, 1, 1 });
// Act
var defaultActual = JsonSerializer.Serialize(record);
var thaiEncoderActual = JsonSerializer.Serialize(record, new JsonSerializerOptions { Encoder = JavaScriptEncoder.Create(UnicodeRanges.BasicLatin, UnicodeRanges.Thai) });
var thaiAndBengaliEncoderActual = JsonSerializer.Serialize(record, new JsonSerializerOptions { Encoder = JavaScriptEncoder.Create(UnicodeRanges.BasicLatin, UnicodeRanges.Thai, UnicodeRanges.Bengali) });
var allUnsafeEncoderActual = JsonSerializer.Serialize(record, new JsonSerializerOptions { Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping });
// Assert
var defaultExpected = """{"embedding":[1,1,1],"metadata":{"is_reference":false,"external_source_name":"","id":"test","description":"","text":"\u0E27\u0E23\u0E23\u0E13\u0E22\u0E38\u0E01\u0E15\u0E4C \u099A\u09B2\u09BF\u09A4\u09AD\u09BE\u09B7\u09BE","additional_metadata":""},"key":"","timestamp":null}""";
Assert.Equal(defaultExpected, defaultActual);
var thaiEncoderExpected = """{"embedding":[1,1,1],"metadata":{"is_reference":false,"external_source_name":"","id":"test","description":"","text":"วรรณยุกต์ \u099A\u09B2\u09BF\u09A4\u09AD\u09BE\u09B7\u09BE","additional_metadata":""},"key":"","timestamp":null}""";
Assert.Equal(thaiEncoderExpected, thaiEncoderActual);
var thaiAndBengaliEncoderExpected = """{"embedding":[1,1,1],"metadata":{"is_reference":false,"external_source_name":"","id":"test","description":"","text":"วรรณยุกต์ চলিতভাষা","additional_metadata":""},"key":"","timestamp":null}""";
Assert.Equal(thaiAndBengaliEncoderExpected, thaiAndBengaliEncoderActual);
var allUnsafeEncoderExpected = """{"embedding":[1,1,1],"metadata":{"is_reference":false,"external_source_name":"","id":"test","description":"","text":"วรรณยุกต์ চলিতভাষা","additional_metadata":""},"key":"","timestamp":null}""";
Assert.Equal(allUnsafeEncoderExpected, allUnsafeEncoderActual);
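As a side note, the `\uXXXX` form produced by the default encoder is still valid JSON, so nothing is lost as long as the consumer parses the string as JSON instead of displaying it raw. A minimal sketch of that round trip, using only System.Text.Json (the variable names are illustrative and not taken from the plugin):

```csharp
using System;
using System.Text.Json;

// Sample Thai text, the same string used in the test above.
string text = "วรรณยุกต์";

// The default encoder escapes every non-ASCII character as \uXXXX.
string escaped = JsonSerializer.Serialize(text);

// Parsing the escaped JSON restores the original characters.
string? roundTripped = JsonSerializer.Deserialize<string>(escaped);

Console.WriteLine(escaped);
Console.WriteLine(roundTripped == text); // prints "True"
```

This only helps callers that deserialize the result themselves; when the raw string is shown to a user or passed into a prompt, configuring the encoder is the more direct fix.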
@westey-m Yes, there should be an option to pass JsonSerializerOptions.
await _kernel.InvokeAsync(_memoryPlugin["Save"], new KernelArguments()
{
    [TextMemoryPlugin.InputParam] = $"{memory}",
    [TextMemoryPlugin.CollectionParam] = memoryCollectionName,
    [TextMemoryPlugin.KeyParam] = key
});
@atiq-bs23, I wouldn't expect the options to be passed during InvokeAsync but rather when constructing the plugin, e.g.
kernel.ImportPluginFromObject(new Microsoft.SemanticKernel.Plugins.Memory.TextMemoryPlugin(textMemory, myJsonSerializerOptions));
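For reference, a rough sketch of how `myJsonSerializerOptions` might be built for the Thai and Bengali cases in this issue. The options object uses only System.Text.Json. The plugin call is left as a comment because it needs the Semantic Kernel packages, and the constructor shape shown is the one proposed in this thread, so the merged signature may differ:

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

// Build the options once and reuse them; only the listed Unicode ranges are left unescaped.
var myJsonSerializerOptions = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.Create(
        UnicodeRanges.BasicLatin,
        UnicodeRanges.Thai,
        UnicodeRanges.Bengali),
};

// Proposed usage from this thread (assumes `kernel` and `textMemory` already exist):
// kernel.ImportPluginFromObject(new TextMemoryPlugin(textMemory, myJsonSerializerOptions));

// Quick check outside the plugin: the Thai search text is written without \u escapes.
string json = JsonSerializer.Serialize("สำหรับรายละเอียด", myJsonSerializerOptions);
Console.WriteLine(json);
```

Listing specific ranges keeps escaping in place for everything else; `JavaScriptEncoder.UnsafeRelaxedJsonEscaping` is the broader alternative shown in the earlier example.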
Okay @westey-m thanks a lot.
Reopening the issue since I am creating a PR to fix this as discussed above.
Closing since the fix is merged.