semantic-kernel Pinecone dataconnector

Motivation and Context

Description

Contribution Checklist

[x] The code builds clean without any errors or warnings
[x] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
[x] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with dotnet format
[x] All unit tests pass, and I have added new tests where possible
[x] I didn't break anyone :smile:

May 02 '23 16:05 Kevdome3000

@shawncal @dmytrostruk

May 02 '23 17:05 Kevdome3000

First of all, thanks for the awesome job! A lot of new code was added and it's going to be very useful for Semantic Kernel users!

Before merging it to main, a couple of things need to be resolved:

[x] Resolve all PR comments.

[x] Current PR is missing Pinecone.csproj in solution file, please add it and verify that solution builds.

[x] Resolve warning Call System.IDisposable.Dispose on object created by 'batch.ToNamespace(indexNamespace).Build()' before all references to it are out of scope | dotnet\src\Connectors\Connectors.Memory.Pinecone\PineconeClient.cs | Line 249

[x] Resolve all warnings ... change parameter name cancel to cancellationToken in order to match the identifier as it has been declared in .... In short, please use cancellationToken parameter name instead of cancel in all places.

[ ] Ensure that example in Example29_Pinecone.cs file works without any issues.

[x] Ensure that unit tests are passed.

As soon as this list is completed, I think the PR will be ready for merge to main. Thanks again!

These have been resolved except I haven't be able to confirm Example31_Pinecone.cs works because I'm getting a 401 from my OpenAI APiKey all of a sudden. You should be able to run it on your end

May 03 '23 21:05 Kevdome3000

These have been resolved except I haven't be able to confirm Example31_Pinecone.cs works because I'm getting a 401 from my OpenAI APiKey all of a sudden. You should be able to run it on your end

Hey, I think it's not your API key, the reason is that we updated interface for adding backends and now it doesn't accept service label, so during your testing the key was passed as a wrong parameter. I fixed the example in this commit: https://github.com/microsoft/semantic-kernel/pull/770/commits/1d32b4d914f8ec6650794b8793175a8c77e08f06

Please try again and let me know if that works for you.

May 04 '23 12:05 dmytrostruk

I tested manually and there are a couple of things here:

PineconeEnvironment is currently an enumeration with predefined values (us-west1-gcp, us-west4-gcp etc.). In my free account the environment is different - us-west1-gcp-free, so I'm not able to select the valid environment programmatically with current implementation:

Suggestion: in case if Pinecone environment list can be flexible, maybe it's better to convert that property to string. It will also help us to avoid situations when Pinecone will add a new geo/region/environment to their list and we will have to update our enumeration manually. Same with PodType, please ensure that this property needs to be enum instead of plain string. In case Pod type list is potentially extensible in the future, it is probably better to convert it to string as well.

I tried to run current example and it doesn't work, on index creation step I'm receiving 500 Internal Server Error without error message. Here is the body which we are sending:
{
  "name": "pinecone-test",
  "metric": "cosine",
  "pod_type": "p1.x1",
  "dimension": 0,
  "pods": 0,
  "replicas": 0
}
I took this body and tried to send the same parameters in Pinecone playground and got the same problem. I believe the reason is in dimension and/or pods and/or replicas, it looks like they can't be zero.

In PineconeMemoryStore we initialize the index object in the way, that the parameters above are zero:

So it looks like a bug.

Please, ensure that the example you provided works without any errors. We need to make sure that the solution we provide is working. If needed, please add more tests to verify that it's working as expected. Thank you!

On the free account you can only have 1x1 pod. So once an index is created with 1 pod and 1 replica. All subsequent indexed must be 1 pod, 0 replicas. Also dimension is a required field in index creation.

I agree that we should be careful with the enumeration re: environment passed as this may change.

May 04 '23 13:05 tawalke

These have been resolved except I haven't be able to confirm Example31_Pinecone.cs works because I'm getting a 401 from my OpenAI APiKey all of a sudden. You should be able to run it on your end

Hey, I think it's not your API key, the reason is that we updated interface for adding backends and now it doesn't accept service label, so during your testing the key was passed as a wrong parameter. I fixed the example in this commit: 1d32b4d

Please try again and let me know if that works for you.

is this the fix you are refering to?


 IKernel kernel = Kernel.Builder
            .WithLogger(ConsoleLogger.Log)
            .Configure(c =>
            {
                c.AddOpenAITextCompletionService("text-davinci-003", Env.Var("OPENAI_API_KEY"));
                c.AddOpenAITextEmbeddingGenerationService("text-embedding-ada-002", Env.Var("OPENAI_API_KEY"));
            })
            .WithMemoryStorage(memoryStore)
            .Build();

if so, I am still getting a 401 error and confirmed it's not my api key by calling the open ai api direct to generate an embedding.

I even tried something simple like this and still get the same error.

Console.WriteLine( Env.Var( "OPENAI_API_KEY"));
        var embeddingGenerator = new OpenAITextEmbeddingGeneration("text-embedding-ada-002", Env.Var("OPENAI_API_KEY"));
        
        var embedding = await embeddingGenerator.GenerateEmbeddingAsync("My favorite color is orange");

        Console.WriteLine( $"Embedding: {string.Join(", ", embedding)}");

here's the error:

Unhandled exception. Microsoft.SemanticKernel.AI.AIException: Access denied: The request is not authorized, HTTP status: 401
   at Microsoft.SemanticKernel.Connectors.AI.OpenAI.AzureSdk.ClientBase.RunRequestAsync[T](Func`1 request) in semantic-kernel/dotnet/src/Connectors/Connectors.AI.OpenAI/AzureSdk/ClientBase.cs:line 259
   at Microsoft.SemanticKernel.Connectors.AI.OpenAI.AzureSdk.ClientBase.InternalGenerateTextEmbeddingsAsync(IList`1 data, CancellationToken cancellationToken) in semantic-kernel/dotnet/src/Connectors/Connectors.AI.OpenAI/AzureSdk/ClientBase.cs:line 101
   at Microsoft.SemanticKernel.AI.Embeddings.EmbeddingGenerationExtensions.GenerateEmbeddingAsync[TValue,TEmbedding](IEmbeddingGeneration`2 generator, TValue value, CancellationToken cancellationToken) in semantic-kernel/dotnet/src/SemanticKernel.Abstractions/AI/Embeddings/IEmbeddingGeneration.cs:line 47
   at Example32_Pinecone.RunAsync() in semantic-kernel/samples/dotnet/kernel-syntax-examples/Example32_Pinecone.cs:line 29

May 04 '23 20:05 Kevdome3000

here is the list of environments from Pinecone: https://docs.pinecone.io/docs/projects

looks like the us-west1-gcp-free environment was just added so I'll add an enum for it. My view is that keeping it an enum will alleviate a lot of formatting/ misspelling issues making it more developer friendly, and adding another option is simple should pinecone add more environments.

I just created another free account with a p2.x1 (fastest , and the one I prefer) and was given the environment us-west4-gcp , so it's unclear why they have us-west1-gcp-free.

as for creating index with the following body:

{
  "name": "pinecone-test",
  "metric": "cosine",
  "pod_type": "p1.x1",
  "dimension": 0,
  "pods": 0,
  "replicas": 0
}

Dimension and pods can never be zero. I just added default values for them

May 04 '23 21:05 Kevdome3000

Can someone please try the PineconeExample_32 in the KernalSyntaxExamples? I'm still blocked

May 04 '23 21:05 Kevdome3000

looks like the us-west1-gcp-free environment was just added so I'll add an enum for it. My view is that keeping it an enum will alleviate a lot of formatting/ misspelling issues making it more developer friendly, and adding another option is simple should pinecone add more environments.

You didn't add us-west1-gcp-free environment in this place:

That's what I'm talking about, in order to add new environment as enum value, we should also remember to add it in these two methods. So we either should avoid these two methods and keep PineconeEnvironment enum as one source of truth of Pinecone environments, or make it just string. Also, in new code I don't see the usage of these enum values. In Example32_Pinecone, we read environment from Env.Var as string, for Pinecone API we also use string value:

The way I see it, since Pinecone environment is used only to construct API base URL and those environments can be extended in the future, I'm not sure we need to provide enumeration for Semantic Kernel users. If users want to maintain Pinecone environments in specific way, they can always add this enum on their side, but pass string value to our implementation.

May 04 '23 22:05 dmytrostruk

Also, looks like there is no Asia environment in PineconeEnvironment enumeration:

So it's not clear why we list only some of them and not all of them. Again, as I suggested previously, string type for Pinecone environment should work fine.

May 04 '23 22:05 dmytrostruk

Can someone please try the PineconeExample_32 in the KernalSyntaxExamples? I'm still blocked

Just tested it. This time I was able to create an index, but I received TimeoutException here:

My index is still in Initializing state in Pinecone, already 10 minutes. We can't block the execution and wait for Ready index state for so long, even for 1 minute. It looks like it is better to implement PineconeMemoryStore only in the way, when the index already exists. Otherwise, throw an error immediately.

May 04 '23 22:05 dmytrostruk

if so, I am still getting a 401 error and confirmed it's not my api key by calling the open ai api direct to generate an embedding.

I was able to generate an embedding with my OpenAI key, within Example32_Pinecone. Please ensure you set valid key in Env.Var. But I received TimeoutException in PineconeMemoryStore, see comment above.

May 04 '23 22:05 dmytrostruk

Also, looks like there is no Asia environment in PineconeEnvironment enumeration:

So it's not clear why we list only some of them and not all of them. Again, as I suggested previously, string type for Pinecone environment should work fine.

Fair point... I'm goin to remove the enum completely

May 05 '23 00:05 Kevdome3000

@Kevdome3000 please let me know what you think about my last comment above. In case you need help, I can join you and assist in implementation. Thanks for all the hard work and contribution, we are very close to merging!

May 05 '23 12:05 dmytrostruk

@Kevdome3000 please let me know what you think about my last comment above. In case you need help, I can join you and assist in implementation. Thanks for all the hard work and contribution, we are very close to merging!

Current status and feedback: I was able to save information to Pinecone vectors and then perform a search within Example32_Pinecone.cs, which is great! I think we are almost on the finish line 😃🎉

Remaining issues (sorry for long text, but it's important to describe it in detail):

[x] Creating an index in Pinecone is asynchronous operation, and we receive 201 Created, index state is Initializing and we should wait until the state will be Ready. When I tried the integration for the first time, my index initialization failed (state was InitializationFailed, looks like internal problem in Pinecone). This flow doesn't fit in current IMemoryStore implementation, where we require to create an index/collection immediately. Suggestion: for now we can remove logic for creating an index. In case it doesn't exist in user's account we can throw an exception and specify that index should be created manually before usage (in similar way as table in SQL database, which we usually create only once and then re-use it).

[x] Each index (a.k.a. collection) in Pinecone has its own host, which we should use to make API calls. In current implementation, we have EnsurePineconeClientReadyAsync in PineconeMemoryStore which fetches index host before doing vector operations. Ideally, PineconeMemoryStore shoudn't know about specifics of Pinecone API, it should use PineconeClient to make API operations without ensuring that the host is ready and if not, then fetching index definition and then performing an operations. Suggestion: Move all logic for "reconnection" (i.e. fetching index host when it's not there) to PineconeClient and make it internal. In the same way how we would implement OAuth protocol to fetch access token before making actual request.

[ ] Caching: in PineconeClient we have member PineconeIndex? _indexDescription which works as a cache, it stores index information, including host value, which is required for making API calls. If we call kernel.Memory.SaveInformationAsync three times with three different index/collection names, _indexDescription member will be re-assigned, so it keeps information only about last index. Suggestion: Do we actually need to fetch index host at all? Index host template is following: https://{index_name}-{project_id}.svc.{environment}.pinecone.io, which means that it's constant once you created index, relationship between index name and index host is 1:1. Instead of indexName user can pass indexHost and avoid extra calls to fetch that {project-id} parameter, which will never change. In case we still want to pass indexName, so the user will pass just pinecone-test instead of long pinecone-test-d6e9d39.svc.us-west1-gcp-free.pinecone.io, then we can make a call and fetch index host, but for caching introduce ConcurrentDictionary<string, string>, where key will be index name and value will be index host.

I'm on board with all of the feedback. My plan was to do as you suggested and move IntitalizeAsync and EnsurePineconeClientReadyAsync into Pinecone client and merge with WaitForIndexState. This way a user can do whatever IndexOperation they may need to do and then when ready create a PineconeMemoryStore instance with the client.

On the point about fetching the index host, I agree that it would be ideal to simply pass the index name and use a Index host template, even Pinecone documentation suggests to use https://index_name-{indexName}.svc.environment.pinecone.io but i was never able to get that to work. I hadn't thought about combining the project id largely because I don't know where to find it. If you can get that to work then great!

May 05 '23 19:05 Kevdome3000

@Kevdome3000 please let me know what you think about my last comment above. In case you need help, I can join you and assist in implementation. Thanks for all the hard work and contribution, we are very close to merging!

Help would be awesome! Sorry I've been slow to get this last step over the line, just haven't be able to find time

May 05 '23 19:05 Kevdome3000

@Kevdome3000 please let me know what you think about my last comment above. In case you need help, I can join you and assist in implementation. Thanks for all the hard work and contribution, we are very close to merging!

Current status and feedback: I was able to save information to Pinecone vectors and then perform a search within Example32_Pinecone.cs, which is great! I think we are almost on the finish line 😃🎉 Remaining issues (sorry for long text, but it's important to describe it in detail):

[x] Creating an index in Pinecone is asynchronous operation, and we receive 201 Created, index state is Initializing and we should wait until the state will be Ready. When I tried the integration for the first time, my index initialization failed (state was InitializationFailed, looks like internal problem in Pinecone). This flow doesn't fit in current IMemoryStore implementation, where we require to create an index/collection immediately. Suggestion: for now we can remove logic for creating an index. In case it doesn't exist in user's account we can throw an exception and specify that index should be created manually before usage (in similar way as table in SQL database, which we usually create only once and then re-use it).

[x] Each index (a.k.a. collection) in Pinecone has its own host, which we should use to make API calls. In current implementation, we have EnsurePineconeClientReadyAsync in PineconeMemoryStore which fetches index host before doing vector operations. Ideally, PineconeMemoryStore shoudn't know about specifics of Pinecone API, it should use PineconeClient to make API operations without ensuring that the host is ready and if not, then fetching index definition and then performing an operations. Suggestion: Move all logic for "reconnection" (i.e. fetching index host when it's not there) to PineconeClient and make it internal. In the same way how we would implement OAuth protocol to fetch access token before making actual request.

[ ] Caching: in PineconeClient we have member PineconeIndex? _indexDescription which works as a cache, it stores index information, including host value, which is required for making API calls. If we call kernel.Memory.SaveInformationAsync three times with three different index/collection names, _indexDescription member will be re-assigned, so it keeps information only about last index. Suggestion: Do we actually need to fetch index host at all? Index host template is following: https://{index_name}-{project_id}.svc.{environment}.pinecone.io, which means that it's constant once you created index, relationship between index name and index host is 1:1. Instead of indexName user can pass indexHost and avoid extra calls to fetch that {project-id} parameter, which will never change. In case we still want to pass indexName, so the user will pass just pinecone-test instead of long pinecone-test-d6e9d39.svc.us-west1-gcp-free.pinecone.io, then we can make a call and fetch index host, but for caching introduce ConcurrentDictionary<string, string>, where key will be index name and value will be index host.

I'm on board with all of the feedback. My plan was to do as you suggested and move IntitalizeAsync and EnsurePineconeClientReadyAsync into Pinecone client and merge with WaitForIndexState. This way a user can do whatever IndexOperation they may need to do and then when ready create a PineconeMemoryStore instance with the client.

On the point about fetching the index host, I agree that it would be ideal to simply pass the index name and use a Index host template, even Pinecone documentation suggests to use https://index_name-{indexName}.svc.environment.pinecone.io but i was never able to get that to work. I hadn't thought about combining the project id largely because I don't know where to find it. If you can get that to work then great!

My latest commit should resolve these issues except the question around Caching

May 08 '23 04:05 Kevdome3000

@Kevdome3000 I added caching and removed unused methods. Please see my commit here: https://github.com/microsoft/semantic-kernel/pull/770/commits/2dac6b96d96fd9a7ac6c673791e9412b5a91189d Please let me know if it looks good to you, and if yes, then we are ready to merge it to main.

May 08 '23 14:05 dmytrostruk

@Kevdome3000 I added caching and removed unused methods. Please see my commit here: 2dac6b9 Please let me know if it looks good to you, and if yes, then we are ready to merge it to main.

looks good and thanks for the assist!

May 08 '23 16:05 Kevdome3000

@shawncal Could you please take a look and check if we ready to merge this to main?

May 10 '23 09:05 dmytrostruk

@shawncal Shawn Callegari FTE Could you please take a look and check if we ready to merge this to main?

Certainly. Just updated the branch.

May 10 '23 20:05 shawncal

Hey @Kevdome3000 looks like this is nearly ready to go! You'll notice we checked in some new dotnet analyzers today that broke the formatting checks, but that should be easy to fix (most will be automatic). If it's alright with you, I'd like to retarget your PR to a feature branch, "feature-pinecone", so that I can more easily make those style updates for you and run a battery of tests, then merge to main shortly thereafter. Sound good?

May 18 '23 03:05 shawncal

Hey @Kevdome3000 looks like this is nearly ready to go! You'll notice we checked in some new dotnet analyzers today that broke the formatting checks, but that should be easy to fix (most will be automatic). If it's alright with you, I'd like to retarget your PR to a feature branch, "feature-pinecone", so that I can more easily make those style updates for you and run a battery of tests, then merge to main shortly thereafter. Sound good?

Indeed, let me know if you need anything else from

May 18 '23 03:05 Kevdome3000