Generating KB embeddings does not work if AZURE_OPENAI_API_KEY is set
Generating the KB embeddings with
python dataset_generation/generate_kb_embeddings.py --dataset_path datasets/enron.json --output_path datasets --model_name text-embedding-3-small
while the environment variable AZURE_OPENAI_API_KEY is set to a valid value causes the following error:
File "/home/fokus/miniforge3/envs/kblam/lib/python3.13/site-packages/openai/_base_client.py", line 919, in request return self._request( ~~~~~~~~~~~~~^ cast_to=cast_to, ^^^^^^^^^^^^^^^^ ...<3 lines>... retries_taken=retries_taken, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "/home/fokus/miniforge3/envs/kblam/lib/python3.13/site-packages/openai/_base_client.py", line 1023, in _request raise self._make_status_error_from_response(err.response) from None openai.AuthenticationError: Error code: 401 - {'statusCode': 401, 'message': 'Unauthorized. Access token is missing, invalid, audience is incorrect (urn:ms.scopedToken or urn:ms.faceSessionToken), or have expired.'}
Since I am new to Azure, I first thought this behavior was caused by wrong permissions in Azure. After two days of frustrating trial and error I finally found out that the problem is caused in src/kblam/gpt_session.py, line 45:
azure_ad_token_provider=token_provider,
where token_provider is obtained from
def _get_credential(self, lib_name: str = "azure_openai") -> DeviceCodeCredential:
This does not take the environment variable into account. Simply removing line 45 lets the openai library apply its own rules and pick up the credentials from the environment variables. If one wants to keep caching credentials, it would be better to modify _get_credential so that the openai rules are checked first. But if the credentials live in environment variables there is no need to cache them at all.
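A minimal sketch of what I mean (assuming the client is built roughly the way gpt_session.py does it; build_client, endpoint and api_version here are placeholders, not the actual names in the repo):

```python
import os

from openai import AzureOpenAI
from azure.identity import DeviceCodeCredential, get_bearer_token_provider


def build_client(endpoint: str, api_version: str) -> AzureOpenAI:
    # Prefer the API key from the environment; the openai lib reads
    # AZURE_OPENAI_API_KEY itself, so no token provider (and no credential
    # caching) is needed in that case.
    if os.getenv("AZURE_OPENAI_API_KEY"):
        return AzureOpenAI(azure_endpoint=endpoint, api_version=api_version)

    # No key in the environment: fall back to the existing interactive
    # device-code login.
    token_provider = get_bearer_token_provider(
        DeviceCodeCredential(), "https://cognitiveservices.azure.com/.default"
    )
    return AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=api_version,
        azure_ad_token_provider=token_provider,
    )
```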
Additionally, hard-coding the api-version on line 23 of src/kblam/gpt_session.py isn't a good idea, since it has already changed ...
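Something along these lines would avoid the hard-coded value (OPENAI_API_VERSION is the variable the openai lib already honours when api_version is omitted; the default shown is only an example):

```python
import os

# Let the environment override the pinned default instead of hard-coding
# the api-version in the source.
API_VERSION = os.getenv("OPENAI_API_VERSION", "2024-02-01")
```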