OpenAI Embedding Special Token Error
Describe the bug
When updating embeddings with an OpenAI embedding model such as `text-embedding-3-small`, we get the error below.
Error message

```
  File "/Users/ayushgarg/projects/answerThis/answerthis/answerthis.py", line 81, in createEmbeddings
    self.document_store.update_embeddings(self.DenseRetriever)
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/qdrant_haystack/document_stores/qdrant.py", line 374, in update_embeddings
    embeddings = retriever.embed_documents(document_batch)
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/haystack/nodes/retriever/dense.py", line 1862, in embed_documents
    return self.embedding_encoder.embed_documents(documents)
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/haystack/nodes/retriever/_openai_encoder.py", line 147, in embed_documents
    return self.embed_batch(self.doc_encoder_model, [d.content for d in docs])
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/haystack/nodes/retriever/_openai_encoder.py", line 138, in embed_batch
    batch_limited = [self._ensure_text_limit(content) for content in batch]
  ...
ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, then pass it to allowed_special, e.g. allowed_special={'<|endoftext|>', ...}.
If you want this text to be encoded as normal text, disable the check for this token by passing disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'}).
To disable this check for all special tokens, pass disallowed_special=().
```
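For context, the failure comes from tiktoken's special-token check and can be reproduced without Haystack. A minimal sketch, assuming `tiktoken` is installed and the document text contains a literal `<|endoftext|>` (`cl100k_base` is the encoding used by `text-embedding-3-small`; the sample string is invented for illustration):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "some document text containing <|endoftext|> verbatim"

# Default: disallowed_special="all", so encoding this text raises ValueError.
try:
    enc.encode(text)
except ValueError as e:
    print(e)

# Workaround suggested by the error message: encode special tokens as plain text.
tokens = enc.encode(text, disallowed_special=())
print(len(tokens))
```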
Expected behavior
Embeddings should update without any issues.
System:
- Haystack version (commit or version number): 1.24.1
- DocumentStore: Qdrant
The text probably contains some special tokens; removing them before indexing may solve the problem, as in the sketch below.
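A minimal sketch of that workaround, assuming `tiktoken` is available; `strip_special_tokens` is a hypothetical helper, not a Haystack API, and `documents` stands for the Documents being indexed:

```python
import tiktoken

def strip_special_tokens(text: str, encoding_name: str = "cl100k_base") -> str:
    # Remove every tiktoken special token (e.g. <|endoftext|>) from the raw text.
    enc = tiktoken.get_encoding(encoding_name)
    for token in enc.special_tokens_set:
        text = text.replace(token, "")
    return text

# Apply before document_store.update_embeddings(...) is called:
for doc in documents:
    doc.content = strip_special_tokens(doc.content)
```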
The issue could also be solved by modifying the code here https://github.com/deepset-ai/haystack/blob/4265c7cd51ed3ad26e2de7a8726dc8fccce066e1/haystack/nodes/retriever/_openai_encoder.py#L80 to pass allowed_special and/or disallowed_special to the tokenizer (see the tiktoken code).
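For illustration, the change might look roughly like the sketch below. The internals (`self._tokenizer`, `self.max_seq_len`, the truncation logic) are assumptions about the encoder, not the actual Haystack code; only the added `disallowed_special=()` argument is the proposed fix:

```python
def _ensure_text_limit(self, text: str) -> str:
    # disallowed_special=() makes tiktoken encode special tokens such as
    # <|endoftext|> as ordinary text instead of raising ValueError.
    tokens = self._tokenizer.encode(text, disallowed_special=())
    if len(tokens) > self.max_seq_len:
        # Truncate to the model's token limit (assumed behavior of this method).
        tokens = tokens[: self.max_seq_len]
        text = self._tokenizer.decode(tokens)
    return text
```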
I am not sure we should make this change, as this is the first time this error has been reported.