
OpenAI embedding Special Token Error

ayush4921 opened this issue on Feb 25, 2024 · 1 comment

Describe the bug
When updating embeddings with an OpenAI embedding model such as text-embedding-3-small, we get the following error message.

Error message

```
  File "/Users/ayushgarg/projects/answerThis/answerthis/answerthis.py", line 81, in createEmbeddings
    self.document_store.update_embeddings(self.DenseRetriever)
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/qdrant_haystack/document_stores/qdrant.py", line 374, in update_embeddings
    embeddings = retriever.embed_documents(document_batch)
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/haystack/nodes/retriever/dense.py", line 1862, in embed_documents
    return self.embedding_encoder.embed_documents(documents)
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/haystack/nodes/retriever/_openai_encoder.py", line 147, in embed_documents
    return self.embed_batch(self.doc_encoder_model, [d.content for d in docs])
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/haystack/nodes/retriever/_openai_encoder.py", line 138, in embed_batch
    batch_limited = [self._ensure_text_limit(content) for content in batch]
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/haystack/nodes/retriever/_openai_encoder.py", line 138, in <listcomp>
    batch_limited = [self._ensure_text_limit(content) for content in batch]
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/haystack/nodes/retriever/_openai_encoder.py", line 80, in _ensure_text_limit
    n_tokens = len(self._tokenizer.encode(text))
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/tiktoken/core.py", line 117, in encode
    raise_disallowed_special_token(match.group())
  File "/Users/ayushgarg/projects/answerThis/venv/lib/python3.8/site-packages/tiktoken/core.py", line 400, in raise_disallowed_special_token
    raise ValueError(
ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'. If you want this text to be encoded as a special token, pass it to allowed_special, e.g. allowed_special={'<|endoftext|>', ...}. If you want this text to be encoded as normal text, disable the check for this token by passing disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'}). To disable this check for all special tokens, pass disallowed_special=().
```

Expected behavior
Embeddings should update without any issues.

System:

  • Haystack version (commit or version number): 1.24.1
  • DocumentStore: Qdrant

ayush4921 commented on Feb 25, 2024

Probably the text contains some special tokens. Removing them may solve the problem.
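As a user-side workaround, the special-token strings can be stripped from the document content before calling `update_embeddings`. A minimal sketch; the helper name is illustrative, and the token list covers the special tokens defined for the cl100k_base encoding:

```python
# Hypothetical helper: remove OpenAI special-token strings from raw text
# before it reaches the embedding encoder.
SPECIAL_TOKENS = (
    "<|endoftext|>",
    "<|fim_prefix|>",
    "<|fim_middle|>",
    "<|fim_suffix|>",
    "<|endofprompt|>",
)

def strip_special_tokens(text: str) -> str:
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    return text
```

This could be applied to each `Document.content` when building the documents, before they are written to the store.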

The issue could also be solved by modifying the code here https://github.com/deepset-ai/haystack/blob/4265c7cd51ed3ad26e2de7a8726dc8fccce066e1/haystack/nodes/retriever/_openai_encoder.py#L80 to pass allowed_special and/or disallowed_special to the tokenizer (see the tiktoken code).

I am not sure about making this change, as this is the first time this error has been reported.

anakin87 commented on Feb 26, 2024