Elasticsearch configuration in Haystack 2.0
As I am trying to write my documents to the Elasticsearch Document Store, I am encountering some issues when using the multi-qa-mpnet-base-dot-v1 model for my embeddings.
I have initialized my document store and pipeline in the following way:
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument, TextFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.joiners import DocumentJoiner
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.routers import FileTypeRouter
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever
import os

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200", embedding_similarity_function="dot_product")

# Local path to multi-qa-mpnet-base-dot-v1, which produces 768-dimensional embeddings
EMBEDDING_MODEL = "./multi-qa-mpnet-base-dot-v1/"
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf"])
text_file_converter = TextFileToDocument()
pdf_converter = PyPDFToDocument()
document_joiner = DocumentJoiner()
document_cleaner = DocumentCleaner()
# Split into 150-word chunks with a 50-word overlap
document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50)
document_embedder = SentenceTransformersDocumentEmbedder(model=EMBEDDING_MODEL)
document_writer = DocumentWriter(document_store)
preprocessing_pipeline = Pipeline()
preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
preprocessing_pipeline.add_component(instance=document_joiner, name="document_joiner")
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")
preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("text_file_converter", "document_joiner")
preprocessing_pipeline.connect("pypdf_converter", "document_joiner")
preprocessing_pipeline.connect("document_joiner", "document_cleaner")
preprocessing_pipeline.connect("document_cleaner", "document_splitter")
preprocessing_pipeline.connect("document_splitter", "document_embedder")
preprocessing_pipeline.connect("document_embedder", "document_writer")
directory = "./docs"
file_paths = [os.path.join(directory, file_path) for file_path in os.listdir(directory)]
preprocessing_pipeline.run(
    {
        "file_type_router": {
            "sources": file_paths
        }
    }
)
When I run the pipeline, I get the following error:
DocumentStoreError: Failed to write documents to Elasticsearch. Errors:
[{'create': {'_index': 'default', '_id': '424456ea6a4b8ce49ddf71f84c4d10d4dedfe34dd5b91212f7b463b87462991a', 'status': 400, 'error': {'type': 'document_parsing_exception', 'reason': "[1:9089] failed to parse: The [dense_vector] field [embedding] in doc [document with id '424456ea6a4b8ce49ddf71f84c4d10d4dedfe34dd5b91212f7b463b87462991a'] has more dimensions than defined in the mapping [384]..."
Is there any way I can change the number of dimensions allowed for the Elasticsearch vector index? The model produces 768-dimensional embeddings, but the existing mapping apparently expects 384.
On a separate note, does the Elasticsearch document store allow us to configure user access controls as described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/authorization.html
Thank you!
-
Starting from Elasticsearch 8.11.1, the dimensions of vectors are automatically inferred when the first vector is written to the index. So I would recommend that you:
- check whether you are using ES > 8.11.1
- specify a new index (see the API reference and the sketch below)
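For example, a minimal sketch of what this could look like (the index name "docs-mpnet" is just a placeholder):

# Writing to a fresh index lets Elasticsearch infer the 768 dimensions
# of multi-qa-mpnet-base-dot-v1 from the first vector written.
document_store = ElasticsearchDocumentStore(
    hosts="http://localhost:9200",
    index="docs-mpnet",
    embedding_similarity_function="dot_product",
)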
-
If I understand well, authorization involves how you connect to ES. The __init__ should be flexible enough to support this.
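A minimal sketch, assuming the extra keyword arguments are forwarded to the underlying Elasticsearch Python client (the credentials shown are placeholders):

# Sketch, assuming **kwargs reach the Elasticsearch client;
# "elastic" / "<password>" are placeholders for real credentials.
document_store = ElasticsearchDocumentStore(
    hosts="https://localhost:9200",
    basic_auth=("elastic", "<password>"),  # or api_key="<your-api-key>"
)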
@anakin87 thanks for the response! Creating a new index did fix that error. However, I ran into a new one relating to the dot_product similarity score:
DocumentStoreError: Failed to write documents to Elasticsearch. Errors:
[{'create': {'_index': 'new', '_id': '424456ea6a4b8ce49ddf71f84c4d10d4dedfe34dd5b91212f7b463b87462991a', 'status': 400, 'error': {'type': 'document_parsing_exception', 'reason': '[1:16851] failed to parse: The [dot_product] similarity can only be used with unit-length vectors. Preview of invalid vector: [0.30049285, -0.49391782, -0.19158463, 0.16830371, 0.22430387, ...]', 'caused_by': {'type': 'illegal_argument_exception'
Based on this Stack Overflow post, it has to do with the np float type.
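As the error message itself says, Elasticsearch's dot_product similarity only accepts unit-length vectors. A sketch of one possible workaround: SentenceTransformersDocumentEmbedder has a normalize_embeddings flag that scales each embedding to unit length before it is written:

# Sketch: normalizing embeddings to unit length should satisfy the
# dot_product requirement quoted in the error above.
document_embedder = SentenceTransformersDocumentEmbedder(
    model=EMBEDDING_MODEL,
    normalize_embeddings=True,
)

Alternatively, the store's embedding_similarity_function could be set to "max_inner_product", which does not require unit-length vectors.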