
Elasticsearch configuration in Haystack 2.0


As I am trying to write my documents to the Elasticsearch Document Store, I am encountering some issues when using the multi-qa-mpnet-base-dot-v1 model for my embeddings.

I have initialized my document store in the following way:

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument, TextFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.joiners import DocumentJoiner
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.routers import FileTypeRouter
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchEmbeddingRetriever
import os

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200", embedding_similarity_function="dot_product")

# local path to the sentence-transformers model
EMBEDDING_MODEL = "./multi-qa-mpnet-base-dot-v1/"

file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf"])
text_file_converter = TextFileToDocument()

pdf_converter = PyPDFToDocument()
document_joiner = DocumentJoiner()

document_cleaner = DocumentCleaner()
document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50)

document_embedder = SentenceTransformersDocumentEmbedder(model=EMBEDDING_MODEL)
document_writer = DocumentWriter(document_store)

# indexing pipeline: route by file type -> convert -> join -> clean -> split -> embed -> write
preprocessing_pipeline = Pipeline()

preprocessing_pipeline.add_component(instance=file_type_router, name="file_type_router")
preprocessing_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
preprocessing_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
preprocessing_pipeline.add_component(instance=document_joiner, name="document_joiner")
preprocessing_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
preprocessing_pipeline.add_component(instance=document_splitter, name="document_splitter")
preprocessing_pipeline.add_component(instance=document_embedder, name="document_embedder")
preprocessing_pipeline.add_component(instance=document_writer, name="document_writer")

preprocessing_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
preprocessing_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
preprocessing_pipeline.connect("text_file_converter", "document_joiner")
preprocessing_pipeline.connect("pypdf_converter", "document_joiner")
preprocessing_pipeline.connect("document_joiner", "document_cleaner")
preprocessing_pipeline.connect("document_cleaner", "document_splitter")
preprocessing_pipeline.connect("document_splitter", "document_embedder")
preprocessing_pipeline.connect("document_embedder", "document_writer")

directory = "./docs"

file_paths = [os.path.join(directory, file_path) for file_path in os.listdir(directory)]

preprocessing_pipeline.run(
    {
        "file_type_router": {
            "sources": file_paths
        }
    }
)

When I run the pipeline, I get the following error:

DocumentStoreError: Failed to write documents to Elasticsearch. Errors:
[{'create': {'_index': 'default', '_id': '424456ea6a4b8ce49ddf71f84c4d10d4dedfe34dd5b91212f7b463b87462991a', 'status': 400, 'error': {'type': 'document_parsing_exception', 'reason': "[1:9089] failed to parse: The [dense_vector] field [embedding] in doc [document with id '424456ea6a4b8ce49ddf71f84c4d10d4dedfe34dd5b91212f7b463b87462991a'] has more dimensions than defined in the mapping [384]..."

Is there any way I can change the number of dimensions allowed for the Elasticsearch vector index?

On a separate note, does the Elasticsearch document store allow us to configure user access controls as described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/authorization.html

Thank you!

ss2342 commented on Feb 13 '24 05:02

1. Starting from Elasticsearch 8.11.1, the dimensions of vectors are automatically inferred when the first vector is written to the index. So I would recommend that you:

    • check whether you are using ES >= 8.11.1
    • specify a new index (see the API reference); a minimal sketch follows below this list
2. If I understand correctly, authorization involves how you connect to ES. The __init__ should be flexible enough to support this.
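
For illustration, a minimal sketch of both points. The index name and the basic_auth credentials here are placeholders, and it is an assumption that extra keyword arguments are forwarded to the underlying Elasticsearch client:

from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(
    hosts="http://localhost:9200",
    # a fresh index, so the mapping dimensions are inferred from the first embedding written
    index="docs-mpnet",
    embedding_similarity_function="dot_product",
    # placeholder credentials, forwarded to the Elasticsearch client
    basic_auth=("elastic", "changeme"),
)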

anakin87 commented on Feb 13 '24 08:02

@anakin87 thanks for the response! Creating a new index did fix that error. However, I ran into a new one relating to the dot-product similarity score.

DocumentStoreError: Failed to write documents to Elasticsearch. Errors:
[{'create': {'_index': 'new', '_id': '424456ea6a4b8ce49ddf71f84c4d10d4dedfe34dd5b91212f7b463b87462991a', 'status': 400, 'error': {'type': 'document_parsing_exception', 'reason': '[1:16851] failed to parse: The [dot_product] similarity can only be used with unit-length vectors. Preview of invalid vector: [0.30049285, -0.49391782, -0.19158463, 0.16830371, 0.22430387, ...]', 'caused_by': {'type': 'illegal_argument_exception'

Based on this Stack Overflow post, it has to do with the NumPy float type.
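
For reference, the underlying issue is that Elasticsearch's [dot_product] similarity only accepts unit-length vectors. A minimal workaround sketch, assuming you L2-normalize the embeddings before they reach the writer (the normalize_documents helper is hypothetical, not a Haystack API); alternatively, creating the store with embedding_similarity_function="cosine" avoids the requirement:

import numpy as np

def normalize_documents(documents):
    # scale each embedding to unit length so dot_product accepts it
    for doc in documents:
        vec = np.asarray(doc.embedding, dtype=np.float32)
        doc.embedding = (vec / np.linalg.norm(vec)).tolist()
    return documents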

ss2342 commented on Feb 13 '24 16:02