Help with initializing the FAISS Document Store
Hey guys, I am initializing a document store at the start of a Flask app:
document_store = FAISSDocumentStore(sql_url="sqlite:///faiss_document_store.db",
embedding_dim=128, faiss_index_factory_str="Flat")
Whenever I update the document store it works fine, but as soon as I restart my Flask app I get:
"ValueError: The number of documents present in the SQL database does not match the number of embeddings in FAISS. Make sure your FAISS configuration file correctly points to the same database used when creating the original index."
Can anyone help me with this issue?
from dotenv import load_dotenv
from flask import Flask
from flask import request
from flask import abort, jsonify
from flask_expects_json import expects_json
import textract
import os
import requests
import bs4
from haystack.utils import convert_files_to_docs, clean_wiki_text
from haystack.nodes import Seq2SeqGenerator
from haystack.nodes import DensePassageRetriever
from haystack.document_stores import FAISSDocumentStore
from haystack.pipelines import GenerativeQAPipeline
app = Flask(__name__)
generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa")
document_store = FAISSDocumentStore(sql_url="sqlite:///faiss_document_store.db",
embedding_dim=128, faiss_index_factory_str="Flat")
@app.route('/read_file', methods=['POST'])
def read_file():
    if 'file' not in request.files:
        abort(400, "File not found")
    file = request.files['file']
    if file.filename == '':
        abort(400, "File name not found")
    if file:
        obj = request.files.get("file")
        print(obj)
        obj.save(os.path.join(os.getcwd(), "data", obj.filename))
        paragraphs = textract.process("data//" + obj.filename)
        # textract returns bytes, so write in binary mode (no encoding argument)
        with open('data//' + obj.filename + '.txt', 'wb') as f:
            f.write(paragraphs)
        filePath = 'faiss_document_store.db'
        if os.path.exists(filePath):
            os.remove(filePath)
        document_store = FAISSDocumentStore(
            embedding_dim=128, faiss_index_factory_str="Flat")
        dicts = convert_files_to_docs(
            dir_path='data//', clean_func=clean_wiki_text, split_paragraphs=True)
        document_store.write_documents(dicts)
        retriever = DensePassageRetriever(
            document_store=document_store,
            query_embedding_model="vblagoje/dpr-question_encoder-single-lfqa-wiki",
            passage_embedding_model="vblagoje/dpr-ctx_encoder-single-lfqa-wiki",
        )
        document_store.update_embeddings(retriever)
        retriever.save("retriever.pt")
        document_store.save('my_faiss_index.faiss')
        return jsonify({"message": "Data uploaded successfully"}), 200
    else:
        abort(400, "File not found")
@app.route('/read_url', methods=['POST'])
def read_url():
    request_data = request.get_json()
    if 'url' not in request_data:
        abort(400, "URL not specified")
    url = request_data['url']
    if url:
        response = requests.get(url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        text = ''
        filename = url.split("/")[-1]
        for paragraph in soup.find_all('p'):
            text += paragraph.text
        text = clean_wiki_text(text)
        with open('data//' + filename + '.txt', 'w', encoding="utf-8") as f:
            f.write(text)
        dicts = convert_files_to_docs(
            dir_path='data//', clean_func=clean_wiki_text, split_paragraphs=True)
        document_store.write_documents(dicts)
        retriever = DensePassageRetriever(
            document_store=document_store,
            query_embedding_model="vblagoje/dpr-question_encoder-single-lfqa-wiki",
            passage_embedding_model="vblagoje/dpr-ctx_encoder-single-lfqa-wiki",
        )
        document_store.update_embeddings(retriever)
        retriever.save("retriever.pt")
        document_store.save('my_faiss_index.faiss')
        return jsonify({"message": "Data uploaded successfully"}), 200
    else:
        abort(400, "URL not found")
@app.route('/process_query', methods=['POST'])
def process_query():
    if 'query' not in request.json:
        abort(400, "Query not found")
    query = request.json['query']
    if query == '':
        abort(400, "Query not found")
    if query:
        document_store = FAISSDocumentStore.load(
            index_path='my_faiss_index.faiss'
        )
        retriever = DensePassageRetriever.load(
            "retriever.pt", document_store=document_store)
        # retriever = DensePassageRetriever(
        #     document_store=document_store,
        #     query_embedding_model="vblagoje/dpr-question_encoder-single-lfqa-wiki",
        #     passage_embedding_model="vblagoje/dpr-ctx_encoder-single-lfqa-wiki",
        # )
        pipe = GenerativeQAPipeline(generator, retriever)
        res = pipe.run(
            query=query, params={"Retriever": {"top_k": 3}}
        )
        return jsonify({"results": res}), 200
    else:
        abort(400, "Query not found")

if __name__ == '__main__':
    app.run(host='0.0.0.0', debug=True, port=5000)
This is the code I am using.
Maybe you can find useful information on this issue: https://github.com/deepset-ai/haystack/issues/1019
I agree with @anakin87, the solution can be found in the linked issue. Please have a look and let us know if you need any more assistance 🙂
Hi,
The solution mentioned above is for an older version of Haystack. It's not working with the new version.
Ok, I will try to replicate this. In the meantime I can give you some hints for debugging:
- On which version of Haystack does this code work fine? That will help me a lot in figuring out what's wrong.
- I see you have a document_store instance in the global scope (line 22) which is initialized differently from the others and, apparently, never used. What do you need it for?
- I see you save or load the document store at each request. Are you sure your code is thread-safe? Different requests might be trying to save concurrently.
In general I believe you might be able to spot the bug by simplifying your example and saving/loading FAISS only at startup and shutdown. This will already save you a lot of headaches 🙂
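The startup-only pattern suggested above could be sketched roughly like this. This is a minimal, hedged sketch, assuming the Haystack 1.x API where `FAISSDocumentStore.save(index_path)` also writes a JSON config next to the index, and `load(index_path)` restores both the FAISS index and the `sql_url` recorded in that config; the file names are the ones from the code in this thread:

```python
import os

INDEX_PATH = "my_faiss_index.faiss"
# save() also writes this config file; load() reads the sql_url from it,
# so both sides of the document/embedding consistency check line up.
CONFIG_PATH = "my_faiss_index.json"


def have_saved_index(index_path=INDEX_PATH, config_path=CONFIG_PATH):
    """True only if both the FAISS index and its JSON config are on disk."""
    return os.path.exists(index_path) and os.path.exists(config_path)


def load_or_create_store():
    """Load the saved store once at startup, or create a fresh one."""
    # Imported here so the dependency-free helper above stays importable
    # without haystack installed.
    from haystack.document_stores import FAISSDocumentStore

    if have_saved_index():
        return FAISSDocumentStore.load(index_path=INDEX_PATH)
    return FAISSDocumentStore(
        sql_url="sqlite:///faiss_document_store.db",
        embedding_dim=128,
        faiss_index_factory_str="Flat",
    )
```

The point of the sketch: the store is loaded or created exactly once at startup, and `document_store.save(INDEX_PATH)` is called only at shutdown (or after a bulk write), rather than on every request.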
@pratikkotian04 did you have the chance to gather some extra context to help us debugging the issue?
@pratikkotian04 the function call that triggers the error message is _validate_index_sync(): https://github.com/deepset-ai/haystack/blob/325bc5466a2490ca22fedb54a4b647a3a10983c5/haystack/document_stores/faiss.py#L183
What you could do when you close your application is check whether this validation succeeds at that point. When you start your application again, debugging could help you find out why get_document_count() does not match get_embedding_count(). Usually that happens when the path to the SQL database is incorrect and the document count is therefore 0.
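One way to inspect the SQL side of that check independently of Haystack is to query the SQLite file directly. This is a debugging sketch only, assuming the Haystack 1.x default SQL schema (a `document` table with an `index` column holding the index name, default `"document"`); verify the table and column names against your own database before relying on it:

```python
import sqlite3


def sql_document_count(db_path="faiss_document_store.db", index="document"):
    """Count documents the way get_document_count() would, by querying
    the (assumed) Haystack 1.x 'document' table for the given index name."""
    con = sqlite3.connect(db_path)
    try:
        # "index" is a reserved word in SQL, hence the quoting
        (count,) = con.execute(
            'SELECT COUNT(*) FROM document WHERE "index" = ?', (index,)
        ).fetchone()
    finally:
        con.close()
    return count
```

If this count is nonzero but `document_store.get_embedding_count()` reports 0 after a restart, the FAISS index was not reloaded from the saved file; if it is 0, the store is pointing at a different (or freshly created) SQLite database than the one used to build the index.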
I am also facing this issue.
ValueError: The number of documents present in the SQL database (27) does not match the number of embeddings in FAISS (0). Make sure your FAISS configuration file correctly points to the same database that was used when creating the original index.

Hi @Parathantl, did you have a chance to look at issue #1019 for help with this problem? It explains how to save and reload a FAISSDocumentStore. Have a look and let us know if you need any more help 🙂
Hi @Parathantl, closing this issue now as it seems to be stale. Feel free to re-open it if you still have problems :)