The error message sqlite3.OperationalError: too many SQL variables in your Python script means the program tried to execute a single SQLite statement with more bound parameters than SQLite allows (the SQLITE_MAX_VARIABLE_NUMBER limit: 999 in older SQLite builds, 32766 from version 3.32.0 on).
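For context, here is a minimal standalone sketch (plain sqlite3, not localGPT or Chroma code) of the usual workaround: batch the parameter list so no single statement exceeds the limit.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO docs (id) VALUES (?)", [(i,) for i in range(50000)])

ids = list(range(50000))
BATCH = 900  # stay well below the per-statement variable limit

rows = []
for start in range(0, len(ids), BATCH):
    batch = ids[start:start + BATCH]
    placeholders = ",".join("?" * len(batch))
    rows.extend(conn.execute(f"SELECT id FROM docs WHERE id IN ({placeholders})", batch).fetchall())

print(len(rows))  # 50000 -- same result, but no statement binds more than 900 variables

Chroma runs into the same limit internally when it is handed too many embeddings at once, which is why the fix further down batches the documents before passing them to Chroma.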
It looks like there is a maximum number of records that can be ingested at once. I attempted to import an almost 1 GB HTML-and-attachments export of a Confluence space with about 1000 pages. It appeared to run through the embedding process for the first 4 hours or so, but then failed to submit the embeddings:
File "/home/folder/miniconda3/envs/localGPT/lib/python3.10/site-packages/chromadb/db/mixins/embeddings_queue.py", line 145, in submit_embeddings results = cur.execute(sql, params).fetchall() sqlite3.OperationalError: too many SQL variables
From earlier on, while doing the embeddings:
2023-12-10 17:49:02,064 - INFO - ingest.py:154 - Loaded 588 documents from /home/folder/localGPT/SOURCE_DOCUMENTS
2023-12-10 17:49:02,064 - INFO - ingest.py:155 - Split into 148082 chunks of text
When trying with a smaller subset I get the following error:
File "/home/folder/miniconda3/envs/localGPT/lib/python3.10/site-packages/InstructorEmbedding/instructor.py", line 524, in encode if isinstance(sentences[0],list): IndexError: list index out of range
Do I need to clear something before running the ingestion script again on a smaller subset?
I think the following may help: replace main() in ingest.py with the version below.
def main(device_type):
    # Load documents and split them into chunks
    logging.info(f"Loading documents from {SOURCE_DIRECTORY}")
    documents = load_documents(SOURCE_DIRECTORY)
    text_documents, python_documents = split_documents(documents)
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    python_splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=880, chunk_overlap=200
    )
    texts = text_splitter.split_documents(text_documents)
    texts.extend(python_splitter.split_documents(python_documents))
    logging.info(f"Loaded {len(documents)} documents from {SOURCE_DIRECTORY}")
    logging.info(f"Split into {len(texts)} chunks of text")

    def split_list(input_list, chunk_size):
        # Yield successive chunk_size-sized slices of input_list
        for i in range(0, len(input_list), chunk_size):
            yield input_list[i:i + chunk_size]

    split_docs_chunked = split_list(texts, 5400)

    # Create embeddings
    embeddings = HuggingFaceInstructEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        model_kwargs={"device": device_type},
    )
    # Change the embedding type here if you are running into issues.
    # These are much smaller embeddings and will work for most applications.
    # If you use HuggingFaceEmbeddings, make sure to also use the same in the
    # run_localGPT.py file.
    # embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)

    # Old single-call ingestion, replaced by the batched loop below:
    # db = Chroma.from_documents(
    #     texts,
    #     embeddings,
    #     persist_directory=PERSIST_DIRECTORY,
    #     client_settings=CHROMA_SETTINGS,
    # )

    # Ingest in batches so each Chroma insert stays below SQLite's parameter limit
    for split_docs_chunk in split_docs_chunked:
        vectordb = Chroma.from_documents(
            documents=split_docs_chunk,
            embedding=embeddings,
            persist_directory=PERSIST_DIRECTORY,
            client_settings=CHROMA_SETTINGS,
        )
        vectordb.persist()
Note the new split_list function and the for loop that consumes its output, split_docs_chunked. The old single-call code (db = Chroma.from_documents(...)) is commented out in favor of the batched loop.
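For illustration, a quick standalone check of what split_list produces (not part of ingest.py):

def split_list(input_list, chunk_size):
    for i in range(0, len(input_list), chunk_size):
        yield input_list[i:i + chunk_size]

batches = list(split_list(list(range(12)), 5))
print([len(b) for b in batches])  # [5, 5, 2] -- e.g. 148082 chunks at 5400 per batch -> 28 Chroma calls

After swapping in this main(), re-run ingestion as before, e.g. python ingest.py --device_type cuda (assuming the stock click entry point of ingest.py is unchanged). Each batch is embedded and persisted separately, so no single insert trips the SQLite variable limit.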