The error message sqlite3.OperationalError: too many SQL variables in your Python script means the program tried to execute a single SQLite statement with more bound parameters than SQLite allows (the SQLITE_MAX_VARIABLE_NUMBER limit: 999 in older SQLite builds, 32766 from version 3.32.0 on).
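For context, here is a minimal standalone sketch (plain sqlite3, not localGPT or Chroma code) of the usual workaround: batch the parameter list so no single statement exceeds the limit.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO docs (id) VALUES (?)", [(i,) for i in range(50000)])

ids = list(range(50000))
BATCH = 900  # stay well below the per-statement variable limit

rows = []
for start in range(0, len(ids), BATCH):
    batch = ids[start:start + BATCH]
    placeholders = ",".join("?" * len(batch))
    rows.extend(conn.execute(f"SELECT id FROM docs WHERE id IN ({placeholders})", batch).fetchall())

print(len(rows))  # 50000 -- same result, but no statement binds more than 900 variables

Chroma runs into the same limit internally when it is handed too many embeddings at once, which is why the fix further down batches the documents before passing them to Chroma.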
It looks like there is a maximum number of records that can be ingested at once. I attempted to import an almost 1 GB HTML-and-attachments export of a Confluence space with about 1000 pages. It appeared to run through the embedding process for the first 4 hours or so, but then failed to submit the embeddings:
File "/home/folder/miniconda3/envs/localGPT/lib/python3.10/site-packages/chromadb/db/mixins/embeddings_queue.py", line 145, in submit_embeddings results = cur.execute(sql, params).fetchall() sqlite3.OperationalError: too many SQL variables
From earlier on, while doing the embeddings:
2023-12-10 17:49:02,064 - INFO - ingest.py:154 - Loaded 588 documents from /home/folder/localGPT/SOURCE_DOCUMENTS
2023-12-10 17:49:02,064 - INFO - ingest.py:155 - Split into 148082 chunks of text
When trying with a smaller subset I get the following error:
File "/home/folder/miniconda3/envs/localGPT/lib/python3.10/site-packages/InstructorEmbedding/instructor.py", line 524, in encode if isinstance(sentences[0],list): IndexError: list index out of range
Do I need to clear something before running the ingestion script again on a smaller subset?
I think the following may help: replace main() in ingest.py with the version below.
def main(device_type):
    # Load documents and split them into chunks
    logging.info(f"Loading documents from {SOURCE_DIRECTORY}")
    documents = load_documents(SOURCE_DIRECTORY)
    text_documents, python_documents = split_documents(documents)
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    python_splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=880, chunk_overlap=200
    )
    texts = text_splitter.split_documents(text_documents)
    texts.extend(python_splitter.split_documents(python_documents))
    logging.info(f"Loaded {len(documents)} documents from {SOURCE_DIRECTORY}")
    logging.info(f"Split into {len(texts)} chunks of text")

    def split_list(input_list, chunk_size):
        # Yield successive chunk_size-sized slices of input_list
        for i in range(0, len(input_list), chunk_size):
            yield input_list[i:i + chunk_size]

    split_docs_chunked = split_list(texts, 5400)

    # Create embeddings
    embeddings = HuggingFaceInstructEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        model_kwargs={"device": device_type},
    )
    # Change the embedding type here if you are running into issues.
    # These are much smaller embeddings and will work for most applications.
    # If you use HuggingFaceEmbeddings, make sure to also use the same in the
    # run_localGPT.py file.
    # embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)

    # Old single-call ingestion, replaced by the batched loop below:
    # db = Chroma.from_documents(
    #     texts,
    #     embeddings,
    #     persist_directory=PERSIST_DIRECTORY,
    #     client_settings=CHROMA_SETTINGS,
    # )

    # Ingest in batches so each Chroma insert stays below SQLite's parameter limit
    for split_docs_chunk in split_docs_chunked:
        vectordb = Chroma.from_documents(
            documents=split_docs_chunk,
            embedding=embeddings,
            persist_directory=PERSIST_DIRECTORY,
            client_settings=CHROMA_SETTINGS,
        )
        vectordb.persist()
Note the new split_list function and the for loop that consumes its output, split_docs_chunked. The old single-call code (db = Chroma.from_documents(...)) is commented out in favor of the batched loop.
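For illustration, a quick standalone check of what split_list produces (not part of ingest.py):

def split_list(input_list, chunk_size):
    for i in range(0, len(input_list), chunk_size):
        yield input_list[i:i + chunk_size]

batches = list(split_list(list(range(12)), 5))
print([len(b) for b in batches])  # [5, 5, 2] -- e.g. 148082 chunks at 5400 per batch -> 28 Chroma calls

After swapping in this main(), re-run ingestion as before, e.g. python ingest.py --device_type cuda (assuming the stock click entry point of ingest.py is unchanged). Each batch is embedded and persisted separately, so no single insert trips the SQLite variable limit.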