
RAG stuck in embedding with status 'indexing' when using Segmenting in Q&A Format

Open jafrank88 opened this issue 1 year ago • 12 comments

Self Checks

  • [X] This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.6.8

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Take 1000 files and try to embed them using "Segmenting in Question & Answer format". There is no way to tell when the process is complete, but when checking the log, it appears that some files get stuck with the status 'indexing' rather than 'available'. See generally, https://discord.com/channels/1082486657678311454/1237858420351041576

✔️ Expected Behavior

All embedded documents to become 'available', or a way to restart the process to fix those stuck in the 'indexing' status.

❌ Actual Behavior

Some files remain stuck in the 'indexing' status after embedding instead of becoming 'available'.

jafrank88 avatar May 21 '24 23:05 jafrank88

To resolve the issue where some files remain stuck in the 'indexing' status when embedding 1000 files using "Segmenting in Question & Answer format" in Dify 0.6.8 (self-hosted with Docker), you should first ensure that the Celery worker service is properly configured and running. The Celery worker processes the task queue, which includes indexing files.

Here are the steps to check and ensure the Celery worker is properly configured:

  1. Check the Celery Worker Configuration: Ensure that the worker service is correctly defined in your docker-compose.yaml file. The configuration should look like this:

    worker:
      image: langgenius/dify-api:0.6.8
      restart: always
      environment:
        CONSOLE_WEB_URL: ''
        MODE: worker
        LOG_LEVEL: INFO
        SECRET_KEY: sk-9f73s3ljTXVcMT3Blb3ljTqtsKiGHXVcMT3BlbkFJLK7U
        DB_USERNAME: postgres
        DB_PASSWORD: difyai123456
        DB_HOST: db
        DB_PORT: 5432
        DB_DATABASE: dify
        REDIS_HOST: redis
        REDIS_PORT: 6379
        REDIS_USERNAME: ''
        REDIS_PASSWORD: difyai123456
        REDIS_DB: 0
        REDIS_USE_SSL: 'false'
        CELERY_BROKER_URL: redis://:difyai123456@redis:6379/1
        STORAGE_TYPE: local
        STORAGE_LOCAL_PATH: storage
        VECTOR_STORE: weaviate
        WEAVIATE_ENDPOINT: http://weaviate:8080
        WEAVIATE_API_KEY: WVF5YThaHlkYwhGUSmCRgsX3tD5ngdN8pkih
    
  2. Ensure the Worker Service is Running: Make sure the worker service is up and running. You can check the status of the Docker containers by running:

    docker-compose ps
    

    If the worker service is not running, start it with:

    docker-compose up -d worker
    
  3. Check Logs for Errors: Inspect the logs of the worker service to see if there are any errors that might be causing the files to remain stuck in the 'indexing' status. You can view the logs with:

    docker-compose logs worker
    
  4. Verify API Usage and Quotas: Ensure that the embedding model is properly configured and that you have not exceeded any quotas. You can use the Dify Knowledge API to check the status of document embeddings and manage your datasets. For example, to get the document embedding status, you can use the following API call:

    curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{batch}/indexing-status' \
    --header 'Authorization: Bearer {api_key}'
    

    Additionally, check for any error messages such as provider_quota_exceeded or provider_not_initialize which indicate issues with the embedding model configuration or quota limits.

  5. Check for Document Limits and Errors in the Indexing Task: The document_indexing_task function in api/tasks/document_indexing_task.py includes checks for document limits and handles errors that might occur during the indexing process. Ensure that your batch upload does not exceed the configured limits and that your subscription allows for the number of documents you are trying to index. If an error occurs, the document's indexing_status will be set to 'error' and the error message will be recorded.
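If you want to see which documents in a dataset have not finished indexing, or what error was recorded for them, a minimal sketch run from the Dify API environment (for example via a Flask shell inside the API container) could look like the following. It assumes the Document model exposes an error column alongside indexing_status, which is what the description above implies:

from extensions.ext_database import db
from models.dataset import Document

dataset_id = 'your-dataset-id'  # placeholder: replace with your dataset ID

# List documents in this dataset that have not reached 'completed',
# together with any recorded error message ('error' column assumed).
documents = db.session.query(Document).filter(
    Document.dataset_id == dataset_id,
    Document.indexing_status != 'completed'
).all()

for document in documents:
    print(document.id, document.indexing_status, getattr(document, 'error', None))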

By ensuring that the Celery worker is properly configured and running, verifying the API usage and quotas, and checking for document limits and errors in the indexing task, you should be able to resolve the issue with files remaining stuck in the 'indexing' status.

If the issue persists, you can manually trigger a retry for documents stuck in the 'indexing' status using the recover_document_indexing_task function. Here is the relevant task definition:

import logging
import time

import click
from celery import shared_task
from werkzeug.exceptions import NotFound

from core.indexing_runner import DocumentIsPausedException, IndexingRunner
from extensions.ext_database import db
from models.dataset import Document


@shared_task(queue='dataset')
def recover_document_indexing_task(dataset_id: str, document_id: str):
    """
    Async recover document
    :param dataset_id:
    :param document_id:

    Usage: recover_document_indexing_task.delay(dataset_id, document_id)
    """
    logging.info(click.style('Recover document: {}'.format(document_id), fg='green'))
    start_at = time.perf_counter()

    document = db.session.query(Document).filter(
        Document.id == document_id,
        Document.dataset_id == dataset_id
    ).first()

    if not document:
        raise NotFound('Document not found')

    try:
        indexing_runner = IndexingRunner()
        if document.indexing_status in ["waiting", "parsing", "cleaning"]:
            indexing_runner.run([document])
        elif document.indexing_status == "splitting":
            indexing_runner.run_in_splitting_status(document)
        elif document.indexing_status == "indexing":
            indexing_runner.run_in_indexing_status(document)
        end_at = time.perf_counter()
        logging.info(click.style('Processed document: {} latency: {}'.format(document.id, end_at - start_at), fg='green'))
    except DocumentIsPausedException as ex:
        logging.info(click.style(str(ex), fg='yellow'))
    except Exception:
        pass

To manually trigger a retry for documents stuck in the 'indexing' status, you can use the following command:

recover_document_indexing_task.delay(dataset_id, document_id)

Replace dataset_id with the ID of your dataset and document_id with the ID of the document you want to retry.
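If many documents are affected, a small loop can re-queue every stuck document in a dataset. The following is a minimal sketch, intended to be run from the Dify API environment (for example via a Flask shell inside the API container); the import path for the task is assumed to mirror the api/tasks/ layout referenced above, and the dataset ID is a placeholder:

from extensions.ext_database import db
from models.dataset import Document
# Assumed import path, mirroring the api/tasks/ layout referenced above.
from tasks.recover_document_indexing_task import recover_document_indexing_task

dataset_id = 'your-dataset-id'  # placeholder: replace with your dataset ID

# Find every document in the dataset still marked as 'indexing'.
stuck_documents = db.session.query(Document).filter(
    Document.dataset_id == dataset_id,
    Document.indexing_status == 'indexing'
).all()

# Re-queue each one on the 'dataset' Celery queue handled by the worker.
for document in stuck_documents:
    recover_document_indexing_task.delay(dataset_id, document.id)
    print(f'Re-queued document {document.id}')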

Additionally, the Dify platform provides an API endpoint to retry indexing tasks for documents. You can use the DocumentRetryApi to retry multiple documents by sending a POST request with the document_ids in the request body. Here is the implementation of that endpoint:

class DocumentRetryApi(DocumentResource):
    @setup_required
    @login_required
    @account_initialization_required
    def post(self, dataset_id):
        """retry document."""

        parser = reqparse.RequestParser()
        parser.add_argument('document_ids', type=list, required=True, nullable=False,
                            location='json')
        args = parser.parse_args()
        dataset_id = str(dataset_id)
        dataset = DatasetService.get_dataset(dataset_id)
        retry_documents = []
        if not dataset:
            raise NotFound('Dataset not found.')
        for document_id in args['document_ids']:
            try:
                document_id = str(document_id)

                document = DocumentService.get_document(dataset.id, document_id)

                # 404 if document not found
                if document is None:
                    raise NotFound("Document Not Exists.")

                # 403 if document is archived
                if DocumentService.check_archived(document):
                    raise ArchivedDocumentImmutableError()

                # 400 if document is completed
                if document.indexing_status == 'completed':
                    raise DocumentAlreadyFinishedError()
                retry_documents.append(document)
            except Exception as e:
                logging.error(f"Document {document_id} retry failed: {str(e)}")
                continue
        # retry document
        DocumentService.retry_document(dataset_id, retry_documents)

        return {'result': 'success'}, 204

You can call this API endpoint to retry indexing for multiple documents by sending a POST request to /datasets/<uuid:dataset_id>/retry with the document_ids in the request body.
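As a rough illustration (not an exact recipe), such a request could look like the sketch below, using the /datasets/<uuid:dataset_id>/retry path mentioned above. The base URL and bearer token are placeholders, and note that this endpoint belongs to the console API, so it expects console authentication rather than a dataset API key:

import requests

CONSOLE_API = 'http://localhost:5001/console/api'  # placeholder base URL
dataset_id = 'your-dataset-id'                     # placeholder

# Retry indexing for several documents at once; a 204 response means the
# retry tasks were accepted.
response = requests.post(
    f'{CONSOLE_API}/datasets/{dataset_id}/retry',
    json={'document_ids': ['doc-id-1', 'doc-id-2']},
    headers={'Authorization': 'Bearer <console-access-token>'}
)
print(response.status_code)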

About Dosu: This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

dosubot[bot] avatar May 21 '24 23:05 dosubot[bot]

Here is an example of the problem without Q-and-A embedding: [screenshot attached]

jafrank88 avatar May 26 '24 22:05 jafrank88

Hi - I am having a similar problem, not exactly the same as mentioned above, but when trying to index 499 OCR PDF docs. Dify attempts to index them but we get an error. When clicking on any of the docs with the error message, one can see that the process has been attempted, as there are example chunks from the document. It is not clear what the error is. Here are the settings being used: [screenshot attached]

GitDakky avatar May 28 '24 17:05 GitDakky

Can you provide the logs of the container?

crazywoola avatar May 29 '24 11:05 crazywoola

@crazywoola unfortunately not. My IT man is off and I don't hold server credentials. I am 99% sure that all docs are OCR searchable, but have you run into issues where users have uploaded docs that have no text? How does Dify deal with these? Just skip them?

GitDakky avatar May 29 '24 16:05 GitDakky

I posted my logs here and can provide more as needed - https://discord.com/channels/1082486657678311454/1237858420351041576

jafrank88 avatar May 29 '24 16:05 jafrank88

Now it's not liking docx - restarting the server. [screenshot attached]

GitDakky avatar May 29 '24 17:05 GitDakky

Just a theory! OpenAI invalidated the API key we were using and I set up a new key under a project. The API key uses an updated structure with an abbreviation of the project name, "difi", in the key. E.g.: sk-difi-XXXXXXXXXXetc...

We updated the key in the only place available, but I do recall needing to provide an API key several times for various OAI models during the setup. Is it possible that: a) the new key is not being used by the various OAI models, or b) the format of the new OAI key is conflicting with a validation check in the code?

[screenshot attached]

GitDakky avatar May 29 '24 17:05 GitDakky

I ran again using ada-002 and it indexed. Something is up with text-embedding-3-large - my guess is it is related to the new API key.

OK - I am outta here. Standing by for updates

GitDakky avatar May 29 '24 17:05 GitDakky


I am running everything locally.

jafrank88 avatar May 29 '24 18:05 jafrank88

I am not seeing any text chunks after embedding using Q&A segmentation. Is it possible that feature is broken? When I turn it off and embed files, I can see the chunks and can retrieve them. I am not sure if Q&A is the issue or the way it sends the content to the embedding API is the issue.

jafrank88 avatar May 29 '24 21:05 jafrank88

Any update on this?

GitDakky avatar Jun 10 '24 17:06 GitDakky

I am not seeing any text chunks after embedding using Q&A segmentation. Is it possible that feature is broken? When I turn it off and embed files, I can see the chunks and can retrieve them. I am not sure if Q&A is the issue or the way it sends the content to the embedding API is the issue.

I have the same problem: Q&A cannot be used, and the logs show no errors.

ouyang-yuxuan avatar Jun 19 '24 09:06 ouyang-yuxuan

@dosu Does splitting indicate that embedding is being performed? My document processing progress has stayed at 0, but index_status changed from waiting to splitting. I checked the logs of the worker in Docker and there is no error; it shows Start process document: d77c40d1-4120-411a-8a4d-6ea8520c0bca

supuwoerc avatar Jul 11 '24 09:07 supuwoerc

@dosu Does splitting indicate that embedding is being performed? My document processing progress has stayed at 0, but index_status changed from waiting to splitting. I checked the logs of the worker in Docker and there is no error; it shows Start process document: d77c40d1-4120-411a-8a4d-6ea8520c0bca

My problem was solved; it turned out that text-embedding-ada-002 processing was just very slow. I was about to go to Discord for help when, after pouring a glass of water, I found that the progress had started to change after all....🤣

supuwoerc avatar Jul 11 '24 09:07 supuwoerc