
Support Contextual Retrieval

Open Weaxs opened this issue 1 year ago • 3 comments

Self Checks

  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

https://www.anthropic.com/news/contextual-retrieval

<document> 
{{WHOLE_DOCUMENT}} 
</document> 
Here is the chunk we want to situate within the whole document 
<chunk> 
{{CHUNK_CONTENT}} 
</chunk> 
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else. 
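The template above can be filled programmatically before calling the summarize model. A minimal sketch, assuming a simple string-formatting helper (the names here are illustrative, not Dify's actual API):

```python
# Anthropic's contextual-retrieval prompt, with Python format placeholders
# substituted for the {{WHOLE_DOCUMENT}} / {{CHUNK_CONTENT}} variables above.
CONTEXTUAL_PROMPT = (
    "<document>\n{whole_document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document\n"
    "<chunk>\n{chunk_content}\n</chunk>\n"
    "Please give a short succinct context to situate this chunk within the "
    "overall document for the purposes of improving search retrieval of the "
    "chunk. Answer only with the succinct context and nothing else."
)

def build_contextual_prompt(whole_document: str, chunk_content: str) -> str:
    """Fill the prompt template with the full document and one chunk."""
    return CONTEXTUAL_PROMPT.format(
        whole_document=whole_document, chunk_content=chunk_content
    )
```

The resulting string is what gets sent to the LLM once per chunk.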


2. Additional context or comments

There are two problems:

  1. It should be possible to change the LLM used for explaining the chunks; the system LLM could be used initially.
  2. If the document content exceeds the maximum context size the LLM supports, maybe it should not be explained/summarized automatically?

3. Can you help us with this feature?

  • [X] I am interested in contributing to this feature.

Weaxs avatar Sep 25 '24 14:09 Weaxs

What is the current implementation progress of contextual retrieval? @Weaxs

We are really interested in this feature and would like to help implement it.

tobegit3hub avatar Oct 14 '24 13:10 tobegit3hub

@Weaxs Hi. Is there any update on this feature?

FreshLucas-git avatar Oct 17 '24 03:10 FreshLucas-git

@Weaxs Hi. Is there any update on this feature?

Sorry, I have not started this feature yet.

I will try to figure it out before November, and submit a PR for review maybe in December. I'll try as soon as possible. 🥺

Weaxs avatar Oct 17 '24 06:10 Weaxs

User chooses contextual retrieval

  1. The user chooses to enable [contextual-retrieval] in front-end step-two.
  2. Save the rule in DatasetProcessRule (table: dataset_process_rules).
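The saved rule could look roughly like the sketch below. This is a hypothetical shape for the row stored in dataset_process_rules; the field names are illustrative assumptions, not Dify's actual schema:

```python
# Hypothetical processing rule persisted when the user enables
# contextual retrieval in front-end step-two.
process_rule = {
    "mode": "custom",
    "rules": {
        # new flag proposed by this issue
        "contextual_retrieval": {"enabled": True},
        # existing-style segmentation settings
        "segmentation": {"separator": "\n\n", "max_tokens": 500},
    },
}
```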


Document upload and processing

  1. Add a ContextualRecursiveCharacterTextSplitter for contextual-retrieval text splitting (call _text_splitter_instance.split_text).
  2. Split the text via _text_splitter_instance with a size of 50~100 tokens.
  3. Assemble the prompt from the whole document and the chunk, and call summarize_model_instance to summarize the contextual message (this consumes system LLM tokens).
  4. Join the summary with the chunk to form a new chunk.
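The four steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: split_text stands in for _text_splitter_instance.split_text and summarize stands in for a call to summarize_model_instance:

```python
from typing import Callable

def contextualize_chunks(
    whole_document: str,
    split_text: Callable[[str], list],   # e.g. _text_splitter_instance.split_text
    summarize: Callable[[str], str],     # e.g. summarize_model_instance call
) -> list:
    """Split the document, summarize each chunk's context, and prepend it."""
    new_chunks = []
    for chunk in split_text(whole_document):          # steps 1-2: split
        prompt = (                                    # step 3: assemble prompt
            f"<document>\n{whole_document}\n</document>\n"
            f"<chunk>\n{chunk}\n</chunk>\n"
            "Please give a short succinct context to situate this chunk."
        )
        context = summarize(prompt)                   # consumes system LLM tokens
        new_chunks.append(f"{context}\n\n{chunk}")    # step 4: join summary + chunk
    return new_chunks
```

Each new chunk carries its generated context, so it is indexed and retrieved together with that context.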


Other problems

  • If the user edits and updates a chunk after upload and indexing, the chunk changes to exactly what the user entered, with no additional contextual processing.

  • If the combined tokens of the whole document and the chunk exceed the summarize model's max_tokens, maybe call a new API in front-end step-two to tell the user that contextual retrieval cannot be supported with the current system LLM.
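The second point amounts to a pre-flight check before enabling the feature. A minimal sketch, assuming a token-counting callable is available (both names below are illustrative):

```python
from typing import Callable

def fits_context_window(
    whole_document: str,
    chunk: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],  # e.g. the model provider's tokenizer
) -> bool:
    """Return False when document + chunk would exceed the summarize
    model's max_tokens, so the front end can warn the user."""
    return count_tokens(whole_document) + count_tokens(chunk) <= max_tokens
```

A front-end step-two endpoint could run this check against the largest chunk and disable the option when it returns False.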

Weaxs avatar Oct 29 '24 10:10 Weaxs

Hi, @Weaxs. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • The issue requests support for contextual retrieval to improve document chunk explanations.
  • You identified the need to change the language model for chunk explanation and handle large file contents.
  • @tobegit3hub and @FreshLucas-git showed interest in the feature.
  • You planned to work on this and submit a pull request by December, including a detailed implementation plan.

Next Steps:

  • Please let me know if this issue is still relevant to the latest version of the Dify repository by commenting here.
  • If there is no further activity, the issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot[bot] avatar Nov 29 '24 16:11 dosubot[bot]