
Full-text entry support for the knowledge base rather than being divided into several paragraphs

Open jiaqianjing opened this issue 1 year ago • 1 comments

Self Checks

  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I have many articles, none of which are very long (a little over 1,000 characters each). However, I would like each article to be retrieved as a whole rather than split into several fragments, because the articles are independent of one another. Current model contexts already support inputs at the million-character level, so there is no need to enforce such a strict limit (no more than 1,000 characters per segment). The limit is inconvenient and introduces errors: retrieved content may be missing pieces, or fragments from different articles may be spliced together, which is not acceptable.

2. Additional context or comments

No response

3. Can you help us with this feature?

  • [ ] I am interested in contributing to this feature.

jiaqianjing avatar Aug 27 '24 15:08 jiaqianjing

I agree

Sakura4036 avatar Aug 28 '24 01:08 Sakura4036

I thought you did not want Chinese text segmentation?

If the 1,000 characters are Chinese and you are using Qdrant as the vector DB, you could try the following:

  1. Use the official Qdrant image (qdrant/qdrant) rather than Dify's langgenius/qdrant image; the official Qdrant image does not perform Chinese text segmentation.

  2. Alternatively, update the full-text index yourself after creating the knowledge base in Dify (not recommended); a rough sketch of what that might look like is below.

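For the second option, the call would look roughly like the sketch below. It targets Qdrant's REST endpoint for creating a payload field index; the collection name and the payload field name ("page_content") are assumptions, not Dify's confirmed schema, so check your actual Qdrant collections and payload fields first:

# sketch only: <your_collection> and "page_content" are assumptions
curl -X PUT "http://localhost:6333/collections/<your_collection>/index" \
  -H "Content-Type: application/json" \
  -d '{
        "field_name": "page_content",
        "field_schema": {
          "type": "text",
          "tokenizer": "word",
          "lowercase": true
        }
      }'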

Weaxs avatar Aug 28 '24 02:08 Weaxs


First of all, thank you very much for your reply. I really don't want my Chinese documents to be segmented; I want the whole document to be retrieved and used as context, together with the prompt, as the input to the LLM. At the moment I think the easiest way to achieve this would be to relax the 1,000-character limit, say to 5,000. I haven't looked at the Dify implementation, so I don't know the reason for this limit.

jiaqianjing avatar Aug 30 '24 15:08 jiaqianjing


Oh, sorry, I misunderstood before.

If you want to change the maximum segment token length for self-hosted Dify, you can set INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH in .env, like:

INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH=5000
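For a typical self-hosted docker-compose deployment (assuming the standard docker/.env layout; adjust to your own setup), recreate the containers afterwards so the api and worker services pick up the new value:

docker compose up -d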

If you use Dify Cloud, hmm... maybe ask @takatost.

Weaxs avatar Sep 02 '24 03:09 Weaxs

@jiaqianjing are you using Dify Cloud or self-hosting?

Weaxs avatar Sep 02 '24 05:09 Weaxs

dify cloud @Weaxs

jiaqianjing avatar Sep 03 '24 09:09 jiaqianjing


INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH in Dify Cloud has already been raised to 4000 tokens,

but I guess there is probably no plan to raise it to 5000 tokens...


Weaxs avatar Sep 19 '24 08:09 Weaxs