Delete all URLs and email addresses option does not work when uploading Markdown documents
Self Checks
- [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
- [X] Please do not modify this template :) and fill in all the required fields.
Dify version
0.6.16
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
- Check the "Delete all URLs and email addresses" option when uploading a Markdown file
- Wait for the file to be processed and view the text
✔️ Expected Behavior
When the uploaded file is in txt format and the same preprocessing rules are used, the URL links are preserved.
❌ Actual Behavior
When the uploaded file is in md format and the same preprocessing rules are used, all URL links are removed.
@Menghuan1918 Hi there! I'm a bot that can help you solve bugs, answer questions, and become a contributor. I'm here to assist you while waiting for a human maintainer.
The "Delete all URLs and email addresses" option is not working correctly for Markdown files in Dify version 0.6.16, possibly because the data-cleaning preprocessing rules, including the one that deletes URLs and email addresses, are not fully compatible with or correctly implemented for Markdown files. The MarkdownExtractor class in the Dify codebase has dedicated methods for removing hyperlinks and images from Markdown content, but none that explicitly handles email addresses:
def remove_hyperlinks(self, content: str) -> str:
    """Remove hyperlinks from markdown content."""
    # Matches [text](target) and keeps only the link text.
    pattern = r"\[(.*?)\]\((.*?)\)"
    content = re.sub(pattern, r"\1", content)
    return content
The remove_hyperlinks method only strips hyperlinks written in Markdown link syntax; it does not touch email addresses. This suggests that the current preprocessing rules may not recognize email addresses inside Markdown content, which could lead to the observed issue [1][2].
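To make the behavior of that regex concrete, here is a minimal standalone sketch. It reuses the pattern from the excerpt, but the function name `strip_markdown_links`, the constant `MD_LINK`, and the sample text are illustrative assumptions and not part of Dify. It only shows what the pattern itself does, not when or whether Dify calls it:

```python
import re

# Same pattern as in the remove_hyperlinks excerpt; the names below are
# hypothetical and exist only for this illustration.
MD_LINK = re.compile(r"\[(.*?)\]\((.*?)\)")


def strip_markdown_links(content: str) -> str:
    """Replace every [text](target) with just its text."""
    return MD_LINK.sub(r"\1", content)


sample = (
    "See [the docs](https://example.com/docs) or https://example.com/raw, "
    "or mail [support](mailto:help@example.com) / help@example.com."
)
result = strip_markdown_links(sample)
print(result)
# See the docs or https://example.com/raw, or mail support / help@example.com.
```

The target of each Markdown link is dropped, while the bare URL and the bare email address pass through unchanged. If this step runs regardless of the cleaning option, that would be consistent with links disappearing from .md uploads, though that is an assumption and not something confirmed in this thread.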
Same issue with .doc files. As a workaround I convert the documents to PDF, but that is not a good solution.