chore: make doc extractor node also extract text by file extension
Checklist:
[!IMPORTANT]
Please review the checklist below before submitting your pull request.
- [ ] Please open an issue before creating a PR or link to an existing issue
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I ran `dev/reformat` (backend) and `cd web && npx lint-staged` (frontend) to appease the lint gods
Description
Currently, extracting a document by its MIME type is not very reliable. For example, a Markdown file always raises an error.
Falling back to the file extension when the MIME type is not recognized improves the extraction success rate.
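A minimal sketch of the fallback order described above, not the actual diff. The function name `extract_text`, the helper `_extract_plain_text`, and both lookup tables are hypothetical and only illustrate the idea: try the MIME type first, then fall back to the file extension.

```python
import os
from typing import Callable, Dict, Optional


def _extract_plain_text(content: bytes) -> str:
    # Hypothetical extractor; the real node has its own per-format helpers.
    return content.decode("utf-8", errors="ignore")


# Hypothetical lookup tables, kept tiny for illustration.
_BY_MIME_TYPE: Dict[str, Callable[[bytes], str]] = {
    "text/plain": _extract_plain_text,
    "text/markdown": _extract_plain_text,
}
_BY_EXTENSION: Dict[str, Callable[[bytes], str]] = {
    ".txt": _extract_plain_text,
    ".md": _extract_plain_text,
    ".markdown": _extract_plain_text,
}


def extract_text(content: bytes, mime_type: Optional[str], filename: Optional[str]) -> str:
    # 1. Prefer the MIME type when it maps to a known extractor.
    extractor = _BY_MIME_TYPE.get(mime_type or "")
    # 2. Otherwise fall back to the (lowercased) file extension.
    if extractor is None and filename:
        extension = os.path.splitext(filename)[1].lower()
        extractor = _BY_EXTENSION.get(extension)
    # 3. Give up only when neither lookup matches.
    if extractor is None:
        raise ValueError(f"Unsupported file: mime_type={mime_type!r}, filename={filename!r}")
    return extractor(content)
```

With this order, a `.md` upload reported as `application/octet-stream` is still handled, while a file whose MIME type is already recognized behaves as before.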
Type of Change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update, included: Dify Document
- [x] Improvement, including but not limited to code refactoring, performance optimization, and UI/UX improvement
- [ ] Dependency upgrade
Testing Instructions
Tested locally.
- [ ] Test A
- [ ] Test B
Do you mind providing a file that causes the error? In my testing, the .md file is correctly identified as text/markdown.
I tried these files; the MIME type is always application/octet-stream:
2024-09-27.md
Markdown 101 (Example).md
Just chat.zip
I think this behavior depends on the browser. I tested with Edge on Windows 11.
Someone else encountered this issue: https://github.com/langgenius/dify/issues/9757
I think it is more reasonable to extract by MIME type for remote_file and by extension for local files.
I tried the same file with Firefox and it works, so the MIME type of a local file depends on the browser.
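A rough sketch of that split, under stated assumptions: `TransferMethod` and `lookup_key` are hypothetical names, standing in for however the node distinguishes a remote file from a local upload. The function only decides which attribute should drive extractor selection.

```python
import os
from enum import Enum
from typing import Optional, Tuple


class TransferMethod(Enum):
    # Hypothetical enum; the real code base may model this differently.
    REMOTE_URL = "remote_url"
    LOCAL_FILE = "local_file"


def lookup_key(
    method: TransferMethod, mime_type: Optional[str], filename: Optional[str]
) -> Tuple[str, str]:
    """Return ("mime_type", value) or ("extension", value) for extractor lookup."""
    if method is TransferMethod.REMOTE_URL:
        # Remote files: use the MIME type.
        if not mime_type:
            raise ValueError("remote file has no MIME type")
        return ("mime_type", mime_type)
    # Local uploads: the browser-reported MIME type varies, so use the extension.
    extension = os.path.splitext(filename or "")[1].lower()
    if not extension:
        raise ValueError("local file has no extension")
    return ("extension", extension)
```

For example, `lookup_key(TransferMethod.LOCAL_FILE, "application/octet-stream", "2024-09-27.md")` selects by `.md` and ignores the browser-reported type.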
lgtm