chore: make doc extractor node also extract text by file extension

Open hjlarry opened this issue 1 year ago • 4 comments

Checklist:

[!IMPORTANT]
Please review the checklist below before submitting your pull request.

[ ] Please open an issue before creating a PR or link to an existing issue
[x] I have performed a self-review of my own code
[x] I have commented my code, particularly in hard-to-understand areas
[x] I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

Currently, extracting documents by MIME type is not very reliable. For example, a Markdown file always raises an error.

Falling back to the filename extension when the MIME type is not recognized can therefore improve the success rate.
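
A minimal sketch of this fallback, not the actual Dify implementation: the two lookup tables and the function name `resolve_extractor` are hypothetical and chosen only for illustration. It tries the MIME type first and falls back to the file extension when the MIME type is missing or unrecognized.

```python
import os
from typing import Optional

# Hypothetical lookup tables for illustration; the real node may support other types.
EXTRACTOR_BY_MIME = {
    "text/plain": "plain_text",
    "text/markdown": "plain_text",
    "application/pdf": "pdf",
}
EXTRACTOR_BY_EXTENSION = {
    ".txt": "plain_text",
    ".md": "plain_text",
    ".markdown": "plain_text",
    ".pdf": "pdf",
}


def resolve_extractor(mime_type: Optional[str], filename: Optional[str]) -> str:
    """Return an extractor name, preferring the MIME type over the extension."""
    if mime_type:
        extractor = EXTRACTOR_BY_MIME.get(mime_type.lower())
        if extractor is not None:
            return extractor
    if filename:
        # Fall back to the extension, e.g. ".md" for "2024-09-27.md".
        extension = os.path.splitext(filename)[1].lower()
        extractor = EXTRACTOR_BY_EXTENSION.get(extension)
        if extractor is not None:
            return extractor
    raise ValueError(f"Unsupported file: mime_type={mime_type!r}, filename={filename!r}")
```

With these assumed tables, a Markdown file reported as `application/octet-stream` still resolves through its `.md` extension, while a file that matches neither table raises `ValueError`.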

[ ] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] This change requires a documentation update, included: Dify Document
[x] Improvement, including but not limited to code refactoring, performance optimization, and UI/UX improvement
[ ] Dependency upgrade

Tested locally.

Oct 21 '24 07:10 hjlarry

Do you mind providing a file that will cause the error? According to my attempt, the .md file can be correctly identified as text/markdown.

Oct 22 '24 18:10 laipz8200

> Do you mind providing a file that will cause the error? According to my attempt, the .md file can be correctly identified as text/markdown.

I tried these files, and the MIME type is always application/octet-stream: 2024-09-27.md, Markdown 101 (Example).md, Just chat.zip

I think this behavior depends on the browser. I tested with Edge on Windows 11.

Oct 23 '24 01:10 hjlarry

Someone else has encountered this issue: https://github.com/langgenius/dify/issues/9757

I think it is more reasonable to extract by MIME type for a remote_file and by extension for a local file.

I tried the same file with Firefox and it works, so the MIME type of a local file depends on the browser.
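
The remote/local split proposed here could be sketched roughly as follows. This is only an illustration under stated assumptions: `FileSource` and `type_hint_for` are hypothetical names, not identifiers from the Dify codebase.

```python
import os
from enum import Enum
from typing import Optional


class FileSource(Enum):
    # Hypothetical labels for where the file came from.
    REMOTE_URL = "remote_url"
    LOCAL_UPLOAD = "local_upload"


def type_hint_for(source: FileSource, mime_type: Optional[str], filename: str) -> str:
    """Pick the signal to trust: MIME type for remote files, extension otherwise."""
    if source is FileSource.REMOTE_URL and mime_type:
        # Remote file: use the reported MIME type, dropping parameters such as charset.
        return mime_type.split(";", 1)[0].strip().lower()
    # Local upload (or no MIME type at all): the browser-reported MIME type differed
    # between Edge and Firefox in this thread, so use the file extension instead.
    return os.path.splitext(filename)[1].lower()
```

The returned hint is either a MIME type such as `text/markdown` or an extension such as `.md`, so a caller would need one lookup table for each kind of key.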

Oct 24 '24 05:10 hjlarry

lgtm

Oct 24 '24 06:10 yaoice