chore: allow the doc extractor node to also extract text by file extension

Open hjlarry opened this issue 1 year ago • 4 comments

Checklist:

[!IMPORTANT]
Please review the checklist below before submitting your pull request.

  • [ ] Please open an issue before creating a PR or link to an existing issue
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

Description

Currently, extracting a document by its MIME type is not very reliable. For example, a Markdown file always raises an error:

[image]

Extracting by file name (extension) when the MIME type is not recognized improves the success rate.
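A rough sketch of that fallback, not the actual node code: the lookup tables, the extractor names, and the `pick_extractor` helper are all hypothetical and only illustrate "try the MIME type first, then fall back to the file extension".

```python
import os

# Hypothetical lookup tables for illustration; the real node may support
# different formats and use different names.
EXTRACTORS_BY_MIME = {
    "text/plain": "plain_text",
    "text/markdown": "plain_text",
    "application/pdf": "pdf",
}
EXTRACTORS_BY_EXTENSION = {
    ".txt": "plain_text",
    ".md": "plain_text",
    ".markdown": "plain_text",
    ".pdf": "pdf",
}


def pick_extractor(mime_type, filename):
    """Return an extractor name, preferring the MIME type and
    falling back to the file extension when the MIME type is unknown."""
    extractor = EXTRACTORS_BY_MIME.get((mime_type or "").lower())
    if extractor is not None:
        return extractor

    # MIME type not recognized (e.g. application/octet-stream):
    # fall back to the extension taken from the file name.
    _, ext = os.path.splitext(filename or "")
    extractor = EXTRACTORS_BY_EXTENSION.get(ext.lower())
    if extractor is None:
        raise ValueError(
            f"Unsupported file: mime_type={mime_type!r}, filename={filename!r}"
        )
    return extractor
```

With this, a Markdown file reported as `application/octet-stream` would still be routed to the text extractor as long as its name ends in `.md`.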

Type of Change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update, included: Dify Document
  • [x] Improvement, including but not limited to code refactoring, performance optimization, and UI/UX improvement
  • [ ] Dependency upgrade

Testing Instructions

Tested locally.

  • [ ] Test A
  • [ ] Test B

hjlarry avatar Oct 21 '24 07:10 hjlarry

Do you mind providing a file that will cause the error? According to my attempt, the .md file can be correctly identified as text/markdown.

laipz8200 avatar Oct 22 '24 18:10 laipz8200

> Do you mind providing a file that will cause the error? According to my attempt, the .md file can be correctly identified as text/markdown.

I tried these files, and the MIME type is always application/octet-stream: 2024-09-27.md, Markdown 101 (Example).md, Just chat.zip

I think this behavior depends on the browser. I tested with Edge on Windows 11.

hjlarry avatar Oct 23 '24 01:10 hjlarry

Someone else encountered this issue: https://github.com/langgenius/dify/issues/9757

I think it is more reasonable to extract by MIME type for a remote_file and by extension for a local file.

I tried the same file in Firefox and it works, so the MIME type of a local file depends on the browser.
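To make the proposed split concrete, here is a minimal sketch. The `FileSource` enum and the `detection_key` helper are made up for illustration and are not names from the codebase; the only idea taken from the discussion is "MIME type for remote files, extension for local files".

```python
import os
from enum import Enum


class FileSource(Enum):
    # Hypothetical enum; the real code may model the file origin differently.
    REMOTE_FILE = "remote_file"
    LOCAL_FILE = "local_file"


def detection_key(source, mime_type, filename):
    """Return which attribute should drive extraction and its value.

    Remote files: trust the MIME type.
    Local (browser-uploaded) files: the MIME type depends on the browser,
    so use the file extension instead.
    """
    if source is FileSource.REMOTE_FILE:
        return ("mime", (mime_type or "").lower())
    ext = os.path.splitext(filename or "")[1].lower()
    return ("extension", ext)
```

For a local upload that the browser reported as `application/octet-stream`, this would ignore the MIME type and look only at the `.md` extension.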

hjlarry avatar Oct 24 '24 05:10 hjlarry

lgtm

yaoice avatar Oct 24 '24 06:10 yaoice