Hyperlinks are lost after xlsx parsing
Self Checks
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
- [X] Please do not modify this template :) and fill in all the required fields.
1. Is this request related to a challenge you're experiencing? Tell me about your story.
The original text I had was like this.
But what I got was like this.
2. Additional context or comments
I would like hyperlinks to be preserved after parsing and rendered as Markdown links.
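To make the request concrete, here is a minimal sketch of the behavior I have in mind. It assumes the workbook is read with openpyxl and that each cell's link is available as `cell.hyperlink.target`; the function names `to_markdown_link` and `sheet_to_markdown_rows` are hypothetical and are not part of the existing extractor.

```python
def to_markdown_link(text, url):
    """Render a cell's display text and its hyperlink target as a Markdown link."""
    label = "" if text is None else str(text)
    if not url:
        return label  # no hyperlink on this cell: keep the plain text
    # Escape square brackets so the label cannot break the link syntax.
    label = label.replace("[", "\\[").replace("]", "\\]")
    return f"[{label}]({url})"


def sheet_to_markdown_rows(path):
    """Yield one Markdown table row per worksheet row, keeping hyperlinks."""
    # Imported here so the helper above stays usable without openpyxl installed.
    from openpyxl import load_workbook

    # Assumption: the workbook is loaded in normal (not read-only) mode,
    # because read-only cells do not carry hyperlink objects.
    workbook = load_workbook(path)
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows():
            cells = [
                to_markdown_link(c.value, c.hyperlink.target if c.hyperlink else None)
                for c in row
            ]
            yield "| " + " | ".join(cells) + " |"
```

With something like this, a cell showing "Dify" that links to a URL would come out as `[Dify](...)` instead of the bare text.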
3. Can you help us with this feature?
- [X] I am interested in contributing to this feature.
@ic-xu 🌟
To retain and present hyperlinks in markdown format after parsing an XLSX file, you need to ensure that the remove_hyperlinks flag is set to False when initializing the MarkdownExtractor. This will prevent the remove_hyperlinks method from stripping out the hyperlinks from the content.
Here is an example of how you can initialize the MarkdownExtractor to retain hyperlinks:
```python
extractor = MarkdownExtractor(file_path="path_to_your_file.xlsx", remove_hyperlinks=False)
documents = extractor.extract()
```
By setting remove_hyperlinks=False, the hyperlinks in the markdown content will be preserved [1].
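For context, the following is a rough sketch of what a flag like this typically controls. It is an illustration only and is not copied from the extractor's source: `LINK_PATTERN`, `strip_hyperlinks`, and `clean` are hypothetical names, and the regular expression is an assumption about how inline Markdown links might be matched.

```python
import re

# Matches inline Markdown links of the form [label](target).
LINK_PATTERN = re.compile(r"\[(.*?)\]\((.*?)\)")


def strip_hyperlinks(content):
    """Replace each [label](target) with just its label, dropping the URL."""
    return LINK_PATTERN.sub(r"\1", content)


def clean(content, remove_hyperlinks=False):
    """Return content unchanged unless remove_hyperlinks is requested."""
    return strip_hyperlinks(content) if remove_hyperlinks else content
```

Under these assumptions, `clean(text, remove_hyperlinks=False)` leaves the links intact, while `remove_hyperlinks=True` keeps only the link labels. Note that this only helps if the links are already present in Markdown form by the time the content reaches this step.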