Is the title failing to be read because something is wrong with the PDF?
Traceback (most recent call last):
  File "chat_paper.py", line 412, in <module>
  File "chat_paper.py", line 382, in main
    paper_list = [Paper(path=args.pdf_path)]
  File "/Users/apple/Documents/workspace/python/ChatPaper/get_paper_from_pdf.py", line 14, in __init__
    self.title = self.get_title()
  File "/Users/apple/Documents/workspace/python/ChatPaper/get_paper_from_pdf.py", line 121, in get_title
    font_size = block["lines"][0]["spans"][0]["size"] # get the font size of the first span on the first line
IndexError: list index out of range
I am trying to read an IEEE paper. If the developers have access, the title is Encryption-based Coordinated Volt/Var Control for Distribution Networks with Multi-Microgrids. Thank you very much!
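Judging from the traceback, the IndexError means that block["lines"], or the "spans" list of its first line, was empty for some block on the page. Below is a minimal sketch of a defensive lookup, assuming `block` is one entry of the "blocks" list returned by PyMuPDF's `page.get_text("dict")`. The helper name `first_span_size` is hypothetical and is not part of ChatPaper.

```python
from typing import Optional


def first_span_size(block: dict) -> Optional[float]:
    """Return the font size of the first span in a text block, or None.

    Hypothetical helper: it replaces the direct
    block["lines"][0]["spans"][0]["size"] lookup, which raises IndexError
    when a block has no lines or its first line has no spans.
    """
    lines = block.get("lines") or []  # image blocks carry no "lines" key
    if not lines:
        return None
    spans = lines[0].get("spans") or []
    if not spans:
        return None
    return spans[0].get("size")


# Made-up blocks that mimic the shapes that can trip the original lookup.
blocks = [
    {"type": 0, "lines": []},                 # text block with no lines
    {"type": 0, "lines": [{"spans": []}]},    # line with no spans
    {"type": 1},                              # image block
    {"type": 0, "lines": [{"spans": [{"size": 24.0, "text": "A Title"}]}]},
]
sizes = [s for s in map(first_span_size, blocks) if s is not None]
print(sizes)
```

With a guard like this, get_title could skip blocks for which the helper returns None instead of crashing. Whether that is enough to recover the right title for this particular PDF is not something the traceback alone can confirm.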
Is this a local PDF file? Could you paste the command you ran and a screenshot of the first page of the PDF?
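Yes, it is a local PDF. The command is below (I switched to another paper I had at hand):
python chat_paper.py --pdf_path ~/Super-Resolution_Perception_Assisted_Spatiotemporal_Graph_Deep_Learning_against_False_Data_Injection_Attacks_in_Smart_Grid.pdf
The screenshots are below:
In the first screenshot I tried removing the content at the top of the page, but the same error is still raised.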
Sorry, please give me about a day. I need to go back through this part of the code, since the logic here is fairly complicated.
Thank you for your work! Parsing PDFs really does seem troublesome.
Hi, let me offer a suggestion: maybe just finding the largest font on the first page is enough.
I actually tried this approach before, but PDF formatting is so inconsistent that in many papers the title is not even set in the largest font. For example, in some arXiv papers the arXiv stamp is set in a font no smaller than the title's. In ChatPaper's summaries the title carries little information, and users generally already know the title when they search, so we have not optimized this part further.
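For illustration, here is a minimal sketch of the "largest font on the first page" heuristic discussed above, with one mitigation for the arXiv case. It assumes `page_dict` has the structure returned by PyMuPDF's `page.get_text("dict")`. The function name `guess_title`, the `skip_prefixes` parameter, and the 0.1 pt size tolerance are all hypothetical choices, not ChatPaper's actual implementation.

```python
from typing import Iterable


def guess_title(page_dict: dict, skip_prefixes: Iterable[str] = ("arXiv:",)) -> str:
    """Join the text of all spans set in the largest font on the page.

    Spans whose text starts with one of skip_prefixes are ignored, as a
    rough way to drop stamps such as the arXiv identifier.
    """
    prefixes = tuple(skip_prefixes)
    candidates = []  # (font size, text) pairs in reading order
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):      # image blocks have no "lines"
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if not text or text.startswith(prefixes):
                    continue
                candidates.append((span.get("size", 0.0), text))
    if not candidates:
        return ""
    largest = max(size for size, _ in candidates)
    # Assumed tolerance: treat sizes within 0.1 pt of the maximum as equal.
    return " ".join(text for size, text in candidates if abs(size - largest) < 0.1)


# Made-up first page: the stamp is larger than the title, as described above.
page = {
    "blocks": [
        {"lines": [{"spans": [{"size": 20.0, "text": "arXiv:0000.00000v1 [cs.LG]"}]}]},
        {"lines": [
            {"spans": [{"size": 18.0, "text": "A Sample"}]},
            {"spans": [{"size": 18.0, "text": "Paper Title"}]},
        ]},
        {"lines": [{"spans": [{"size": 10.0, "text": "Body text."}]}]},
    ]
}
print(guess_title(page))
```

With a real file, `page_dict` would come from something like `fitz.open(path)[0].get_text("dict")`. The prefix filter only covers stamps that can be recognized by their text; other oversized elements such as journal banners or drop caps would still defeat the heuristic, which matches the maintainer's point that the largest font is not a reliable signal.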
Hello, I saw this was marked as completed, but the version I just cloned still cannot read the PDFs that failed before.
Is this the local version? Are you on Windows, or something else? If the paper's title string is too long, it may also fail to be read.
I'm on a Mac. That is possible, but a title of the same length was read successfully before.