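Some of the formula regions in the dataset are not localized accurately. I consider these to be dirty data; is there any way to filter them out?
The two images are screenshots of the corresponding regions in ann.jpg and ari.jpg for one of the images in the sample.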
Hi, this kind of coordinate shift occurs naturally in PDFMiner and can hardly be avoided.
Is there a way to filter by font or cid symbol?
Have you solved the problem?
Up! Am I right in saying that this renders the full dataset completely useless? Or does it affect only a small part of it?
@VikingKang @ltss1988 @Bonjour123 This problem affects only the equations and is hard to filter. In the rendering these tokens appear to cover other tokens, but they do not affect the other tokens in actual use.
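For anyone who still wants to try the font / cid filtering asked about above, here is a minimal sketch. It assumes each annotation line is tab-separated as `token, x0, y0, x1, y1, R, G, B, font, label`; that layout, the `Token` type, the helper names, and the list of math font prefixes are all assumptions for illustration and should be checked against the actual files. The only thing taken as given is that PDFMiner writes glyphs it cannot map to Unicode as `(cid:N)`.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class Token(NamedTuple):
    """One annotated token (hypothetical layout, see the note above)."""
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1)
    font: str
    label: str


# PDFMiner emits "(cid:N)" for glyphs it cannot map to Unicode.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def parse_line(line: str) -> Token:
    """Parse one tab-separated line; the 10-field layout is an assumption."""
    fields = line.rstrip("\n").split("\t")
    text, x0, y0, x1, y1, _r, _g, _b, font, label = fields
    return Token(text, (int(x0), int(y0), int(x1), int(y1)), font, label)


def is_suspect(
    token: Token,
    math_font_prefixes: Tuple[str, ...] = ("CMMI", "CMSY", "CMEX"),
    drop_labels: Tuple[str, ...] = ("equation",),
) -> bool:
    """Flag tokens that are likely to carry shifted equation boxes."""
    # Subset fonts look like "ABCDEF+CMMI10"; keep only the base name.
    base_font = token.font.split("+")[-1]
    return (
        CID_PATTERN.search(token.text) is not None
        or base_font.startswith(tuple(math_font_prefixes))
        or token.label in drop_labels
    )


def filter_tokens(tokens: Iterable[Token]) -> List[Token]:
    """Keep only the tokens that are not flagged as suspect."""
    return [t for t in tokens if not is_suspect(t)]
```

This is deliberately aggressive: it drops every token labelled as an equation and every token set in a Computer Modern math font, so inline math inside paragraphs goes too. Loosening `math_font_prefixes` or `drop_labels` trades cleanliness for coverage.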
Well, if so, can you help me understand why I need to apply a seemingly arbitrary horizontal dilation to make the bounding boxes match? Here are some randomly picked examples:
- 1.tar_1401.0001.gz_infoingames_without_metric_arxiv_0
- 1.tar_1401.0001.gz_infoingames_without_metric_arxiv_24
- 1.tar_1401.0007.gz_hhmerge_3
I first suspected a bug in my visualisation code, but I get the same problem using both PIL (red) and cv2 (blue).
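One thing worth ruling out, sketched below: if the stretch needed is the same on every page, the boxes may be stored in a reference coordinate space rather than in image pixels. That is only a hypothesis, not something confirmed in this thread; the `ref_size` default of 1000 x 1000 and the `scale_box` helper are assumptions for illustration.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def scale_box(
    box: Box,
    image_size: Tuple[int, int],
    ref_size: Tuple[int, int] = (1000, 1000),  # assumed reference space
) -> Box:
    """Map a box from the reference space to pixel coordinates.

    image_size and ref_size are both (width, height).
    """
    x0, y0, x1, y1 = box
    sx = image_size[0] / ref_size[0]  # horizontal scale factor
    sy = image_size[1] / ref_size[1]  # vertical scale factor
    return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))
```

The returned tuple can be drawn with PIL's `ImageDraw.rectangle` or split into two corner points for `cv2.rectangle`. If the boxes line up after this rescaling, the mismatch comes from the coordinate convention rather than from the drawing library.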
@liminghao1630 @VikingKang @ltss1988 Don't you have the same problem?
#19 @Bonjour123