slbayer comments

Results: 11 comments by slbayer

Remove `ghostscript` dependency

Random stranger here: the current version of `camelot` uses ghostscript, and the table detection script in `master` still uses `camelot`.

No module named 'sklearn.feature_selection.rfe' for HeadingLevelPrediction

The problem is that the model was built with an old version of `sklearn` that had this module. According to the warning I get after doing several horrid things with...

Add support to ignore files

Here's another use case: I have a package right now which contains a `resources` subdirectory that I want to have distributed with the package, and that subdirectory contains a Python...

hOCR renderer can generate zero-length words

Here's a modification of the fix proposed in #836 which addresses this issue:

```
def write_word(self) -> None:
    if len(self.working_text) > 0:
        txt = self._clean_text(self.working_text.strip())
        if len(txt) > 0:
            bold_and_italic_styles...
```

New hOCR renderer fails to escape or clean text properly

Further testing reveals that if the string in the document had been ``, the angle brackets would not have been escaped properly either.

New hOCR renderer fails to escape or clean text properly

This needs to be fixed in two places. In release 20221105, in `converter.py`, line 934 should be `enc(self.working_text.strip()),` instead of `self.working_text.strip(),` and line 913 should be `self.write(enc(text))` instead of `self.write(text)`
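
To illustrate why the escaping matters, here is a minimal sketch using the standard library's `html.escape`. The helper name `escape_for_hocr` is hypothetical and is not claimed to match how `enc` is implemented; it only shows the kind of transformation text needs before it is written into hOCR markup.

```python
from html import escape


def escape_for_hocr(text: str) -> str:
    """Hypothetical helper: strip surrounding whitespace, then escape
    the characters that are significant in HTML/XML markup."""
    return escape(text.strip())


# Text as it might appear in a PDF, containing markup-significant characters.
raw = "  <b>R&D</b>  "
print(escape_for_hocr(raw))
```

Without this step, the angle brackets and ampersand would be written verbatim and a downstream parser would read them as markup rather than as text.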

New hOCR renderer fails to escape or clean text properly

Actually, I've now discovered something very closely related: if the `stripcontrol` attribute of the `HOCRConverter` is `False`, at least the `lxml` XML parser will fail on zero bytes (`\x00`). And...
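
A rough sketch of one way to guard against this, assuming the goal is simply to drop the offending characters before the text is serialized. The name `strip_xml_illegal` is hypothetical, and the example parses with the standard library's `xml.etree.ElementTree` rather than `lxml`, so it is only an approximation of the setup described above.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 does not allow in documents
# (tab, newline, and carriage return are permitted and left alone).
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_xml_illegal(text: str) -> str:
    """Hypothetical helper: remove control characters that XML parsers reject."""
    return _XML_ILLEGAL.sub("", text)


# A word containing a zero byte, as might come out of a PDF text layer.
word = "zero\x00byte"
element = ET.fromstring("<span>" + strip_xml_illegal(word) + "</span>")
print(element.text)
```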

New hOCR renderer renders duplicate HTML IDs

E.g., in release 20221105, in `converter.py`, line 947, change `"\n"` to `"\n"` and at line 962 - 3, change ``` "\n" % (item.index, self.bbox_repr(item.bbox)) ``` to ``` "\n" % (ltpage.pageid,...

New hOCR renderer drops characters at font conversion points.

In commit [5114acd](https://github.com/pdfminer/pdfminer.six/commit/5114acdda61205009221ce4ebf2c68c144fc4ee5), the bug is at line 1005 in `converter.py`:

```
if (
    self.working_bbox[1] != item.bbox[1]
    or self.working_font != item.fontname
    or self.working_size != item.size
):
    self.write_word()
    self.working_bbox = item.bbox...
```

Avoiding overlapping citation extents in find_citations.py

One of the problems with pattern-matching approaches, as opposed to something more statistical, is that it's pretty much all or nothing when you find an edge case. I don't know...