Emmett McFarlane comments

Results: 16 comments of Emmett McFarlane

Unterminated string in JSON at position 16384 (line 1 column 16385)

It looks like the LLM hit the 16K token generation limit. This is a limitation of the language model, so trying other models with larger token limits can help. Using...
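A minimal sketch of how a caller might tell a cut-off model response from other malformed JSON, assuming the response arrives as a plain string. The helper name `parse_model_json` and the labels `"truncated"` and `"invalid"` are hypothetical, and Python's `json` module words its error differently from the message quoted above.

```python
import json
from typing import Any, Optional, Tuple


def parse_model_json(raw: str) -> Tuple[Optional[Any], Optional[str]]:
    """Parse model output; return (data, None) or (None, reason).

    Hypothetical helper for illustration only.
    """
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as err:
        # A string that never closes is the typical symptom of a response
        # that stopped at the generation limit.
        if err.msg.startswith("Unterminated string"):
            return None, "truncated"
        return None, "invalid"


# A response cut off in the middle of a string value.
data, reason = parse_model_json('{"text": "abc')
```

When the reason is `"truncated"`, retrying with a model that allows longer outputs, or asking for less output per request, is more likely to help than repairing the JSON by hand.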

Pagewise Markdown output

For those still looking for page-wise markdown extraction, [the library markitdown is based on](https://github.com/emcf/thepipe) has this feature.

output file/folder?

Hi @Fuckingnameless, it looks like this is a downstream failure resulting from #34. I replied there. P.S. The output folder is created in the directory the command...

Local scrape_file failing for some PDFs with out of memory

Hi @camrail, I've introduced some additional options `rescale: float`, `input_images: bool`, and `output_images: bool` into the `scraper.scrape_pdf` function to ease memory usage (this creates a tradeoff that may...
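As a rough illustration of why a `rescale` factor can ease memory usage, the sketch below estimates the size of an uncompressed bitmap for one rendered page. It is only back-of-the-envelope arithmetic, not thepipe's implementation: the function name `rendered_page_bytes`, the 72 DPI default, and the 3-channel RGB assumption are all hypothetical.

```python
def rendered_page_bytes(width_pt: float, height_pt: float,
                        rescale: float = 1.0, dpi: int = 72,
                        channels: int = 3) -> int:
    """Estimate bytes for one uncompressed page bitmap (assumed RGB).

    PDF page sizes are given in points (1/72 inch); both pixel dimensions
    are multiplied by the rescale factor.
    """
    width_px = round(width_pt / 72 * dpi * rescale)
    height_px = round(height_pt / 72 * dpi * rescale)
    return width_px * height_px * channels


# US Letter page (612 x 792 points) at full and half scale.
full = rendered_page_bytes(612, 792, rescale=1.0)
half = rendered_page_bytes(612, 792, rescale=0.5)
```

Because both dimensions shrink, halving the scale cuts the estimated bitmap size to a quarter, at the cost of less detail in the rendered page.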

Use Marker for PDF text extraction

Marker is great, but unfortunately, the idea of a heuristic pipeline with multiple fine-tuned specialized models ignores [the bitter lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html). I only see a future for PDF extraction using general...

Table extraction from PDF is advertised but completely absent

If you're still looking to accurately extract tables from PDFs, check out this [library](https://github.com/emcf/thepipe).