
Support Image path/url

Open vio-ez opened this issue 1 year ago • 2 comments

When I extract content from a URL, images come out only as the placeholder <!-- image -->. I want to save image information, such as the path or URL, in the .doctags or .json output.


If this is already supported, can anybody show me the code?

My current code looks like this:

import json
import logging
import time
from pathlib import Path
from docling.backend.html_backend import HTMLDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PipelineOptions
from docling.document_converter import DocumentConverter, HTMLFormatOption
from docling.pipeline.simple_pipeline import SimplePipeline
 
from docling_core.types.doc import ( 
    PictureItem, 
    TextItem,
)

_log = logging.getLogger(__name__)

logging.basicConfig(level=logging.INFO)


options = PipelineOptions()

s_pip = SimplePipeline(options)

doc_converter = DocumentConverter(
    allowed_formats=[ 
            # InputFormat.IMAGE, 
            InputFormat.HTML, 
        ], 
    format_options={
        InputFormat.HTML: HTMLFormatOption(
            pipeline_cls=SimplePipeline,  # class,not instance
            backend=HTMLDocumentBackend 
        )   
    }    
)       


url = 'https://huggingface.co/blog/aya-expanse'  

start_time = time.time()                      
conv_result = doc_converter.convert(url)     
end_time = time.time() - start_time


for item, level in conv_result.document.iterate_items():

    print('-- ', type(item), level)
    if isinstance(item, TextItem):
        print(item.text) 
    elif isinstance(item, PictureItem):
        print('-- ', item.label)  
        pass


_log.info(f"Document converted in {end_time:.2f} seconds.")

## Export results                        
# output_dir = Path("scratch")
# output_dir = '/Users/pc087/Documents/code/code24/03-pdf/docli/scratch'   
output_dir = Path("03-pdf/docli/scratch")        
output_dir.mkdir(parents=True, exist_ok=True)    
doc_filename = conv_result.input.file.stem          

print('-- output_dir : ', output_dir, doc_filename )   

# Export Deep Search document JSON format:
with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
    fp.write(json.dumps(conv_result.document.export_to_dict(), ensure_ascii=False))
 
# Export Markdown format:
with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
    fp.write(conv_result.document.export_to_markdown())

# Export Document Tags format:
with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
    fp.write(conv_result.document.export_to_document_tokens())

vio-ez avatar Nov 21 '24 12:11 vio-ez

Same question here

kime541200 avatar Nov 24 '24 13:11 kime541200

@kime541200 @ezscode We are actively working to support this case; a new release will bring this capability soon.

cau-git avatar Nov 25 '24 12:11 cau-git

> @kime541200 @ezscode We are actively working to support this case; a new release will bring this capability soon.

Is this supported now? How do I call it?

vio-ez avatar Dec 19 '24 06:12 vio-ez

@ezscode Yes, please look here: https://github.com/DS4SD/docling/blob/main/docs/examples/export_figures.py#L73

We have save_as_markdown and save_as_html with EMBEDDED and REFERENCED image modes:

    # Save markdown with embedded pictures
    md_filename = output_dir / f"{doc_filename}-with-images.md"
    conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)

    # Save markdown with externally referenced pictures
    md_filename = output_dir / f"{doc_filename}-with-image-refs.md"
    conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)

    # Save HTML with externally referenced pictures
    html_filename = output_dir / f"{doc_filename}-with-image-refs.html"
    conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED)

PeterStaar-IBM avatar Dec 19 '24 07:12 PeterStaar-IBM

@PeterStaar-IBM Can we use this function from the docling CLI?

yangjuncode avatar Dec 19 '24 10:12 yangjuncode

@yangjuncode Yes, use the --image-export-mode option:

taa@Munlochy docling % poetry run docling --help

 Usage: docling [OPTIONS] source

╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]                                                                                               │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from                                                       [docx|pptx|html|xml_pubmed|image|pdf|asciidoc|md|xlsx|xml_uspto]  Specify input formats to convert from. Defaults to all formats. [default: None]             │
│ --to                                                         [md|json|html|text|doctags]                                       Specify output formats. Defaults to Markdown. [default: None]                               │
│ --image-export-mode                                          [placeholder|embedded|referenced]                                 Image export mode for the document (only in case of JSON, Markdown or HTML). With           │
│                                                                                                                                `placeholder`, only the position of the image is marked in the output. In `embedded` mode,  │
│                                                                                                                                the image is embedded as base64 encoded string. In `referenced` mode, the image is exported │
│                                                                                                                                in PNG format and referenced from the main exported document.                               │
│                                                                                                                                [default: embedded]                                                                         │
│ --ocr                         --no-ocr                                                                                         If enabled, the bitmap content will be processed using OCR. [default: ocr]                  │
│ --force-ocr                   --no-force-ocr                                                                                   Replace any existing text with OCR generated text over the full content.                    │
│                                                                                                                                [default: no-force-ocr]                                                                     │
│ --ocr-engine                                                 [easyocr|tesseract_cli|tesseract|ocrmac|rapidocr]                 The OCR engine to use. [default: easyocr]                                                   │
│ --ocr-lang                                                   TEXT                                                              Provide a comma-separated list of languages used by the OCR engine. Note that each OCR      │
│                                                                                                                                engine has different values for the language names.                                         │
│                                                                                                                                [default: None]                                                                             │
│ --pdf-backend                                                [pypdfium2|dlparse_v1|dlparse_v2]                                 The PDF backend to use. [default: dlparse_v2]                                               │
│ --table-mode                                                 [fast|accurate]                                                   The mode to use in the table structure model. [default: fast]                               │
│ --artifacts-path                                             PATH                                                              If provided, the location of the model artifacts. [default: None]                           │
│ --abort-on-error              --no-abort-on-error                                                                              If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error]    │
│ --output                                                     PATH                                                              Output directory where results are saved. [default: .]                                      │
│ --verbose                 -v                                 INTEGER                                                           Set the verbosity level. -v for info logging, -vv for debug logging. [default: 0]           │
│ --debug-visualize-cells       --no-debug-visualize-cells                                                                       Enable debug output which visualizes the PDF cells [default: no-debug-visualize-cells]      │
│ --debug-visualize-ocr         --no-debug-visualize-ocr                                                                         Enable debug output which visualizes the OCR cells [default: no-debug-visualize-ocr]        │
│ --debug-visualize-layout      --no-debug-visualize-layout                                                                      Enable debug output which visualizes the layour clusters                                    │
│                                                                                                                                [default: no-debug-visualize-layout]                                                        │
│ --debug-visualize-tables      --no-debug-visualize-tables                                                                      Enable debug output which visualizes the table cells [default: no-debug-visualize-tables]   │
│ --version                                                                                                                      Show version information.                                                                   │
│ --document-timeout                                           FLOAT                                                             The timeout for processing each document, in seconds. [default: None]                       │
│ --num-threads                                                INTEGER                                                           Number of threads [default: 4]                                                              │
│ --device                                                     [auto|cpu|cuda|mps]                                               Accelerator device [default: auto]                                                          │
│ --help                                                                                                                         Show this message and exit.                                                                 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
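
To make the three --image-export-mode values concrete, here is a minimal sketch of what each one produces for a single image in Markdown output. It is an illustration under stated assumptions, not docling's implementation: the helper name render_image_markdown, the file name image_000.png, and the "Image" alt text are all hypothetical.

```python
import base64
from pathlib import Path

# Placeholder text as it appears in the Markdown output quoted in this issue.
PLACEHOLDER = "<!-- image -->"


def render_image_markdown(
    png_bytes: bytes,
    mode: str,
    out_dir: Path,
    name: str = "image_000.png",  # hypothetical file name
) -> str:
    """Hypothetical helper: return the Markdown snippet for one image."""
    if mode == "placeholder":
        # Only the position of the image is marked.
        return PLACEHOLDER
    if mode == "embedded":
        # The image bytes travel inside the document as a base64 data URI.
        encoded = base64.b64encode(png_bytes).decode("ascii")
        return f"![Image](data:image/png;base64,{encoded})"
    if mode == "referenced":
        # The image is written next to the document and linked by path.
        out_dir.mkdir(parents=True, exist_ok=True)
        target = out_dir / name
        target.write_bytes(png_bytes)
        return f"![Image]({target.as_posix()})"
    raise ValueError(f"unknown mode: {mode!r}")
```

The trade-off is the usual one: embedded output is a single self-contained file but grows with every image, while referenced output stays small and depends on the image files being kept alongside it.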

PeterStaar-IBM avatar Dec 19 '24 12:12 PeterStaar-IBM

I noticed that in html_backend.py the image itself is never added to the PictureItem. As a result, no image is exported or referenced when I call save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED). The handler only adds an empty picture:

    def handle_image(self, element, idx, doc):
        """Handles image tags (img)."""
        doc.add_picture(parent=self.parents[self.level], caption=None)
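
Until the HTML backend keeps the image source, one possible workaround is to collect the image URLs from the page separately. The sketch below uses only the Python standard library and does not touch docling at all; ImageSrcCollector and collect_image_urls are hypothetical names, and it assumes the page's HTML has already been fetched into a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Hypothetical helper: record the src of every <img> tag, in document order."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.sources.append(urljoin(self.base_url, src))


def collect_image_urls(html: str, base_url: str) -> list[str]:
    """Return absolute image URLs found in an HTML string."""
    parser = ImageSrcCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.sources
```

Because the URLs come back in document order, they could in principle be paired with the PictureItem entries from iterate_items(), assuming the backend emits one picture per img tag; that pairing is an assumption and is not verified here.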

wzdavid avatar Jan 04 '25 03:01 wzdavid