Support image path/URL
When I extract content from a URL, each image yields only the placeholder <!-- image -->.
I would like to save image information such as the path or URL in the .doctags or .json output.
If this is already supported, could somebody show me the code?
My code currently looks like this:
import json
import logging
import time
from pathlib import Path
from docling.backend.html_backend import HTMLDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.pipeline.simple_pipeline import PipelineOptions
from docling.document_converter import DocumentConverter, SimplePipeline, HTMLFormatOption
from docling_core.types.doc import (
    PictureItem,
    TextItem,
)
_log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
options = PipelineOptions()
s_pip = SimplePipeline(options)
doc_converter = DocumentConverter(
    allowed_formats=[
        # InputFormat.IMAGE,
        InputFormat.HTML,
    ],
    format_options={
        InputFormat.HTML: HTMLFormatOption(
            pipeline_cls=SimplePipeline,  # class, not instance
            backend=HTMLDocumentBackend,
        )
    },
)
url = 'https://huggingface.co/blog/aya-expanse'
start_time = time.time()
conv_result = doc_converter.convert(url)
end_time = time.time() - start_time
for item, level in conv_result.document.iterate_items():
    print('-- ', type(item), level)
    if isinstance(item, TextItem):
        print(item.text)
    elif isinstance(item, PictureItem):
        print('-- ', item.label)
_log.info(f"Document converted in {end_time:.2f} seconds.")
## Export results
# output_dir = Path("scratch")
# output_dir = '/Users/pc087/Documents/code/code24/03-pdf/docli/scratch'
output_dir = Path("03-pdf/docli/scratch")
output_dir.mkdir(parents=True, exist_ok=True)
doc_filename = conv_result.input.file.stem
print('-- output_dir : ', output_dir, doc_filename )
# Export Deep Search document JSON format:
with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
    fp.write(json.dumps(conv_result.document.export_to_dict(), ensure_ascii=False))
# Export Markdown format:
with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
    fp.write(conv_result.document.export_to_markdown())
# Export Document Tags format:
with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
    fp.write(conv_result.document.export_to_document_tokens())
Same question here
@kime541200 @ezscode We are actively working on supporting this case; an upcoming release will bring this capability.
Is this supported now? How do I call it?
@ezscode Yes, please look here: https://github.com/DS4SD/docling/blob/main/docs/examples/export_figures.py#L73
We have save_as_markdown and save_as_html, which support EMBEDDED and REFERENCED image modes:
# Save markdown with embedded pictures
md_filename = output_dir / f"{doc_filename}-with-images.md"
conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)
# Save markdown with externally referenced pictures
md_filename = output_dir / f"{doc_filename}-with-image-refs.md"
conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)
# Save HTML with externally referenced pictures
html_filename = output_dir / f"{doc_filename}-with-image-refs.html"
conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED)
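To make the difference between the image modes concrete, here is a minimal standalone sketch. It does not call Docling: `picture_markdown`, `FAKE_PNG`, and the file name `image_000.png` are hypothetical names chosen for illustration, and the exact strings Docling writes may differ. It only mirrors the idea that a placeholder marks the position, an embedded image is inlined as base64, and a referenced image is written to disk and linked.

```python
import base64
import tempfile
from pathlib import Path

# Stand-in bytes for a picture (not a decodable image); any PNG bytes would do.
FAKE_PNG = b"\x89PNG\r\n\x1a\n" + b"demo-bytes"


def picture_markdown(png_bytes: bytes, mode: str, out_dir: Path,
                     name: str = "image_000.png") -> str:
    """Return a Markdown snippet for one picture under the given mode.

    Hypothetical helper: an illustration of the three modes, not Docling code.
    """
    if mode == "placeholder":
        # Only the position of the image is marked.
        return "<!-- image -->"
    if mode == "embedded":
        # The image travels inside the document as a base64 data URI.
        payload = base64.b64encode(png_bytes).decode("ascii")
        return f"![Image](data:image/png;base64,{payload})"
    if mode == "referenced":
        # The image is written next to the document and linked by path.
        out_dir.mkdir(parents=True, exist_ok=True)
        target = out_dir / name
        target.write_bytes(png_bytes)
        return f"![Image]({target.as_posix()})"
    raise ValueError(f"unknown image mode: {mode!r}")


with tempfile.TemporaryDirectory() as tmp:
    for mode in ("placeholder", "embedded", "referenced"):
        snippet = picture_markdown(FAKE_PNG, mode, Path(tmp) / "artifacts")
        print(f"{mode:12s} {snippet[:60]}")
```

Referenced output keeps the Markdown file small but only works if the image files travel with it; embedded output is self-contained at the cost of size.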
@PeterStaar-IBM Can we use this function from the docling CLI?
@yangjuncode Yes, use the --image-export-mode option:
taa@Munlochy docling % poetry run docling --help
Usage: docling [OPTIONS] source
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from [docx|pptx|html|xml_pubmed|image|pdf|asciidoc|md|xlsx|xml_uspto] Specify input formats to convert from. Defaults to all formats. [default: None] │
│ --to [md|json|html|text|doctags] Specify output formats. Defaults to Markdown. [default: None] │
│ --image-export-mode [placeholder|embedded|referenced] Image export mode for the document (only in case of JSON, Markdown or HTML). With │
│ `placeholder`, only the position of the image is marked in the output. In `embedded` mode, │
│ the image is embedded as base64 encoded string. In `referenced` mode, the image is exported │
│ in PNG format and referenced from the main exported document. │
│ [default: embedded] │
│ --ocr --no-ocr If enabled, the bitmap content will be processed using OCR. [default: ocr] │
│ --force-ocr --no-force-ocr Replace any existing text with OCR generated text over the full content. │
│ [default: no-force-ocr] │
│ --ocr-engine [easyocr|tesseract_cli|tesseract|ocrmac|rapidocr] The OCR engine to use. [default: easyocr] │
│ --ocr-lang TEXT Provide a comma-separated list of languages used by the OCR engine. Note that each OCR │
│ engine has different values for the language names. │
│ [default: None] │
│ --pdf-backend [pypdfium2|dlparse_v1|dlparse_v2] The PDF backend to use. [default: dlparse_v2] │
│ --table-mode [fast|accurate] The mode to use in the table structure model. [default: fast] │
│ --artifacts-path PATH If provided, the location of the model artifacts. [default: None] │
│ --abort-on-error --no-abort-on-error If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error] │
│ --output PATH Output directory where results are saved. [default: .] │
│ --verbose -v INTEGER Set the verbosity level. -v for info logging, -vv for debug logging. [default: 0] │
│ --debug-visualize-cells --no-debug-visualize-cells Enable debug output which visualizes the PDF cells [default: no-debug-visualize-cells] │
│ --debug-visualize-ocr --no-debug-visualize-ocr Enable debug output which visualizes the OCR cells [default: no-debug-visualize-ocr] │
│ --debug-visualize-layout --no-debug-visualize-layout Enable debug output which visualizes the layour clusters │
│ [default: no-debug-visualize-layout] │
│ --debug-visualize-tables --no-debug-visualize-tables Enable debug output which visualizes the table cells [default: no-debug-visualize-tables] │
│ --version Show version information. │
│ --document-timeout FLOAT The timeout for processing each document, in seconds. [default: None] │
│ --num-threads INTEGER Number of threads [default: 4] │
│ --device [auto|cpu|cuda|mps] Accelerator device [default: auto] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
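Based on the options listed in the help output above, a call with referenced images could be assembled as in the sketch below. `build_docling_command` and `run_docling` are hypothetical helper names; only the flags `--to`, `--image-export-mode`, and `--output` come from the help text. The script prints the command and does not execute it, since running it needs a Docling installation and network access.

```python
import shlex
import shutil
import subprocess

# Values accepted by --image-export-mode according to the help text above.
IMAGE_EXPORT_MODES = ("placeholder", "embedded", "referenced")


def build_docling_command(source: str, output_dir: str = ".", to: str = "md",
                          image_export_mode: str = "referenced") -> list:
    """Build the argument list for a docling CLI call (hypothetical helper)."""
    if image_export_mode not in IMAGE_EXPORT_MODES:
        raise ValueError(f"unsupported image export mode: {image_export_mode!r}")
    return [
        "docling",
        "--to", to,
        "--image-export-mode", image_export_mode,
        "--output", str(output_dir),
        source,
    ]


def run_docling(command: list) -> subprocess.CompletedProcess:
    """Run the command, assuming the docling executable is on PATH."""
    if shutil.which(command[0]) is None:
        raise RuntimeError("docling executable not found on PATH")
    return subprocess.run(command, check=True)


cmd = build_docling_command("https://huggingface.co/blog/aya-expanse",
                            output_dir="scratch")
print(shlex.join(cmd))  # inspect first; call run_docling(cmd) to execute
```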
I noticed that in html_backend.py the image itself is not added to the PictureItem. As a result, no image is exported or referenced when I call save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED):
def handle_image(self, element, idx, doc):
    """Handles image tags (img)."""
    doc.add_picture(parent=self.parents[self.level], caption=None)
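Since handle_image above does not keep the img source, one possible workaround is to collect the image URLs from the HTML separately. The sketch below uses only the Python standard library and is not part of Docling: `ImageSourceCollector`, `collect_image_sources`, and the sample HTML with its asset paths are hypothetical. Pairing the collected URLs with picture items by position would be a further assumption, namely that the backend emits exactly one picture per img tag in document order.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSourceCollector(HTMLParser):
    """Collect absolute image URLs and alt text from <img> tags, in order."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> tags.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if src:
            self.images.append({
                # Resolve relative paths against the page URL.
                "src": urljoin(self.base_url, src),
                "alt": attr_map.get("alt") or "",
            })


def collect_image_sources(html: str, base_url: str) -> list:
    """Return a list of {"src", "alt"} dicts for every <img> with a src."""
    parser = ImageSourceCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.images


# Made-up markup for illustration; the asset paths are not real.
sample_html = (
    '<p>Intro</p>'
    '<img src="/blog/assets/fig1.png" alt="Figure 1">'
    '<img src="https://example.com/a.jpg"/>'
    '<img alt="no source">'
)
images = collect_image_sources(sample_html, "https://huggingface.co/blog/aya-expanse")
for image in images:
    print(image["src"], "|", image["alt"])
```

The resulting list can be written next to the exported .json so the image URLs are not lost, even though the exported document itself still carries only placeholders.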