Convert pdf to md simplified Chinese character issue

Open JerryXu2023 opened this issue 1 year ago • 8 comments

All simplified Chinese characters in the MD file generated from a PDF are garbled. I use docling version 2.

JerryXu2023 avatar Nov 04 '24 08:11 JerryXu2023

Are you using the CLI? I'm wondering if what you see could be solved by this fresh new PR https://github.com/DS4SD/docling/pull/214

dolfim-ibm avatar Nov 04 '24 08:11 dolfim-ibm

@dolfim-ibm Yes, I am using the CLI to convert. Is it possible to specify the encoding for the output file using CLI commands?

JerryXu2023 avatar Nov 04 '24 08:11 JerryXu2023

> All simplified Chinese characters in the MD file generated from a PDF are garbled. I use docling version 2.

This is due to the EasyOCR configuration. You need to change the EasyOCR lang setting to ["en", "ch_sim"].

shangbinbin avatar Nov 04 '24 08:11 shangbinbin
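A minimal sketch of the suggestion above, assuming docling v2's Python API (the same `EasyOcrOptions` and `PdfPipelineOptions` classes that appear later in this thread). The helper names `build_converter`, `save_markdown`, and `convert_pdf` are hypothetical and only for illustration. The thread does not establish whether the garbling came from OCR or from the output encoding, so the sketch sets the OCR languages and also writes the Markdown as UTF-8 explicitly.

```python
from pathlib import Path


def build_converter(langs):
    """Build a DocumentConverter whose EasyOCR stage uses the given languages."""
    # Imported here so the rest of the module loads even without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = EasyOcrOptions(lang=list(langs))
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


def save_markdown(markdown, out_path):
    """Write Markdown text as UTF-8 regardless of the platform default encoding."""
    # A non-UTF-8 default (for example a legacy Windows code page) can garble
    # or reject CJK text, so the encoding is passed explicitly.
    Path(out_path).write_text(markdown, encoding="utf-8")


def convert_pdf(pdf_path, out_path, langs=("ch_sim", "en")):
    """Convert one PDF to Markdown with simplified Chinese and English OCR."""
    result = build_converter(langs).convert(pdf_path)
    save_markdown(result.document.export_to_markdown(), out_path)
```

Calling `convert_pdf("input.pdf", "output.md")` (both paths are placeholders) would run the whole flow. EasyOCR downloads the language models on first use, so the first run may take longer.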

Noted. I will try changing the EasyOCR lang configuration. Thanks for your support!

JerryXu2023 avatar Nov 04 '24 08:11 JerryXu2023

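And how can I change the language through DocumentConverter()? Changing it through EasyOCR didn't work. For context, I'm trying to use it for Portuguese, but defining "pt" for EasyOCR isn't working:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# OCR languages use EasyOCR codes; use_gpu=True can be passed as well.
ocr_options = EasyOcrOptions(lang=['en', 'pt'])

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = ocr_options

# Attach the pipeline options to the PDF input format.
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)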

EdwardSJ151 avatar Nov 04 '24 20:11 EdwardSJ151

So there is no way to specify the EasyOCR lang via the CLI? I found that the languages supported by default are en, fr, de and es.

derekhsu avatar Nov 05 '24 01:11 derekhsu

@derekhsu Yes, language selection in the CLI will be supported soon.

PeterStaar-IBM avatar Nov 11 '24 09:11 PeterStaar-IBM

The option will be supported in the CLI this week.

PeterStaar-IBM avatar Nov 18 '24 08:11 PeterStaar-IBM

The CLI options for OCR have been released in https://github.com/DS4SD/docling/releases/tag/v2.6.0

dolfim-ibm avatar Nov 19 '24 17:11 dolfim-ibm
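For scripted use, the CLI call can be wrapped from Python. This is a sketch only: it assumes the OCR language flag is `--ocr-lang` with a comma-separated value, and that `--to` and `--output` behave as in the current CLI help; none of these flag names are stated in this thread, so check `docling --help` for the installed version. The function names are hypothetical.

```python
import subprocess


def build_docling_command(pdf_path, langs, output_dir):
    """Return the argument list for a docling CLI call with explicit OCR languages."""
    return [
        "docling",
        "--to", "md",                    # assumed flag: export format
        "--ocr-lang", ",".join(langs),   # assumed flag: comma-separated OCR language codes
        "--output", str(output_dir),     # assumed flag: output directory
        str(pdf_path),
    ]


def run_docling(pdf_path, langs, output_dir):
    """Run the docling CLI and fail loudly if it exits with an error."""
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_docling_command(pdf_path, langs, output_dir), check=True)
```

For the original report, `run_docling("input.pdf", ["en", "ch_sim"], "out")` would request English and simplified Chinese OCR; the file and directory names are placeholders.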