Convert PDF to MD: Simplified Chinese character issue
All Simplified Chinese characters in the MD file generated from a PDF are garbled. I use docling version 2.
Are you using the CLI? I'm wondering if what you see could be solved by this fresh new PR https://github.com/DS4SD/docling/pull/214
@dolfim-ibm Yes, I am using the CLI to convert. Is it possible to specify the encoding for the output file using CLI commands?
This is due to the EasyOCR configuration. You need to change the EasyOCR lang configuration to ["en", "ch_sim"].
Noted. I will try changing the EasyOCR lang configuration. Thanks for your support!
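For anyone on the Python API rather than the CLI, here is a minimal sketch of that configuration change. The docling imports and option names are the ones used elsewhere in this thread. The helper names `parse_langs`, `convert_with_langs` and `save_markdown` are hypothetical and only for illustration, and I have not verified that this alone fixes the garbled output for every PDF.

```python
from pathlib import Path


def parse_langs(spec: str) -> list[str]:
    """Split a comma-separated string such as "en,ch_sim" into language codes."""
    return [code.strip() for code in spec.split(",") if code.strip()]


def convert_with_langs(pdf_path: str, langs: list[str]) -> str:
    """Convert a PDF to Markdown with EasyOCR set to the given languages."""
    # Imported lazily so the small helpers here work without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # "ch_sim" is the EasyOCR code for Simplified Chinese.
    pipeline_options.ocr_options = EasyOcrOptions(lang=langs)

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()


def save_markdown(markdown: str, out_path: str) -> None:
    """Write the Markdown as UTF-8 instead of the platform default encoding."""
    Path(out_path).write_text(markdown, encoding="utf-8")


# Example usage (assumes a local file named input.pdf):
# save_markdown(convert_with_langs("input.pdf", parse_langs("en,ch_sim")), "output.md")
```

Writing the file with an explicit `encoding="utf-8"` is a precaution on my side: on some systems Python's default text encoding is not UTF-8, which can also produce garbled Chinese characters independently of OCR.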
And how can I change the language through DocumentConverter()? Changing it through EasyOCR didn't work. For context, I'm trying to use it for Portuguese, but setting "pt" for EasyOCR isn't working:
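from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Ask EasyOCR for English and Portuguese
ocr_options = EasyOcrOptions(lang=["en", "pt"])  # , use_gpu=True)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = ocr_options

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)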
So there is no way to specify the EasyOCR lang via the CLI? I found that the languages supported by default are en, fr, de and es.
@derekhsu Yes, the selection of languages in the CLI will be supported soon.
The option will be supported in the CLI this week.
The CLI options for OCR have been released in https://github.com/DS4SD/docling/releases/tag/v2.6.0
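As a rough sketch of how the released option might be used from a script: the snippet below builds a `docling` command line and runs it. It assumes the flag is named `--ocr-lang` and takes a comma-separated list, and that `--to md` selects Markdown output; please confirm both with `docling --help` for your installed version. The function names are hypothetical.

```python
import subprocess


def docling_ocr_command(pdf_path: str, langs: list[str]) -> list[str]:
    """Build the argument list for a docling CLI call with explicit OCR languages."""
    # Assumption: --ocr-lang accepts comma-separated codes such as "en,ch_sim".
    return ["docling", "--ocr-lang", ",".join(langs), "--to", "md", pdf_path]


def run_docling(pdf_path: str, langs: list[str]) -> None:
    """Run the command; raises CalledProcessError if docling exits with an error."""
    subprocess.run(docling_ocr_command(pdf_path, langs), check=True)


# Example usage (assumes docling >= 2.6.0 is installed and on PATH):
# run_docling("input.pdf", ["en", "ch_sim"])
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces or non-ASCII characters.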