[Question] OCR

Open xuzeyu91 opened this issue 1 year ago • 3 comments

Context / Scenario

I referred to this example and wrote an implementation of OCR. Importing scanned PDFs and PDFs that contain images did not trigger it. I'm not sure whether I did something wrong.

xuzeyu91 avatar Apr 10 '24 04:04 xuzeyu91

Looks like this is currently not possible, see code: https://github.com/microsoft/kernel-memory/blob/main/service/Core/DataFormats/Pdf/PdfDecoder.cs

That said, we already have IOcrEngine (https://github.com/microsoft/kernel-memory/blob/main/service/Abstractions/DataFormats/IOcrEngine.cs) in place, which would be enough for simple text extraction, and UglyToad.PdfPig is able to extract images as an experimental feature.

@dluc Wouldn't it be possible to extend "FileContent" with an array of the images found in the PDF, described by the GPT-4 Vision API if enabled?

lecramr avatar Apr 12 '24 11:04 lecramr

I think you will be able to support this scenario once issue https://github.com/microsoft/kernel-memory/issues/379 is completed (there is currently a PR in preview).

With that, you will be able to inject a custom decoder for PDF files.

marcominerva avatar Apr 12 '24 11:04 marcominerva

Now that custom content decoders can be injected, I would first try creating one that replaces the default PDF decoder and internally does all the work of extracting both the page text and the text from images. E.g. you can create a decoder that depends on the existing image decoder to parse images and returns all the text at the end, without the need to revisit the FileContent class (for now).

dluc avatar Apr 16 '24 00:04 dluc
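
The approach suggested in the last comment (a PDF decoder that delegates embedded images to an OCR component and returns all the text together) can be sketched roughly as follows. This is a language-neutral illustration in Python, not Kernel Memory's actual C# API: `OcrEngine`, `PdfPage`, `Section`, `PdfWithOcrDecoder`, and `FakeOcr` are all hypothetical names, and real PDF parsing (e.g. with UglyToad.PdfPig) is assumed to have already produced the page text and image bytes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class OcrEngine(Protocol):
    """Hypothetical stand-in for an OCR component (the role IOcrEngine plays)."""

    def extract_text(self, image: bytes) -> str: ...


@dataclass
class PdfPage:
    """Hypothetical parsed page: its text layer plus raw embedded images."""

    number: int
    text: str
    images: list[bytes] = field(default_factory=list)


@dataclass
class Section:
    """Hypothetical output unit: one block of text per page."""

    number: int
    content: str


class PdfWithOcrDecoder:
    """Merges each page's text layer with text recognized in its images."""

    def __init__(self, ocr: OcrEngine) -> None:
        self._ocr = ocr

    def supports(self, mime_type: str) -> bool:
        return mime_type == "application/pdf"

    def decode(self, pages: list[PdfPage]) -> list[Section]:
        sections = []
        for page in pages:
            # Text layer first, then OCR output for every embedded image.
            parts = [page.text.strip()]
            parts += [self._ocr.extract_text(img).strip() for img in page.images]
            # Drop empty parts so scanned-only pages contain just the OCR text.
            content = "\n".join(p for p in parts if p)
            sections.append(Section(page.number, content))
        return sections


class FakeOcr:
    """Test double: treats the image bytes as the already-recognized text."""

    def extract_text(self, image: bytes) -> str:
        return image.decode("utf-8")


if __name__ == "__main__":
    decoder = PdfWithOcrDecoder(FakeOcr())
    pages = [
        PdfPage(1, "Invoice", [b"Total: 42"]),
        PdfPage(2, "", [b"Scanned page"]),
    ]
    for section in decoder.decode(pages):
        print(section.number, repr(section.content))
```

Because the OCR text is folded into the same per-page output as the regular text, nothing downstream needs to know that images were involved, which is why this design would not require changes to FileContent.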