status command should show if OCR has completed

Open simonw opened this issue 3 years ago • 2 comments

This is actually quite difficult.

It turns out the textract-output/JOB_ID folder is created, empty, early on in the process. Then files called 1, 2 and so on are added to it - but they're not all added at once, so the existence of files in that folder doesn't necessarily mean that the OCR process has completed for that job ID.

simonw, Jun 30 '22 20:06

I think the only reliable way of telling if OCR has completed is to call inspect-job:

  • #15

But that's quite expensive, because it also returns the first page of JSON - which could be ~1MB of data.

I think the most efficient way to do this would be to check the expensive API for completion of each job, but then to update the .s3-ocr.json file for that key to cache the fact that we know that OCR has completed.
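A minimal sketch of that caching idea, not the actual s3-ocr implementation. It assumes the `.s3-ocr.json` file holds a `job_id` field, and it introduces a hypothetical `ocr_complete` field for the cached result. The `s3` and `textract` arguments are expected to be boto3 clients; `MaxResults=1` is passed in the hope of keeping the Textract response small.

```python
import json


def is_ocr_complete(s3, textract, bucket, key):
    """Return True if OCR has finished for `key`, caching a positive result.

    Assumptions: `<key>.s3-ocr.json` contains a "job_id" field, and
    "ocr_complete" is a hypothetical field added here as the cache.
    """
    marker_key = key + ".s3-ocr.json"
    body = s3.get_object(Bucket=bucket, Key=marker_key)["Body"].read()
    info = json.loads(body)

    # Cheap path: a previous check already recorded completion.
    if info.get("ocr_complete"):
        return True

    # Expensive path: ask Textract for the job status.
    response = textract.get_document_text_detection(
        JobId=info["job_id"], MaxResults=1
    )
    if response["JobStatus"] != "SUCCEEDED":
        return False

    # Write the result back so later checks can skip the Textract call.
    info["ocr_complete"] = True
    s3.put_object(
        Bucket=bucket,
        Key=marker_key,
        Body=json.dumps(info).encode("utf-8"),
    )
    return True
```

Only a successful job is cached here, so jobs that are still in progress get re-checked on the next run.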

simonw, Jun 30 '22 20:06

Another option: add a file called key.pdf.s3-ocr-complete.json indicating the OCR has finished. That way we don't need to GET each individual file to check status - we can check status on everything just by listing all keys in the bucket.

Even better: if we change the design so that those JSON files all live in the s3-ocr/ folder instead, we can do a status check with a single fetch of every key starting with that prefix, see:

  • #14
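A rough sketch of what that prefix-based status check could look like, assuming a hypothetical layout where `s3-ocr/<key>.s3-ocr.json` marks a started job and `s3-ocr/<key>.s3-ocr-complete.json` marks a finished one. The `s3` argument is expected to be a boto3 S3 client. Note that `list_objects_v2` returns at most 1,000 keys per call, so the "single fetch" is really one paginated listing.

```python
# Hypothetical marker file suffixes, following the naming proposed above.
STARTED_SUFFIX = ".s3-ocr.json"
COMPLETE_SUFFIX = ".s3-ocr-complete.json"


def ocr_status(s3, bucket, prefix="s3-ocr/"):
    """Map each original key to True (OCR complete) or False (still pending).

    Uses only a key listing under `prefix`; no individual file is fetched.
    """
    started, completed = set(), set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matched.
        for obj in page.get("Contents", []):
            name = obj["Key"][len(prefix):]
            if name.endswith(COMPLETE_SUFFIX):
                completed.add(name[: -len(COMPLETE_SUFFIX)])
            elif name.endswith(STARTED_SUFFIX):
                started.add(name[: -len(STARTED_SUFFIX)])
    return {key: key in completed for key in sorted(started | completed)}
```

For example, `ocr_status(boto3.client("s3"), "my-bucket")` would return something like `{"a.pdf": True, "b.pdf": False}`, where `my-bucket` and the file names are placeholders.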

simonw, Jun 30 '22 20:06