status command should show if OCR has completed
This is actually quite difficult.
It turns out the textract-output/JOB_ID folder is created, empty, early on in the process. Then files called 1, 2 and so on are added to it, but they're not all added at once, so the existence of files in that folder doesn't necessarily mean that the OCR process has completed for that job ID.
I think the only reliable way of telling if OCR has completed is to call inspect-job:
- #15
But that's quite expensive, because it also returns the first page of JSON - which could be ~1MB of data.
I think the most efficient way to do this would be to call the expensive API to check each job for completion, but then update the .s3-ocr.json file for that key to cache the fact that OCR has completed, so each finished job only has to be checked once.
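A rough sketch of that check-then-cache idea, assuming boto3 clients for S3 and Textract are passed in. The `job_id` field name, the `complete` field and the `check_and_cache` helper are all hypothetical, not the current file format. `MaxResults=1` is only an attempt to keep the response small; I have not measured how much it helps.

```python
import json


def check_and_cache(s3, textract, bucket, key):
    """Return True if OCR has completed for key, caching a positive result.

    Assumes key + ".s3-ocr.json" holds JSON with a "job_id" field
    (an assumption about the file layout) and adds a hypothetical
    "complete" flag once Textract reports success.
    """
    marker_key = key + ".s3-ocr.json"
    body = s3.get_object(Bucket=bucket, Key=marker_key)["Body"].read()
    record = json.loads(body)

    # Cached result: no need to call Textract again.
    if record.get("complete"):
        return True

    # The expensive call. JobStatus is one of IN_PROGRESS, SUCCEEDED,
    # FAILED or PARTIAL_SUCCESS.
    response = textract.get_document_text_detection(
        JobId=record["job_id"], MaxResults=1
    )
    if response["JobStatus"] == "SUCCEEDED":
        record["complete"] = True
        s3.put_object(
            Bucket=bucket,
            Key=marker_key,
            Body=json.dumps(record).encode("utf-8"),
        )
        return True
    return False
```

Only successful jobs are cached here, so in-progress jobs get re-checked on the next run. This still needs one GET per key to read the cached flag.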
Another option: add a file called key.pdf.s3-ocr-complete.json indicating that OCR has finished. That way we don't need to GET each individual file to check status - we can check status on everything just by listing all keys in the bucket.
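Roughly what the marker-file check could look like, assuming a boto3 S3 client. `list_keys` and `ocr_status` are hypothetical helper names, and treating every key ending in .pdf as a document to OCR is also an assumption.

```python
COMPLETE_SUFFIX = ".s3-ocr-complete.json"


def list_keys(s3, bucket, prefix=""):
    """Yield every key in the bucket, following list_objects_v2 pagination."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page that has no matching keys.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def ocr_status(keys):
    """Map each PDF key to True if its completion marker key also exists."""
    keys = set(keys)
    pdfs = sorted(k for k in keys if k.lower().endswith(".pdf"))
    return {pdf: (pdf + COMPLETE_SUFFIX) in keys for pdf in pdfs}
```

Usage would be something like `ocr_status(list_keys(s3, "my-bucket"))`. No object bodies are fetched, only key listings, though a large bucket still takes one list request per 1,000 keys.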
Even better: if we change the design of those JSON files so they all live in the s3-ocr/ folder instead, we can do a status check with just a single fetch of every key starting with that prefix. See:
- #14