
JPEG conversion in `analyze_document` significantly impacts table predictions

Open Belval opened this issue 1 year ago • 1 comments

When predictions are obtained through `analyze_document`, the image is converted to JPEG (https://github.com/aws-samples/amazon-textract-textractor/blob/master/textractor/textractor.py#L845). The compression is enough to degrade the table predictions.

We should check the input format and keep it, assuming it is supported by Textract, to avoid discrepancies between calling Textract through Textractor and calling it directly with boto3.
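A minimal sketch of what "check and keep the format" could look like, assuming the input is available as raw bytes. It sniffs the leading magic bytes for the formats Textract accepts (JPEG, PNG, TIFF, PDF) and only re-encodes when none match. The names `detect_supported_format`, `bytes_for_textract`, and the `reencode` callback are hypothetical and are not part of Textractor.

```python
from typing import Callable, Optional

# Leading magic bytes for the formats Textract accepts.
_MAGIC_NUMBERS = (
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (b"%PDF-", "PDF"),
)


def detect_supported_format(data: bytes) -> Optional[str]:
    """Return the format name if the bytes look like a supported format, else None."""
    for magic, name in _MAGIC_NUMBERS:
        if data.startswith(magic):
            return name
    return None


def bytes_for_textract(data: bytes, reencode: Callable[[bytes], bytes]) -> bytes:
    """Pass supported formats through untouched; re-encode everything else.

    `reencode` is a hypothetical fallback (for example, a JPEG conversion).
    """
    if detect_supported_format(data) is not None:
        return data
    return reencode(data)
```

With this approach the bytes sent by Textractor would be identical to what a direct boto3 call would send whenever the source file is already in a supported format.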

Belval, Mar 21 '24 22:03

The issue is mitigated by setting the JPEG compression parameters. Switching to PNG will require further discussion, as it comes with a significant latency hit.
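For illustration, a sketch of tightening JPEG encoding with Pillow. The exact parameters used in the fix are not stated in this thread, so `quality=95` and `subsampling=0` (4:4:4, no chroma subsampling) are assumptions, and `encode_jpeg_high_quality` is a hypothetical helper name.

```python
import io

from PIL import Image


def encode_jpeg_high_quality(image: Image.Image, quality: int = 95) -> bytes:
    """Encode an image as JPEG with reduced compression loss.

    The quality and subsampling values are illustrative assumptions,
    not the settings used by Textractor.
    """
    buffer = io.BytesIO()
    # JPEG has no alpha channel, so convert to RGB before saving.
    image.convert("RGB").save(
        buffer,
        format="JPEG",
        quality=quality,  # higher quality means less aggressive quantization
        subsampling=0,    # 4:4:4, keeps full chroma resolution
    )
    return buffer.getvalue()
```

Higher quality settings produce larger payloads, so this trades some upload size for sharper table lines and text.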

Belval, Mar 22 '24 14:03