OCR
Do the documents need to be OCRed prior to uploading?
No, they don't. We have Apache Tika embedded, which uses Google Tesseract under the hood for OCR.
-- Eric Detterman, VP and Global Head of Products and Solution Engineering, LexPredict, LLC
I just attempted a clean and reinstall and tried loading a doc that was not OCRed.
I got this error:
    Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06 INFO 2019-03-19 23:07:24 | Celery task id: fc37ca52-d218-4cdd-9a49-69bb95381e06
    Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06 INFO 2019-03-19 23:07:24 | Start task "Load Documents", id=None Kwargs: {'project': {'model': 'project.project', 'pk': 1}, 'source_data': '/', 'source_type': 'agreements', 'document_type': {'model': 'document.documenttype', 'pk': '68f992f1-dba3-4dc0-a815-4d868b23c5b4'}, 'detect_contract': True, 'delete': False, 'run_standard_locators': True, 'user_id': 1, 'metadata': {'result_links': [{'name': 'View Document List', 'link': 'document:document-list'}, {'name': 'View Text Unit List', 'link': 'document:text-unit-list'}]}, 'task_id': 'fc37ca52-d218-4cdd-9a49-69bb95381e06'}
    Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06 INFO 2019-03-19 23:07:24 | Parse / at NginxFileAccess: http://contrax-nginx:80/media/data/documents/
    Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06 INFO 2019-03-19 23:07:24 | Detected 1 files. Added 1 subtasks.
    Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06 INFO 2019-03-19 23:07:24 | Load Documents: starting 1 sub-tasks...
    Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: fc37ca52-d218-4cdd-9a49-69bb95381e06 INFO 2019-03-19 23:07:25 | End of main task "Load Documents", id=None. Sub-tasks may be still running.
    Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda INFO 2019-03-19 23:07:25 | Trying TIKA for file: JS#52732.PDF
    Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda ERROR 2019-03-19 23:07:26 | TIKA returned too small text for file: JS#52732.PDF
    Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda INFO 2019-03-19 23:07:26 | Trying Textract for file: JS#52732.PDF
    Main task: fc37ca52-d218-4cdd-9a49-69bb95381e06 | Sub-task: 50a42743-5857-430f-97e7-4137e6021dda INFO 2019-03-19 23:07:26 | Caught exception while trying to parse file with Textract: JS#52732.PDF
    Traceback (most recent call last):
      File "/contraxsuite_services/apps/task/tasks.py", line 597, in try_parsing_with_textract
        return textract2text(file_path, ext=ext), 'textract'
      File "/contraxsuite_services/apps/task/utils/ocr/textract.py", line 116, in textract2text
        text = process(path, ext=ext, method='tesseract', language=language)
      File "/contraxsuite_services/apps/task/utils/ocr/textract.py", line 99, in process
        filetype_module = importlib.import_module(rel_module, 'textract.parsers')
      File "/contraxsuite_services/venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 994, in _gcd_import
      File "<frozen importlib._bootstrap>", line 971, in _find_and_load
      File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
    ModuleNotFoundError: No module named 'textract.parsers.PDF_parser'
Looks like there is an issue with Tesseract in the latest version. I did a full clean reinstall of 1.1.9 and keep getting `ModuleNotFoundError: No module named 'textract.parsers.PDF_parser'`, even on previously OCRed text.
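One possible reading of the traceback, offered as an unconfirmed guess rather than a diagnosis: the failing call is `importlib.import_module(rel_module, 'textract.parsers')`, and the missing module is `PDF_parser`, which matches the uppercase extension of `JS#52732.PDF`. Python module lookups are case-sensitive, and textract ships its PDF parser as lowercase `pdf_parser`, so an extension that is not lowercased before building the module name would fail this way. The sketch below illustrates the idea; `naive_module_name`, `parser_module_name`, and `can_import` are hypothetical helpers (not ContraxSuite or textract functions), and the standard-library `json` package stands in for `textract.parsers` so the example runs without textract installed.

```python
import importlib
import os


def naive_module_name(file_name: str) -> str:
    """Build a relative parser module name from the extension as-is."""
    _, ext = os.path.splitext(file_name)  # "JS#52732.PDF" -> ".PDF"
    return ext + "_parser"


def parser_module_name(file_name: str) -> str:
    """Same as above, but lowercase the extension first (assumed fix)."""
    _, ext = os.path.splitext(file_name)
    return ext.lower() + "_parser"


def can_import(rel_module: str, package: str) -> bool:
    """Return True if the relative module can be imported from the package."""
    try:
        importlib.import_module(rel_module, package)
        return True
    except ModuleNotFoundError:
        return False


if __name__ == "__main__":
    print(naive_module_name("JS#52732.PDF"))   # .PDF_parser
    print(parser_module_name("JS#52732.PDF"))  # .pdf_parser

    # Stand-in demonstration that module names are matched case-sensitively:
    # json.decoder exists in the standard library, json.DECODER does not.
    print(can_import(".decoder", "json"))
    print(can_import(".DECODER", "json"))
```

If this guess is right, renaming the file to use a lowercase `.pdf` extension before uploading might sidestep the error, though that has not been verified against this installation.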
If it helps, here is the output of `docker service ls`:
    ub5b48qsfg0s  contraxsuite_contrax-celery              global      1/1  lexpredict/lexpredict-contraxsuite:latest
    ngb0mq80ze6g  contraxsuite_contrax-celery-beat         replicated  1/1  lexpredict/lexpredict-contraxsuite:latest
    lzbuwjlkxfx4  contraxsuite_contrax-curator_filebeat    replicated  1/1  stefanprodan/es-curator-cron:latest
    pn8w3ejqmsuf  contraxsuite_contrax-curator_metricbeat  replicated  0/0  stefanprodan/es-curator-cron:latest
    p928pz2n09ym  contraxsuite_contrax-db                  replicated  1/1  postgres:9.6
    tmpz5r4tkhcb  contraxsuite_contrax-elasticsearch       replicated  1/1  docker.elastic.co/elasticsearch/elasticsearch-oss:6.2.4
    w8nwy98y4rlj  contraxsuite_contrax-filebeat            global      1/1  docker.elastic.co/beats/filebeat:6.2.4
    ir5yt9t1kg47  contraxsuite_contrax-flower              replicated  0/0  lexpredict/lexpredict-contraxsuite:latest
    pock348z204w  contraxsuite_contrax-jupyter             replicated  1/1  lexpredict/lexpredict-contraxsuite:latest
    seulb1l7wcya  contraxsuite_contrax-kibana              replicated  1/1  docker.elastic.co/kibana/kibana-oss:6.2.4
    us12mggxpgz5  contraxsuite_contrax-logrotate           global      1/1  tutum/logrotate:latest
    m3cwbg5xibfj  contraxsuite_contrax-metricbeat          replicated  0/0  docker.elastic.co/beats/metricbeat:6.2.4
    l4d2wnujj4gw  contraxsuite_contrax-nginx               replicated  1/1  nginx:stable  *:80->8080/tcp, *:443->4443/tcp
    lqo0l3ubbsz7  contraxsuite_contrax-rabbitmq            replicated  1/1  rabbitmq:3-management
    uul2xgwxo17u  contraxsuite_contrax-tika                global      1/1  lexpredict/tika-server:latest
    azlhtr3dv8nn  contraxsuite_contrax-uwsgi               replicated  1/1  lexpredict/lexpredict-contraxsuite:latest
Sorry for the frequent updates. I confirmed that running OCR locally on the document and re-uploading it lets the standard Load Documents task complete correctly, so something seems amiss in the Tesseract OCR step.