Indexing error: Invalid PATH argument. File not found
Device Information (please complete the following information):
- OS: Debian 12
- Deployment: Docker
- Browser (if relevant): Chrome
- SIST2 Version: 3.3.4
- Elasticsearch Version (if relevant): 7.17.9
Hi, I'm using the following docker-compose.yml:
version: "3"
services:
  elasticsearch:
    image: elasticsearch:7.17.9
    restart: unless-stopped
    environment:
      - "discovery.type=single-node"
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
  sist2-admin:
    image: simon987/sist2:3.3.4-x64-linux
    restart: unless-stopped
    volumes:
      - ./container-data/sist2-admin-data/:/sist2-admin/
      - ./container-data/files:/host-files2
    ports:
      - 4090:4090 # sist2
      - 8080:8080 # sist2-admin
    working_dir: /root/sist2-admin/
    entrypoint: python3 /root/sist2-admin/sist2_admin/app.py
I have verified that my files are present in the container at the following path: /host-files2. I created a job called Test in sist2-admin, selected elasticsearch as the search engine (test result: Elasticsearch version 7.17.9), set the path to scan to /host-files2/, and enabled OCR for image files ("Enable image file recognition") with the Tesseract languages eng and rus. When I start indexing, I get the following error:
[ADMIN ] Starting sist2 command with args ['/root/sist2', 'index', '/sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2', '--threads=1', '--es-url=http://elasticsearch:9200', '--es-index=sist2', '--batch-size=70', '--incremental-index', '--json-logs', '--very-verbose']
2024-01-23 18:36:06 [FATAL cli.c] Invalid PATH argument. File not found: /sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2
[ADMIN ] Sist2Scan task finished return_code=-10, duration=datetime.timedelta(microseconds=3736)
Running indexing with the search engine sqlite produces the same result:
[ADMIN ] Starting sist2 command with args ['/root/sist2', 'sqlite-index', '/sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2', '--search-index', '/sist2-admin/search-index-sqlite.sist2', '--json-logs', '--very-verbose']
2024-01-23 18:36:31 [FATAL cli.c] Invalid PATH argument. File not found: /sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2
[ADMIN ] Sist2Scan task finished return_code=-10, duration=datetime.timedelta(microseconds=3716)
Indexing without OCR, however, starts without problems, although the log contains many errors of the following type:
2024-01-23 19:01:37 [ERROR ooxml.c] Got fatal XML error while parsing document: Start tag expected, '<' not found
2024-01-23 19:01:37 [ERROR ooxml.c] Got fatal XML error while parsing document: Start tag expected, '<' not found
2024-01-23 19:01:39 [ERROR ooxml.c] Got fatal XML error while parsing document: Start tag expected, '<' not found
Initially, I had about 3,000 .doc, .docx and .pdf files, and I somehow managed to index them after setting the ownership and permissions of the /host-files2 folder to root:root 755. I have since uploaded several times as many files of various types (including image files), but I have not been able to index them with the new job (with OCR).
Steps To Reproduce Please be specific!
- Go to sist2-admin
- Click on [job name] and Index now
- Click on Tasks and see the indexing error
Expected behavior Indexing of files using OCR and the selected search engine should begin
Actual Behavior
I get an error about the missing scan file /sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2
Screenshots
The indexing process without OCR:
Recurring errors with such indexing:
Error when starting indexing with OCR:
Unfortunately, I have no more ideas for how to fix this error. I tried experimenting with the permissions on the files folder, deleted sist2-admin-data, and recreated the containers, but nothing helped.
I really like your product, and I would like it to keep developing. I hope you can help, thank you!
What does the end of the scan log file say?
Did you check to see if the File is actually present? Like use docker exec -it sist2-admin bash then cd /sist2-admin
What does the end of the scan log file say?
At the moment, indexing looks like this:
The log looks like this (no longer updated):
On the Test job page, the status is failed:
For this reason, I cannot create a frontend.
Did you check to see if the File is actually present? Like use docker exec -it sist2-admin bash then cd /sist2-admin
If you mean the file /sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2, it wasn't there when I started indexing with OCR.
The file was created only after indexing without OCR, but, as I wrote earlier, it also failed.
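To double-check which scan files actually exist at a given moment, a small Python sketch like the one below could list every .sist2 file in the data directory together with its size. The function name list_scan_files is hypothetical and not part of sist2; the only thing taken from the logs above is that scan outputs end in .sist2 and live in /sist2-admin.

```python
from pathlib import Path
from typing import List, Tuple


def list_scan_files(data_dir: str) -> List[Tuple[str, int]]:
    """Return (file name, size in bytes) for each *.sist2 file directly in data_dir.

    Hypothetical helper for troubleshooting; it only reads directory metadata.
    """
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(data_dir).glob("*.sist2")
        if p.is_file()
    )


if __name__ == "__main__":
    # Assumed location, taken from the paths shown in the task logs.
    for name, size in list_scan_files("/sist2-admin"):
        print(f"{size:>12}  {name}")
```

Run inside the container, this would show whether a scan-Test-... file with the expected timestamp is present, or whether only the search index file is there.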
I then deleted the sist2-admin-data folder and recreated the container, created the Test job again, selected the sqlite search engine with OCR, and started indexing. Indexing completed without problems.
After that, I changed the search engine to elasticsearch in the Test job and started indexing again. It also completed without problems.
This line seems interesting:
2024-01-24 06:06:46 [DEBUG database.c] Closing database /dev/shm/sist2-ipc-31.sqlite (0x565134449bb8)
Elasticsearch can't work without sqlite indexing?
To make sure the result was reproducible, I repeated the same steps, and I get the error again:
[ADMIN ] Starting sist2 command with args ['/root/sist2', 'sqlite-index', '/sist2-admin/scan-Test-2024-01-24 06:24:40.549224.sist2', '--search-index', '/sist2-admin/search-index-sqlite.sist2', '--json-logs', '--very-verbose']
2024-01-24 06:25:51 [FATAL cli.c] Invalid PATH argument. File not found: /sist2-admin/scan-Test-2024-01-24 06:24:40.549224.sist2
[ADMIN ] Sist2Scan task finished return_code=-10, duration=datetime.timedelta(microseconds=3477)
I don't understand what's wrong.
When a Job is run, it creates two tasks: (1) Scan; and (2) Index
During the scan task, SIST2 goes through the files and pulls out the requested data and stores it in a .sist2 sqlite db.
During the index task, if the scan was successful, SIST2 provides the data from the .sist2 db to the index of choice, ES or SQLITE, whichever you have set up as the search backend.
To me, it looks like the scan may be failing without being marked as failed for some reason, so the sist2 database is not created by the scan. Thus, when the index task runs, it fails because the sist2 database is not there.
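One way to test this hypothesis is to inspect the scan output before the index task runs. The sketch below is only an illustration, not how sist2-admin works: check_scan_file is a hypothetical helper, and it relies only on the fact that the .sist2 file is an SQLite database, so it should start with the standard SQLite 3 file header.

```python
import os

# Every SQLite 3 database file begins with this 16-byte header.
SQLITE_MAGIC = b"SQLite format 3\x00"


def check_scan_file(path: str) -> str:
    """Classify a scan output path as 'missing', 'empty', 'not-sqlite' or 'ok'.

    Hypothetical pre-flight check; it does not validate the sist2 schema.
    """
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as f:
        header = f.read(len(SQLITE_MAGIC))
    return "ok" if header == SQLITE_MAGIC else "not-sqlite"
```

A result of "missing" right after the scan task reports success would point at the scan step rather than at Elasticsearch or the sqlite search backend.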
Elasticsearch can't work without sqlite indexing?
That line isn't from sqlite indexing. During scanning, the metadata and content from the files are stored in an sqlite db. During ES indexing, an index is made/updated and the file metadata and content are stored in ES. The sist2 file also stores metadata about the scan that was performed as well as information such as embeddings, the stats page aggregations and treemaps, the tags, the thumbnails, version info, etc.
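Since the .sist2 file is a regular sqlite db, it can be inspected with Python's standard sqlite3 module. The sketch below makes no assumption about sist2's table names; it only asks SQLite which tables exist. list_tables is a hypothetical helper name, and the file is opened read-only so that looking at it cannot modify it.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in an SQLite file, sorted alphabetically."""
    # as_uri() percent-encodes spaces and colons, which appear in scan file names.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

If this raises an error or returns an empty list for a scan file, the scan most likely did not finish writing its data.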