Teedy importer fails to upload documents - 400 error

Open savvasdalkitsis opened this issue 3 years ago • 9 comments

I tried running the latest release that ships binaries, on both Windows and Linux, and got the same result on each.

All documents fail to upload, and the only error message is "null".

Enabling NODE_DEBUG=request shows me that the PUT returns a 400 for each document. I tried accessing my instance both through its public URL (via Cloudflare) and through its local IP on the host machine, in case Cloudflare was causing the problem:

✖ Upload failed for /home/savvasdalkitsis/Papers/Papers/Financial/Bank/Lloyds/Statement 5 years till 01-2016/Statement 2.pdf: null
⠋ Importing: /home/savvasdalkitsis/Papers/Papers/Financial/Bank/Lloyds/Statement 5 years till 01-2016/Statement 3.pdf
REQUEST {
  jar: true,
  url: 'http://172.18.0.41:8080/api/tag/list',
  callback: [Function],
  method: 'GET'
}
REQUEST make request http://172.18.0.41:8080/api/tag/list
REQUEST onRequestResponse http://172.18.0.41:8080/api/tag/list 200 {
  connection: 'close',
  'cache-control': 'no-cache',
  expires: '0',
  'content-type': 'application/json',
  'content-length': '350',
  server: 'Jetty(9.4.36.v20210114)'
}
REQUEST reading response's body
REQUEST finish init function http://172.18.0.41:8080/api/tag/list
REQUEST response end http://172.18.0.41:8080/api/tag/list 200 {
  connection: 'close',
  'cache-control': 'no-cache',
  expires: '0',
  'content-type': 'application/json',
  'content-length': '350',
  server: 'Jetty(9.4.36.v20210114)'
}
REQUEST end event http://172.18.0.41:8080/api/tag/list
REQUEST has body http://172.18.0.41:8080/api/tag/list 350
REQUEST emitting complete http://172.18.0.41:8080/api/tag/list
REQUEST {
  jar: true,
  url: 'http://172.18.0.41:8080/api/document',
  form: 'title=Statement%203.pdf&language=eng&tags=',
  callback: [Function],
  method: 'PUT'
}
REQUEST make request http://172.18.0.41:8080/api/document
REQUEST onRequestResponse http://172.18.0.41:8080/api/document 400 {
  connection: 'close',
  'cache-control': 'no-cache',
  expires: '0',
  'content-type': 'application/json',
  'content-length': '50',
  server: 'Jetty(9.4.36.v20210114)'
}

savvasdalkitsis avatar Sep 01 '22 16:09 savvasdalkitsis

More info: I checked out the project locally and ran it from source, after modifying this line:

https://github.com/sismics/docs/blob/d51dfd6636ba676a8deb9e8d23bd4ce9667e3c7a/docs-importer/main.js#L454

so that it prints response.body instead of the error. I now see these messages:

Upload failed for FILE : {"type":"TagNotFound","message":"Tag not found: "}

It looks like not selecting tags for the documents causes the upload to fail?
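For context on the "null" message: with the request library, the callback's error argument is only set for transport-level failures, so an HTTP 400 arrives with error === null and the details sit in response.body. Below is a minimal sketch of how the failure text could be built from both; describeUploadFailure is a hypothetical helper for illustration, not the code at the line linked above.

```javascript
// Hypothetical helper (not from main.js): builds the text shown after
// "Upload failed for <file>:".
// Assumption: `error` and `response` are the first two arguments of a
// request-style callback (error, response, body).
function describeUploadFailure(error, response) {
  // Transport-level problems (DNS, refused connection, ...) set `error`.
  if (error) {
    return error.message || String(error);
  }
  // HTTP-level failures leave `error` null; the reason is in the body.
  if (response && response.statusCode >= 400) {
    const body = typeof response.body === 'string'
      ? response.body
      : JSON.stringify(response.body);
    return `HTTP ${response.statusCode}: ${body}`;
  }
  return null; // nothing failed
}

// A response shaped like the one reported in this thread.
const failed = {
  statusCode: 400,
  body: '{"type":"TagNotFound","message":"Tag not found: "}'
};
console.log(describeUploadFailure(null, failed));
// HTTP 400: {"type":"TagNotFound","message":"Tag not found: "}
```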

savvasdalkitsis avatar Sep 01 '22 16:09 savvasdalkitsis

I can confirm that the upload completes when I select a label to apply to all docs.

savvasdalkitsis avatar Sep 01 '22 16:09 savvasdalkitsis

@savvasdalkitsis Thanks for looking into the code. If you have time to fix #619 and #602, that would be awesome! (Both are related to the importer.)

jendib avatar Sep 01 '22 16:09 jendib

Hm, this issue seems to be with the backend rejecting an empty tag list when adding files. I can take a stab at fixing this, but I don't have a wide picture of the backend app and am not sure whether that would break anything.

savvasdalkitsis avatar Sep 01 '22 16:09 savvasdalkitsis

@savvasdalkitsis It might be because the importer is passing an empty string in the tags parameter, and the backend interprets it as a list of one tag with no ID (which of course doesn't exist and leads to "TagNotFound").

We would need to inspect what is sent to the API exactly.
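The debug log above already shows the form body ending in "tags=", which is consistent with this hypothesis. As a rough sketch of an importer-side workaround, the tags field could be left out entirely when no tag is selected. This assumes the form is built with Node's querystring module and that the API accepts repeated tags=<id> pairs; buildDocumentForm is a hypothetical helper, not the importer's actual code.

```javascript
const querystring = require('querystring');

// Hypothetical helper: builds the form body for PUT /api/document.
// Assumption: `tagIds` is an array of tag ID strings, possibly empty.
function buildDocumentForm(title, language, tagIds) {
  const fields = { title, language };
  // Drop empty IDs so the backend never sees a tag with no ID.
  const ids = (tagIds || []).filter((id) => id !== '');
  if (ids.length > 0) {
    fields.tags = ids; // arrays serialize as repeated tags=<id> pairs
  }
  return querystring.stringify(fields);
}

// An empty string reproduces the body seen in the debug log:
console.log(querystring.stringify({ title: 'Statement 3.pdf', language: 'eng', tags: '' }));
// title=Statement%203.pdf&language=eng&tags=

// With no tags selected, the field is omitted instead:
console.log(buildDocumentForm('Statement 3.pdf', 'eng', []));
// title=Statement%203.pdf&language=eng
```

This only sidesteps the problem on the client. Whether the backend should also tolerate an empty tags value is a separate decision.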

jendib avatar Sep 01 '22 16:09 jendib

Would something like this work? I cannot run the project locally for some reason (I need to spend some time on that), so I am not sure what kind of integration testing (if any) is in place or whether I broke anything. I guess the PR will tell us.

savvasdalkitsis avatar Sep 01 '22 17:09 savvasdalkitsis

It might be interesting to reproduce the issue (and confirm that it's fixed) by adding a test around here https://github.com/sismics/docs/blob/master/docs-web/src/test/java/com/sismics/docs/rest/TestDocumentResource.java#L64

jendib avatar Sep 01 '22 17:09 jendib

That's the test I had problems running. It requires external dependencies to be installed, and I didn't have a lot of time to investigate. I will do that once I have some spare time.

savvasdalkitsis avatar Sep 01 '22 17:09 savvasdalkitsis

Yes, you are right, it requires these dependencies: ffmpeg, mediainfo, tesseract-ocr and tesseract-ocr-deu (see https://github.com/sismics/docs/blob/master/.github/workflows/build-deploy.yml#L22).

jendib avatar Sep 01 '22 18:09 jendib