
OCR workflow should maintain modification date of original file

Open ferdiga opened this issue 1 year ago • 3 comments

Describe the bug

We plan to load historical PDF files into the database and want to make them searchable using the OCR workflow. However, the workflow changes the modification date of each file, so the important historical context carried by the modification date is lost, which limits the usability of this great feature.

The ocrmypdf maintainer confirms that ocrmypdf must change the modification date to comply with the standard.

For the OCR workflow I see 2 options:

  • optionally restore the original modification date after adding the OCR layer.
  • add the original modification date to the file name, either prepended in a format that allows sorting of the files (like "yyyymmdd") or appended.

To work around this, I have written a small Python script that prepends the original modification date to the name of every PDF file that does not already start with a date, but I want to clarify the situation before I proceed.
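A minimal sketch of what such a renaming script could look like (this is not the script mentioned above). The helper names `dated_name` and `prepend_dates`, the underscore separator, and the rule "eight leading digits count as a date" are assumptions made for illustration only.

```python
import re
import time
from pathlib import Path

# Assumption: a name that starts with eight digits already carries a date.
DATE_PREFIX = re.compile(r"^\d{8}")


def dated_name(path: Path) -> str:
    """Return the file name with its modification date (yyyymmdd) prepended."""
    if DATE_PREFIX.match(path.name):
        return path.name  # already prefixed, leave unchanged
    stamp = time.strftime("%Y%m%d", time.localtime(path.stat().st_mtime))
    return f"{stamp}_{path.name}"


def prepend_dates(folder: Path) -> list[Path]:
    """Rename every PDF in `folder` that lacks a date prefix; return the new paths."""
    renamed = []
    # sorted() materialises the listing first, so renamed files are not revisited.
    for pdf in sorted(folder.glob("*.pdf")):
        new_name = dated_name(pdf)
        if new_name != pdf.name:
            target = pdf.with_name(new_name)
            pdf.rename(target)  # a rename does not alter the modification time
            renamed.append(target)
    return renamed
```

Because the date goes in front in "yyyymmdd" form, sorting by name also sorts by original modification date. The script has to run before the OCR workflow touches the files, since afterwards the modification time no longer holds the historical date.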

System

  • App version: 1.29
  • Nextcloud version: 29.0.3

How to reproduce

Steps to reproduce the behavior: trigger the OCR Workflow

ferdiga, Jul 22 '24 09:07

Additional remark: I would go for "restore the original modification date after adding the OCR layer." because

  • we need to sort files by name and by (original modification) date.

ferdiga, Jul 22 '24 17:07

Hi @ferdiga, thanks for the comprehensive explanation of your use-case. As you already described, changing a file (in this case, adding the OCR layer) automatically changes the last-modified date, which is the expected behaviour when touching a file on a system and writing new content to it. The app itself just utilizes the NC API to create a new file version here. The file_put_contents call it uses just writes the file to disk and creates a new file version in Nextcloud, without the option to change any file metadata (see here).

A possible way to implement this after the new file version has been written would be to use touch with a second argument (the old timestamp). In the UI we'd need to have an additional parameter like "Maintain original modification date". If set to true, we'd need to store the original modification date before creating the new file version, and write it back after it has been created.
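The store-then-restore idea can be sketched on a plain local file as follows. This is only an illustration in Python, not the app's code: the app is PHP and goes through the Nextcloud API, whereas `os.utime` here merely stands in for the touch call with the old timestamp, and `rewrite_keeping_mtime` is a hypothetical helper name.

```python
import os
from pathlib import Path


def rewrite_keeping_mtime(path: Path, new_content: bytes) -> None:
    """Overwrite `path` and then put back its original timestamps."""
    original = path.stat()          # 1. remember the timestamps before writing
    path.write_bytes(new_content)   # 2. write the new version (mtime becomes "now")
    # 3. write the remembered access and modification times back
    os.utime(path, ns=(original.st_atime_ns, original.st_mtime_ns))
```

The order matters: the timestamp has to be read before the new content is written, because afterwards the original value is gone.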

Possible Workaround

For the time being one could "chain" the Workflow OCR with the Workflow Script:

  1. Create the OCR Workflow and choose "Assign tags after OCR". Choose any tag you want to assign after successful OCR (for example "OCR success")
  2. Create a second Workflow with the Workflow Script. Use the tag assignment for "OCR success" as a trigger for this workflow and implement your modification date magic directly within the triggered script

R0Wi, Jul 23 '24 10:07

Hi, thanks for looking into this. Option 2: once the file has the new tag, it also has the new timestamp. IMHO not the way to go.

What I probably will do

  • mount the NC files with davfs2
  • write a (Python) script that makes a copy of each PDF file without a text layer; the copy will be processed by the NC OCR Workflow
  • write a cleanup script (with certain triggers of the OCR workflow disabled so that the Workflow is not triggered again): if such a copy with a text layer is found, read the modification date from the original, mv the copy over the original, and touch -mt with the original timestamp
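The cleanup step could be sketched roughly like this. Several things are assumed rather than taken from the plan above: copies are named with a hypothetical ".ocr.pdf" suffix next to the original, every such copy is taken to have been OCR'd already (no text-layer check is done), and the mount is assumed to pass timestamp changes through to the server, which is not verified here.

```python
import os
from pathlib import Path

# Hypothetical naming convention: "scan.pdf" has its OCR'd copy in "scan.ocr.pdf".
COPY_SUFFIX = ".ocr.pdf"


def swap_in_ocr_copies(folder: Path) -> list[Path]:
    """Move each OCR'd copy over its original, keeping the original's timestamps."""
    restored = []
    for copy in sorted(folder.glob(f"*{COPY_SUFFIX}")):
        original = copy.with_name(copy.name[: -len(COPY_SUFFIX)] + ".pdf")
        if not original.exists():
            continue  # no matching original, leave the copy alone
        stat = original.stat()      # read the timestamps before overwriting
        os.replace(copy, original)  # the copy takes the original's place
        os.utime(original, ns=(stat.st_atime_ns, stat.st_mtime_ns))
        restored.append(original)
    return restored
```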

Another script is necessary for digitally signed files: it prints the file instead of copying it, which destroys the signature in the copy. The original must be preserved (ocrmypdf will not touch it), but we still want to have a searchable version.

ferdiga, Jul 23 '24 11:07