File processing jobs (e.g. image compression)
This is a feature request: the ability to automatically run external applications on specified files as part of maintenance.
My specific use case is that I want to convert PNG files in my database to lossless WebP to save on storage space. Some testing indicates this would reduce the storage impact of those files by one third (after aggressive optimization with cwebp). I also want to losslessly optimize JPEG files using MozJPEG. (In the future, converting PNG files to lossless JPEG XL files would make them half the size based on my testing, and lossless JPEG transcoding into JPEG XL would reduce size by ~20%, but the JPEG XL format isn't finalized yet.)
Addressing this use case by adding image-optimization code directly into Hydrus would add even more dependencies without being very flexible. Instead, I think this would be best accomplished by creating a system where users can make jobs that run commands on files as part of maintenance. I envision these jobs being defined/set up like this (a rough sketch of the same structure in code follows the list):
- Job priority
- Search predicates for files to match
- ["command to run"]
- process priority: [#] (Linux) / [low/below normal/normal/above normal/high] (Windows)
- ☑︎ send interrupt signal after [#]s (useful for limiting brute-force image optimizers)
- ☑︎ terminate after [#]s
- output: 🔘︎["file path"] 🔘︎copy of input file 🔘︎stdout ("copy of input file" is for applications that always overwrite the input file)
- Conditions to evaluate:
- ☑︎ exit code: [#]
- ☑︎ height/width: 🔘︎< 🔘︎≈ 🔘︎= 🔘︎> [#] ☑︎compare to input
- ☑︎ ratio: 🔘︎= 🔘︎wider than 🔘︎taller than 🔘︎≈ [#]:[#] ☑︎compare to input
- ☑︎ num_pixels: 🔘︎< 🔘︎≈ 🔘︎= 🔘︎> [#] 🔘︎pixels 🔘︎kilopixels 🔘︎megapixels ☑︎compare to input
- ☑︎ duration: 🔘︎has duration 🔘︎no duration 🔘︎< 🔘︎≈ 🔘︎= 🔘︎> [#]s [#]ms ☑︎compare to input
- ☑︎ framerate: 🔘︎< 🔘︎= 🔘︎> [#]fps ☑︎compare to input
- ☑︎ number of frames: 🔘︎< 🔘︎≈ 🔘︎= 🔘︎> [#] ☑︎compare to input
- ☑︎ filesize: 🔘︎< 🔘︎≈ 🔘︎= 🔘︎> [#][B, KB, MB, GB] ☑︎compare to input
- ☑︎ filetype: [match input, mismatch input, system:filetype list...]
- ☑︎ has audio: [yes, no, match input, mismatch input]
- ☑︎ similarity: [0-63] ☑︎check for pixel-for-pixel duplicates
- Actions to take (when conditions are true):
- ☑︎ relationship between new and old files: [new is better, old is better, same quality, related alternates, not related/false positive, potential duplicates]
- Copy metadata to new file:
- Tags (by service)
- Ratings (by service)
- ☑︎ URLs
- ☑︎ notes
- ☑︎ viewing stats
- ☑︎ file relationships
- ☑︎ import time
- ☑︎ modified time
- Add tags to new file
- Add tags to old file
- send new file to: [match input, inbox, archive]
- ☑︎ send old file to: [inbox, archive, trash, delete permanently]
- Files processed (#/#)
- Status: (output invalid, conditions failed: (list), added file: #hash)
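To make the shape of such a job more concrete, here is a minimal sketch of how one might be represented as a data structure. All of the names here (the classes, fields, and enum values) are placeholders I made up for illustration; none of them are existing Hydrus code:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional

class Relationship(Enum):
    # mirrors the duplicate-relationship choices in the list above
    NEW_IS_BETTER = auto()
    OLD_IS_BETTER = auto()
    SAME_QUALITY = auto()
    ALTERNATES = auto()
    NOT_RELATED = auto()
    POTENTIAL_DUPLICATES = auto()

@dataclass
class Condition:
    # e.g. Condition("filesize", "<", compare_to_input=True)
    field_name: str
    comparator: str
    value: object = None
    compare_to_input: bool = False

@dataclass
class FileProcessingJob:
    priority: int                             # where this job sits relative to other jobs
    search_predicates: list[str]              # the same predicate strings the search UI produces
    command: str                              # with a placeholder for the input path
    process_priority: Optional[str] = None    # numeric on Linux, named class on Windows
    interrupt_after_s: Optional[float] = None
    terminate_after_s: Optional[float] = None
    output_mode: str = "file path"            # "file path" | "copy of input file" | "stdout"
    conditions: list[Condition] = field(default_factory=list)
    relationship: Relationship = Relationship.NEW_IS_BETTER
    metadata_to_copy: set[str] = field(default_factory=lambda: {"tags", "urls", "notes"})
    tags_for_new_file: list[str] = field(default_factory=list)
    tags_for_old_file: list[str] = field(default_factory=list)
    send_new_file_to: str = "match input"     # "match input" | "inbox" | "archive"
    send_old_file_to: str = "trash"           # "inbox" | "archive" | "trash" | "delete permanently"
```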
The actions would never be taken if the output is not a valid file (including if it doesn't exist or, in "copy of input file" mode, if the copy was never modified). When the actions are not taken, Hydrus instead deletes/discards the output. When they are taken, the first action is always importing the new file. Jobs wouldn't run on files they themselves created, even if those files match the search predicates (to prevent infinite loops).
The "command to run" field obviously needs variable substitution for the input file path. Perhaps some additional variable substitutions would also be useful to have here, though I'm not sure what those would be.
As an aside: I include "import time" and "modified time" as metadata-copy options because I still use them to crudely sort by when something was published, since Hydrus does not yet store that as its own type of metadata. If that were to change, I wouldn't need those two options (although they do make for more "seamless" file replacements, which some users might prefer anyway).
Some similar infrastructure to this already exists in other areas:
- The whole job system is similar to the "file maintenance" functionality (database > maintenance > files > manage scheduled jobs), but I think it lacks features this would need (though I'm not sure, as I've never used it and can't find much documentation on the particulars of how it functions):
- Persistent jobs: file maintenance appears to operate on a list of files selected by a search when the job is created, rather than keeping the search predicates themselves. This means the job does not pick up new files as they appear; instead, it is removed once it has finished operating on its list of files.
- Priority: I'm not sure if the jobs in the "scheduled work" tab are run in the order they appear, but it doesn't look to me like they can be reordered in any case.
- Many of the "conditions to evaluate" copy their respective search predicate UIs.
- "Actions to take" is like a custom duplicate action, but that only lets you configure copying of tags, ratings, archive status, and URLs. The "actions" also wouldn't need to be bidirectional, as the new file has just been imported and so has no metadata to copy to the old file.
As an example, here is my own workflow for losslessly converting PNG files to optimized WebP files with cwebp. I would have two jobs: one that quickly converts PNG files (for immediate storage space savings), and another that runs afterwards and optimizes more aggressively (also using cwebp). On both, I would check that the exit code is 0, the output file is smaller, and the output file is a pixel-for-pixel duplicate; set the relationship as "new is better"; copy all metadata; and delete the old file.
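To illustrate what those two jobs would be doing, here is a rough sketch of one conversion pass plus the condition checks. The cwebp flags are real (`-lossless` for the quick pass, `-z 9` for the maximum-effort pass), but the exact settings are only my example, and the pixel-for-pixel check uses Pillow purely for illustration; Hydrus would presumably use its own pixel-duplicate machinery:

```python
import subprocess
from pathlib import Path

from PIL import Image  # Pillow, used here only for the pixel-for-pixel check

def pixel_for_pixel_duplicate(a: Path, b: Path) -> bool:
    """True if both images decode to exactly the same RGBA pixels."""
    with Image.open(a) as im_a, Image.open(b) as im_b:
        return (im_a.size == im_b.size
                and im_a.convert("RGBA").tobytes() == im_b.convert("RGBA").tobytes())

def convert_to_webp(src: Path, dst: Path, aggressive: bool) -> bool:
    """Run one conversion pass and evaluate the job's conditions."""
    args = ["cwebp", "-z", "9"] if aggressive else ["cwebp", "-lossless"]
    result = subprocess.run(args + [str(src), "-o", str(dst)])

    return (result.returncode == 0                           # exit code condition
            and dst.exists()
            and dst.stat().st_size < src.stat().st_size      # output must be smaller
            and pixel_for_pixel_duplicate(src, dst))         # truly lossless?
```

If those checks came back false, Hydrus would discard the WebP and leave the original file untouched; if true, it would import the WebP, set "new is better", copy the metadata, and delete the old file, as described above.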
There are many things you could do with this besides lossless image compression, though. Some examples:
- Compress media lossily to save more storage space while being perceptually indistinguishable
- Upscale small images tagged with "anime" using waifu2x
- Use FFmpeg to fix the aspect ratio on some video files
Being able to manage these jobs within Hydrus itself has several benefits over doing them manually:
- Much less work for the user compared to exporting, processing, and importing the files themselves, even if they automate some of this with the client API
- Allows metadata to be copied directly, rather than having to copy it over manually or use the client API
- Has features not available in the client API, such as setting relationships
- Can be done automatically as part of Hydrus' background maintenance work, whereas running these processes manually/separately during downtime could conflict with Hydrus' own checks for when the system is busy
- Can be set up by less experienced users much more easily, compared to the programming ability necessary to create a script that uses the client API
One feature I would like is being able to see how much storage space I've saved by running image conversions/optimizations, but I have no idea how that would fit into this system.