Use a named temporary directory for OCR jobs
Descriptive summary
Derivative jobs that run Tesseract create their temporary files without specifying a path, so the files end up in the app's current working directory (/data).
Defined in: oregon_digital/hocr_derivative_service.rb
def temporary_output
  @temporary_output ||= Tempfile.new
end
Tempfile.new()
new(basename="", tmpdir=nil, mode: 0, **options)
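One possible fix is to pass an explicit directory as the second argument to Tempfile.new. The sketch below is only an illustration: the class name `HocrTempfileExample` and the environment variable `OCR_TEMP_DIR` are assumptions, not names from the codebase.

```ruby
require 'tempfile'
require 'tmpdir'
require 'fileutils'

# Hypothetical stand-in for the relevant part of the hOCR derivative service.
class HocrTempfileExample
  # OCR_TEMP_DIR is an assumed variable name; falls back to the system temp dir.
  def temp_dir
    dir = ENV.fetch('OCR_TEMP_DIR', Dir.tmpdir)
    FileUtils.mkdir_p(dir) # make sure the directory exists before using it
    dir
  end

  def temporary_output
    # First argument is the basename prefix, second is the target directory.
    @temporary_output ||= Tempfile.new('ocr', temp_dir)
  end
end
```

With this shape, Tesseract's output base and the resulting .hocr file would land under the configured directory rather than next to the application code.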
Output files from Tesseract:
20220408-40-1010b8s 20220408-40-hfpucb build od220220408-40-1vy37qn.png
20220408-40-1010b8s.hocr 20220408-40-hfpucb.hocr config od220220408-40-1w19tt0.png
20220408-40-14tld02 20220408-40-jdiubi config.ru od220220408-40-1yz8x4l.png
20220408-40-14tld02.hocr 20220408-40-radpif db od220220408-40-8g05f2.png
20220408-40-179lvfx 20220408-40-radpif.hocr docker-compose.override.yml-example od220220408-40-jmtclg.png
20220408-40-179lvfx.hocr 20220408-40-taic4n docker-compose.yml od220220408-40-m1cd6k.png
20220408-40-1fo71vt 20220408-40-taic4n.hocr fits.log od220220408-40-wdxiax.png
20220408-40-1fo71vt.hocr 20220408-40-w4keen lib package.json
20220408-40-1g6iucs Gemfile log public
20220408-40-1g6iucs.hocr Gemfile.lock node_modules spec
20220408-40-1rbc9m5 README.md od220220407-40-4c6ezy.jp2 tmp
20220408-40-1rbc9m5.hocr Rakefile od220220408-40-19ff4yv.png vendor
20220408-40-1y9shxr app od220220408-40-1gozlc0.png yarn.lock
20220408-40-1y9shxr.hocr bin od220220408-40-1oxzb0t.png
Expected behavior
Temporary files should be created in a standard location. For better performance, we should consider writing them to a RAM disk. The directory used for each derivative type should be configurable through environment variables at runtime.
I would suggest we move to a per-content-type temporary file base directory to allow us more flexibility.
- Video temporary directory
- Audio temporary directory
- OCR temporary directory
- ???
- General derivatives temporary directory
Derivative generation needs temporary files to be as fast as possible, so they should be written to a RAM disk where we can. We should investigate using the size of the source file to decide where temporary files go: if the file is below a given size, they go to the RAM disk; if it is bigger than N bytes, they go to local disk so that we avoid filling the RAM disk and exhausting the node's RAM.
- Video
  - Video Small (/ramdisk/video)
  - Video Large (/local/video)
- Audio
  - Audio Small (/ramdisk/audio)
  - Audio Large (/local/audio)
- OCR
  - OCR Small (/ramdisk/ocr)
  - OCR Large (/local/ocr)
- ???
- General
  - General Small (/ramdisk/general)
  - General Large (/local/general)
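The size-based choice could be sketched roughly as follows. The module name `TempDirSelector`, the environment variable `TEMP_RAMDISK_MAX_BYTES`, and the 256 MiB default are all placeholders chosen for illustration; the real threshold (N) still needs to be determined.

```ruby
require 'tmpdir'

# Hypothetical helper: picks a temp directory by content type and source size.
module TempDirSelector
  # Placeholder default for N; the real value is still to be decided.
  DEFAULT_THRESHOLD_BYTES = 256 * 1024 * 1024

  module_function

  # TEMP_RAMDISK_MAX_BYTES is an assumed variable name.
  def threshold_bytes
    Integer(ENV.fetch('TEMP_RAMDISK_MAX_BYTES', DEFAULT_THRESHOLD_BYTES))
  end

  # Sources at or below the threshold go to the RAM disk, larger ones to local disk.
  def for_source(type, source_path, ramdisk_root: '/ramdisk', local_root: '/local')
    root = File.size(source_path) <= threshold_bytes ? ramdisk_root : local_root
    File.join(root, type.to_s)
  end
end
```

For example, `TempDirSelector.for_source(:ocr, path)` would return /ramdisk/ocr for a small source file and /local/ocr for a large one.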
We can add a Memory type emptyDir volume (/ramdisk) for each worker. When the Pod is started, the RAM disk gets created and mounted at the directory we give it. If the container restarts or crashes, the data will persist, but when we boot a new version of the container or move it between nodes, the data won't persist. This is fine for temporary files; it's even a helpful feature, since it keeps orphaned temporary files from accumulating. We'll need to define a threshold file size below which items are small enough for the RAM disk.
For larger items, we'll have to write those to disk. We'll investigate enhancements to local container storage to make those as fast as possible.
We already have Hydra::Derivatives.temp_file_base. We should add some additional temp_file_base config variables and use the appropriate one depending on the derivative content type.
Each of these new variables should be set from corresponding environment variables.
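A minimal sketch of what per-type configuration might look like is below. The module `DerivativeTempConfig` and every environment variable name in it are hypothetical; none of them exist in Hydra::Derivatives or the app today.

```ruby
require 'tmpdir'

# Hypothetical configuration holder; not part of Hydra::Derivatives.
module DerivativeTempConfig
  # Assumed environment variable names, one per content type.
  ENV_VARS = {
    video: 'VIDEO_TEMP_FILE_BASE',
    audio: 'AUDIO_TEMP_FILE_BASE',
    ocr: 'OCR_TEMP_FILE_BASE',
    general: 'GENERAL_TEMP_FILE_BASE'
  }.freeze

  module_function

  # Looks up the type-specific base, then the general base, then the system temp dir.
  # Unknown types are treated as :general.
  def temp_file_base(type)
    ENV[ENV_VARS.fetch(type, ENV_VARS[:general])] ||
      ENV[ENV_VARS[:general]] ||
      Dir.tmpdir
  end
end
```

In the app itself, the final fallback could be Hydra::Derivatives.temp_file_base instead of Dir.tmpdir, so that existing deployments keep their current behavior when none of the new variables are set.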
Are there other content types that need their own temp_file_base?
module OregonDigital::Derivatives::Image
  # Simple derivative utility functions
  class Utils
    class << self
      # Generates a temporary file and passes its path to the given block. The
      # file is deleted at the end of execution
      def tmp_file(ext)
        f = Tempfile.new(['od2', ".#{ext}"], Hydra::Derivatives.temp_file_base)
        begin
          yield f.path
        ensure
          f.close
          f.unlink
        end
      end
    end
  end
end