[rqd] Avoid overloaded log files created by RQD's jobs

Open ramonfigueiredo opened this issue 1 year ago • 1 comments

Description

Over the years, multiple instances of Digital Content Creation (DCC) jobs running on OpenCue have produced log files of excessive sizes. The causes have varied - ranging from internal code issues to external vendor code outside immediate control. While some instances have been mitigated, new occurrences continue to arise, often taking significant time to report and resolve with vendors.

To address this, implement a mechanism to limit the size of log files generated by jobs managed by RQD. Excessively large log files are rarely useful, often indicate underlying issues that prevent job completion, and can result in quota exhaustion, delaying resolution. By imposing a log file size limit, problematic jobs can be terminated automatically, allowing artists or development teams to investigate and resolve the root cause.

Proposed solution

  • Introduce a configurable RQD constant, JOB_LOG_MAX_SIZE_IN_BYTES, to define the maximum allowable size for job log files.
  • Set a default threshold (e.g., 1GB) that can be adjusted based on studio requirements.
  • When a log file exceeds the defined limit, terminate the job automatically.
  • Ensure proper logging and error messaging to notify teams of the termination and provide actionable details.
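
A minimal sketch of the proposed check, assuming the limit is a plain integer constant and that RQD can poll the log file's size on disk. The helper name check_log_size, the 1 GiB default, the "non-positive disables the check" rule, and the message wording are illustrative assumptions, not RQD's actual code.

```python
import os
import tempfile

# Hypothetical default of 1 GiB, following the "e.g., 1GB" suggestion above.
JOB_LOG_MAX_SIZE_IN_BYTES = 1024 ** 3


def check_log_size(log_path, max_bytes=JOB_LOG_MAX_SIZE_IN_BYTES):
    """Return a kill message if the log exceeds max_bytes, else None.

    Assumption of this sketch: a non-positive max_bytes disables the check.
    """
    if max_bytes <= 0:
        return None
    try:
        size = os.path.getsize(log_path)
    except OSError:
        # The log may not exist yet; treat that as "within the limit".
        return None
    if size > max_bytes:
        return (
            f"Job log size exceeded limit: {size} bytes > {max_bytes} bytes. "
            f"Log: {log_path}. Terminating job."
        )
    return None


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x\n" * 10)  # 20 bytes of fake job output
    print(check_log_size(tmp.name, max_bytes=16))    # over the limit: message
    print(check_log_size(tmp.name, max_bytes=1024))  # under the limit: None
    os.unlink(tmp.name)
```

The caller would decide what to do with a non-None result, for example killing the frame's process and recording the message.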

Benefits

  • Prevents log files of unreasonable sizes, ensuring system stability and storage availability.
  • Automates the handling of rogue processes without the need for constant monitoring.
  • Facilitates quicker identification and resolution of underlying issues by redirecting focus to root causes.

This solution will safeguard against storage quota exhaustion and improve the overall reliability of the system.

ramonfigueiredo avatar Dec 03 '24 19:12 ramonfigueiredo

FYI

@DiegoTavares

ramonfigueiredo avatar Dec 03 '24 19:12 ramonfigueiredo

The Academy Software Foundation (ASWF) Dev Days (https://www.aswf.io/dev-days/) is a fantastic way to contribute to the amazing and important #ASWF projects (https://www.aswf.io/projects/) that positively impact our industry. Whether you’re in #vfx, #animation, #softwareengineering, or #softwaredeveloper, this is a great opportunity to get involved.

📅 Next Dev Days: September 25, 2025 (fully virtual, open worldwide)

For this event, we are highlighting the OpenCue project (https://docs.opencue.io/), a production-proven render management system originally developed at Sony Pictures Imageworks and part of the ASWF. We’ve prepared more than 40 easy issues and enhancements that are ideal for first-time contributors. These tasks are straightforward, designed to take less than a day to complete, and some can be finished in just a few hours. Everyone is invited to participate and contribute!

🔗 OpenCue codebase: https://github.com/AcademySoftwareFoundation/OpenCue
📘 OpenCue documentation: https://docs.opencue.io
📖 Developer guide: https://docs.opencue.io/docs/developer-guide/index/
🧪 Using the OpenCue Sandbox for Testing: https://docs.opencue.io/docs/developer-guide/sandbox-testing/
💬 OpenCue Slack channel: https://academysoftwarefdn.slack.com/archives/CMFPXV39Q
Available tasks/issues and enhancements (good first issue + help wanted): https://github.com/AcademySoftwareFoundation/OpenCue/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22%20label%3A%22help%20wanted%22&page=2


✅ How to Contribute Effectively

  • Focus on one issue at a time. Please don’t try to solve or implement all the main issues at once.
  • When your work is complete, submit a full pull request (PR). We’ll review it as soon as possible.
  • If your PR is not yet ready, submit it as a Draft Pull Request. Once you’re confident it’s ready, change the status to Open so maintainers know it can be reviewed.

How to create a Draft PR on GitHub:

  1. When opening a pull request, click the dropdown on the green “Create pull request” button and select “Create draft pull request”.
  2. Your PR will be clearly marked as Draft until you’re ready.
  3. When it’s ready for review, click “Ready for review” at the top of the PR page - this changes the status to Open.

Thank you for your contribution!
Before we can move forward with the code review, please sign the CLA Authorization - EasyCLA

➡️ You only need to sign the CLA once, when you submit your first pull request to the OpenCue project. After that, you’re all set for future contributions.

Everyone’s collaboration is very much appreciated. Thank you for helping grow OpenCue and for contributing to the software our industry relies on!

ramonfigueiredo avatar Sep 12 '25 18:09 ramonfigueiredo

FYI ...

The OpenCue sandbox provides a quick way to run Cuebot, RQD, CueGUI, and CueSubmit locally for testing. This environment is ideal for developers who want to test changes, experiment with features, or learn how OpenCue works without setting up a full production environment.

ramonfigueiredo avatar Sep 18 '25 21:09 ramonfigueiredo

Note:

Thanks for your interest in solving this issue!

This issue (#1609) targets the Python version of RQD because it was created before the Rust RQD implementation existed. The expectation is for the change to go into Python first, with relevant fixes carried over to Rust later. Feel free to include the same solution in the Rust version of RQD as well.

ramonfigueiredo avatar Sep 19 '25 18:09 ramonfigueiredo

Hi @ramonfigueiredo,

Thanks for the clarification!
I haven’t worked with Rust before, so for Dev Days I will focus on fixing this issue in the Python version of RQD first to get familiar with the OpenCue development workflow. If possible, I might later look into the Rust version in a future ticket.

Could you please assign this issue to me? Thanks!

KihangPark avatar Sep 19 '25 19:09 KihangPark

Hi @KihangPark,

Thanks for volunteering to take this on! That sounds like a great plan. Starting with the Python implementation in RQD is the perfect way to get familiar with the OpenCue workflow.

I’ve assigned the issue to you. Looking forward to your contributions.

ramonfigueiredo avatar Sep 19 '25 23:09 ramonfigueiredo

I'm currently implementing the feature for this issue and have done manual testing. The tests appear to be successful. I'm waiting for CLA configuration and plan to submit a PR soon. Before doing so, I'd like to document my testing approach here and ask for feedback on whether there are more appropriate or simpler verification methods. My testing process required several adjustments that I found challenging, likely due to my unfamiliarity with the system. Any guidance would be greatly appreciated!

KihangPark avatar Sep 24 '25 05:09 KihangPark

Testing rqcore.runLinux Functionality

After updating the code, I conducted verification as follows. Please let me know if there are any areas that need improvement, mistakes, or simpler methods.

Since docker-compose.yml directly references opencue/rqd, I needed to locally build RQD with the updated code. Looking at the RQD Dockerfile, it creates and copies wheels, so I used the following setup to create a local RQD:

#!/usr/bin/env bash
source sandbox-venv/bin/activate
# Ensure build tool is available
python -m pip -q install -U build
# Build wheels (quiet)
python -m build proto/ >/dev/null
python -m build rqd/   >/dev/null
# Resolve produced wheel paths
PROTO_WHL=$(ls -1 proto/dist/opencue_proto-*.whl | head -n 1 || true)
RQD_WHL=$(ls -1 rqd/dist/opencue_rqd-*.whl     | head -n 1 || true)
# Build dev image embedding the wheels
docker build -f rqd/Dockerfile \
  --build-arg OPENCUE_PROTO_PACKAGE_PATH="${PROTO_WHL}" \
  --build-arg OPENCUE_RQD_PACKAGE_PATH="${RQD_WHL}" \
  -t opencue/rqd:dev .

After building the local RQD, I made the following modifications to docker-compose.yml:

rqd:
-    image: opencue/rqd
+    image: opencue/rqd:dev
     environment:
       - PYTHONUNBUFFERED=1
       - CUEBOT_HOSTNAME=cuebot
     volumes:
-      - /tmp/rqd/logs:/tmp/rqd/logs
-      - /tmp/rqd/shots:/tmp/rqd/shots
+      - ./tmp/rqd/logs:/tmp/rqd/logs
+      - ./tmp/rqd/shots:/tmp/rqd/shots
+      - ./sandbox/rqd.conf.dev:/etc/opencue/rqd.conf:ro

Since this ticket requires configuration via rqd.conf, I loaded the following rqd.conf.dev:

[Override]
USE_NIMBY_PYNPUT=false
JOB_LOG_MAX_SIZE_IN_BYTES=1048576
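
As a rough illustration of how such an override might be consumed, the sketch below parses an [Override] section with Python's standard configparser and falls back to a default when the key is absent. The function name load_log_limit and the default value are hypothetical; how RQD itself reads rqd.conf is not shown in this thread.

```python
import configparser

# Hypothetical fallback (1 GiB) used when the override is not present.
DEFAULT_JOB_LOG_MAX_SIZE_IN_BYTES = 1024 ** 3


def load_log_limit(conf_text, section="Override"):
    """Return JOB_LOG_MAX_SIZE_IN_BYTES from conf_text, or the default."""
    parser = configparser.RawConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(conf_text)
    if parser.has_option(section, "JOB_LOG_MAX_SIZE_IN_BYTES"):
        return parser.getint(section, "JOB_LOG_MAX_SIZE_IN_BYTES")
    return DEFAULT_JOB_LOG_MAX_SIZE_IN_BYTES


if __name__ == "__main__":
    sample = (
        "[Override]\n"
        "USE_NIMBY_PYNPUT=false\n"
        "JOB_LOG_MAX_SIZE_IN_BYTES=1048576\n"
    )
    print(load_log_limit(sample))          # value taken from the override
    print(load_log_limit("[Override]\n"))  # falls back to the default
```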

After startup, I submitted a test job and confirmed that it was properly killed when exceeding the log limit. The log files generated showed the expected behavior, with proper rotation (.1 files) and clear kill messages indicating the size limit was exceeded.

$ ls ./tmp/rqd/logs/testing/test_shot/logs/testing-test_shot-math_test_job15--0b815d4b-53f8-4e60-b18a-fd39eca80e05/

testing-test_shot-math_test_job15.0001-test_job.rqlog    
testing-test_shot-math_test_job15.0002-test_job.rqlog    
testing-test_shot-math_test_job15.0003-test_job.rqlog
testing-test_shot-math_test_job15.0001-test_job.rqlog.1  
testing-test_shot-math_test_job15.0002-test_job.rqlog.1  
testing-test_shot-math_test_job15.0003-test_job.rqlog.1
...
x
Job log size exceeded limit: 1048578 bytes > 1048576 bytes. Log: /tmp/rqd/logs/testing/test_shot/logs/testing-test_shot-math_test_job15--0b815d4b-53f8-4e60-b18a-fd39eca80e05/testing-test_shot-math_test_job15.0001-test_job.rqlog. 
Terminating job.x
x
x
x
x
x
x
x
===========================================================
RenderQ Job Complete

exitStatus          1
exitSignal          9
killMessage         Job log size exceeded limit: 1048847 bytes > 1048576 bytes. Log: /tmp/rqd/logs/testing/test_shot/logs/testing-test_shot-math_test_job15--0b815d4b-53f8-4e60-b18a-fd39eca80e05/testing-test_shot-math_test_job15.0001-test_job.rqlog. Terminating job.
startTime           Wed Sep 24 03:55:53 2025
endTime             Wed Sep 24 03:56:37 2025
...
===========================================================

KihangPark avatar Sep 24 '25 05:09 KihangPark

Testing runDocker Functionality

For runDocker, Docker configuration was required, so I made the following modifications.

rqd.conf.dev:

[Override]
USE_NIMBY_PYNPUT=false
JOB_LOG_MAX_SIZE_IN_BYTES=1048576
[docker.config]
RUN_ON_DOCKER=True
DOCKER_SHELL_PATH=/usr/bin/sh
[docker.mounts]
TEMP=type:bind,source:/tmp,target:/tmp
[docker.images]
rqd=opencue/rqd:latest

Modified docker-compose.yml:

rqd:
-    image: opencue/rqd
+    image: opencue/rqd:dev
+    pid: "host"
     environment:
     ports:
       - "8444:8444"
     volumes:
-      - /tmp/rqd/logs:/tmp/rqd/logs
-      - /tmp/rqd/shots:/tmp/rqd/shots
+      - /tmp:/tmp
+      - ./tmp/rqd/logs:/tmp/rqd/logs
+      - ./tmp/rqd/shots:/tmp/rqd/shots
+      - ./sandbox/rqd.conf.dev:/etc/opencue/rqd.conf:ro
+      - /var/run/docker.sock:/var/run/docker.sock:rw

Since the Python Docker SDK was required inside the RQD container, I modified the Dockerfile:

+# Install Python Docker SDK for RUN_ON_DOCKER runtime support inside the RQD container
+RUN python3.9 -m pip install docker==7.1.0

The jobs were properly killed and logs were generated as expected. However, one concern is that in Docker mode, killed jobs show a state of "Finished" rather than indicating failure, despite being terminated for exceeding the log size limit. I'm treating this as a separate issue for now, as it might be related to how Docker-based jobs are handled differently. (Let me know if it needs to be handled in this ticket as well.)

The core functionality appears to work correctly - jobs are terminated when log size exceeds the limit, and proper kill messages are logged.

$ ls -sl ./tmp/rqd/logs/testing/test_shot/logs/testing-test_shot-math_test_job19--df2e5e1f-b0a5-4b74-a9ae-25cab25d073e/
total 3084
1028 -rw-rw-rw-. 1 root root 1049529 Sep 23 21:47 testing-test_shot-math_test_job19.0001-test_job.rqlog
1028 -rw-rw-rw-. 1 root root 1049528 Sep 23 21:47 testing-test_shot-math_test_job19.0002-test_job.rqlog
1028 -rw-rw-rw-. 1 root root 1049528 Sep 23 21:47 testing-test_shot-math_test_job19.0003-test_job.rqlog
...
===========================================================
DOCKER_ENTRYPOINT = #!/bin/sh
useradd -u 1000 -g 20 -p [password] math >/dev/null 2>&1 || true

exec su -s /usr/bin/sh math -c "echo \$$
  /usr/bin/time -p -o /tmp/rqd-stat-6c114171-9083-479e-8862-1e1e77a60bf0-1758689149.446497  yes x | head -n 2000000"
Container 97d46e7d83e8 started for testing-test_shot-math_test_job19.0001-test_job(6c114171-9083-479e-8862-1e1e77a60bf0) with pid 442280442308
x
...
x
x
x
Job log size exceeded limit: 1048578 bytes > 1048576 bytes. Log: /tmp/rqd/logs/testing/test_shot/logs/testing-test_shot-math_test_job19--df2e5e1f-b0a5-4b74-a9ae-25cab25d073e/testing-test_shot-math_test_job19.0001-test_job.rqlog. Terminating job.
===========================================================
RenderQ Job Complete

exitStatus          0
exitSignal          0
killMessage         Job log size exceeded limit: 1048578 bytes > 1048576 bytes. Log: /tmp/rqd/logs/testing/test_shot/logs/testing-test_shot-math_test_job19--df2e5e1f-b0a5-4b74-a9ae-25cab25d073e/testing-test_shot-math_test_job19.0001-test_job.rqlog. Terminating job.
...
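
For reference, the arithmetic behind the test command in the entrypoint above (yes x | head -n 2000000) can be sketched as follows. It counts only the command's own output; header lines that RQD writes to the log are not included, so the real file reaches the limit slightly earlier. The function name is hypothetical.

```python
LIMIT = 1_048_576  # JOB_LOG_MAX_SIZE_IN_BYTES from rqd.conf.dev


def yes_output_bytes(lines, token="x"):
    """Bytes written by `yes <token> | head -n <lines>`: the token plus a newline per line."""
    return lines * (len(token) + 1)


if __name__ == "__main__":
    total = yes_output_bytes(2_000_000)
    print(total)                    # 4000000
    print(round(total / LIMIT, 2))  # 3.81
    print(LIMIT // len("x\n"))      # 524288
```

The command therefore produces about 3.8 times the configured limit, and the reported size of 1048578 bytes is one "x" line (2 bytes) past the 1048576-byte threshold.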

KihangPark avatar Sep 24 '25 05:09 KihangPark

Some Questions for Confirmation

  1. Platform Coverage: This ticket requests killing jobs when logs become too large. I found that RQD has multiple execution modes: normal Linux execution (rqcore.runLinux), Docker execution (rqcore.runDocker, rqcore.recoverDocker), and also Windows and Darwin modes (rqcore.runDarwin, rqcore.runWindows). Should the same log size limiting functionality be applied to all platforms, or is Linux/Docker coverage sufficient for now?

  2. Docker Image Compatibility: During runDocker testing, I encountered a "Bad fd number" error and had to modify the following code in rqcore.py to make it work:

-useradd -u %s -g %s -p %s %s >& /dev/null || true;
+useradd -u %s -g %s -p %s %s >/dev/null 2>&1 || true;

This is likely related to the Docker image I used for testing. What Docker image should I use to avoid this issue without code modifications? Is there a recommended base image to set under [docker.images] in rqd.conf for RQD Docker testing?

KihangPark avatar Sep 24 '25 05:09 KihangPark

I am waiting for the CLA setup now, and I plan to submit a PR with these updates as soon as possible.

https://github.com/AcademySoftwareFoundation/OpenCue/compare/master...KihangPark:OpenCue:feature/1609-rqd-limit-log-size

KihangPark avatar Sep 25 '25 01:09 KihangPark

Hi @ramonfigueiredo,

Sorry for the delay in creating the PR. (I am still waiting for the CLA setup from my company.) I will submit the PR once it is set up.

KihangPark avatar Sep 26 '25 18:09 KihangPark

No problem. Thanks!

ramonfigueiredo avatar Sep 29 '25 21:09 ramonfigueiredo

@ramonfigueiredo the CLA should be approved for @KihangPark for any PR to this project.

dekekincaid avatar Oct 01 '25 07:10 dekekincaid

Hi @KihangPark,

Please try to submit your PR. We will review it asap

ramonfigueiredo avatar Oct 01 '25 16:10 ramonfigueiredo

@DiegoTavares FYI

ramonfigueiredo avatar Oct 01 '25 16:10 ramonfigueiredo

Same answers in the PR:

  • https://github.com/AcademySoftwareFoundation/OpenCue/pull/2019

Question 1: Platform Coverage

Yes, you should add the same log size limiting functionality to runWindows (line 1344) and runDarwin (line 1400). Both methods have similar output reading loops (see lines 1375-1380 for Windows and 1430-1435 for Darwin) where they write to rqlog.

The issue description doesn't limit the scope to Linux only, and DCC jobs can run on any platform. For consistency and completeness, all platforms should have this protection.
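
One way to keep the platforms consistent is a shared helper that each output-reading loop calls. The sketch below is an assumption about how that could look, not the code in rqcore.py: run_with_log_limit is a hypothetical name, it counts bytes as they are written rather than polling the file, and it uses only the standard subprocess module, so the same logic applies on Linux, Windows, and macOS.

```python
import os
import subprocess
import sys
import tempfile


def run_with_log_limit(command, log_path, max_bytes):
    """Copy a child's output into log_path; kill the child once the log outgrows max_bytes.

    Returns (returncode, kill_message); kill_message is None if the limit was not hit.
    A non-positive max_bytes disables the limit (an assumption of this sketch).
    """
    kill_message = None
    written = 0
    with open(log_path, "wb") as log, subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    ) as proc:
        for line in proc.stdout:
            log.write(line)
            written += len(line)
            if max_bytes > 0 and written > max_bytes:
                kill_message = (
                    f"Job log size exceeded limit: {written} bytes > "
                    f"{max_bytes} bytes. Log: {log_path}. Terminating job."
                )
                log.write(kill_message.encode() + b"\n")
                proc.kill()
                break
        proc.wait()
    return proc.returncode, kill_message


if __name__ == "__main__":
    noisy = "import sys\nfor _ in range(1000000): sys.stdout.write('x\\n')"
    log_file = os.path.join(tempfile.mkdtemp(), "demo.rqlog")
    code, message = run_with_log_limit([sys.executable, "-c", noisy], log_file, 1024)
    print(code, message)
```

Counting written bytes avoids a stat call per line; polling the file size instead would also catch output that reaches the log by another route.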

Question 2: Docker Shell Compatibility

The >& /dev/null syntax is bash-specific and doesn't work in POSIX-compliant shells like dash (often linked as /bin/sh). Your fix to use >/dev/null 2>&1 is correct and should be applied. However, this appears to be a pre-existing issue in the codebase, not directly related to this PR. You could either:

  • Include it as a small fix in this PR (with a note in the commit message)
  • File it as a separate issue/PR (Create the new issue on OpenCue)

For Docker image compatibility, the code uses DOCKER_SHELL_PATH=/usr/bin/sh from the config. The base image should have a POSIX-compliant shell at that path. The opencue/rqd:latest image should work, but the shell redirection issue needs fixing regardless.
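
To illustrate the portable form, the sketch below appends the POSIX redirection to a command and runs it through /bin/sh. The helper names silenced and run_in_sh are hypothetical, and /bin/sh is assumed to exist on the machine running the example.

```python
import subprocess

# Portable form: send stdout to /dev/null, then duplicate stderr onto stdout.
POSIX_SILENCE = ">/dev/null 2>&1"


def silenced(command):
    """Return command with all output discarded and failures ignored."""
    return f"{command} {POSIX_SILENCE} || true"


def run_in_sh(script, shell_path="/bin/sh"):
    """Run script with the given shell and capture whatever it still prints."""
    return subprocess.run(
        [shell_path, "-c", script], capture_output=True, text=True, check=False
    )


if __name__ == "__main__":
    # `ls` on a missing path fails and writes to stderr; both are suppressed.
    result = run_in_sh(silenced("ls /path-that-does-not-exist-for-demo"))
    print(result.returncode, repr(result.stdout), repr(result.stderr))
```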

ramonfigueiredo avatar Oct 02 '25 01:10 ramonfigueiredo