[Core] ray.init() overrides the SIGINT handler and causes an error in torch.compile
What happened + What you expected to happen
Upon startup, I noticed a weird stack trace. The stack trace comes from a conflict between two things:
- When Ray starts up, it applies a "signal monkey patch" that prevents a SIGINT handler from being set while SIGINT is deferred (code).
- When torch initiates compilation, it tries to set its own no-op SIGINT handler to avoid annoying output logs (code).
One way to fix this is to turn off asynchronous torch compilation by setting TORCHINDUCTOR_COMPILE_THREADS to 1 (code). I verified this empirically, but forcing torch compilation to be synchronous doesn't seem like a good solution.
Is it possible to keep the torch compilation thread pool and avoid this exception when using Ray?
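For reference, the workaround described above can be sketched as follows. The environment variable must be set before inductor reads its config (i.e., before the first compilation, ideally before importing torch):

```python
import os

# Workaround from this report: TORCHINDUCTOR_COMPILE_THREADS=1 makes
# torch.compile compile synchronously, so inductor never spawns the
# worker pool whose initializer calls signal.signal(SIGINT, SIG_IGN)
# and trips Ray's signal monkey patch.
os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"
```

The downside, as noted, is that all compilation becomes serial.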
(RayTrainWorker pid=6243) Exception in initializer: [repeated 247x across cluster]
(RayTrainWorker pid=6243) Traceback (most recent call last): [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/concurrent/futures/process.py", line 240, in _process_worker [repeated 247x across cluster]
(RayTrainWorker pid=6243)     initializer(*initargs) [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2554, in _async_compile_initializer [repeated 247x across cluster]
(RayTrainWorker pid=6243)     signal.signal(signal.SIGINT, signal.SIG_IGN) [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/site-packages/ray/_private/utils.py", line 1879, in _signal_monkey_patch [repeated 247x across cluster]
(RayTrainWorker pid=6243)     raise ValueError( [repeated 247x across cluster]
(RayTrainWorker pid=6243) ValueError: Can't set signal handler for SIGINT while SIGINT is being deferred within a DeferSigint context. [repeated 247x across cluster]
Versions / Dependencies
PyTorch 2.3.0, Ray 2.32.0
Reproduction script
import ray
import ray.train
import ray.train.torch
from ray.train import RunConfig, ScalingConfig

# WARNING: I have not directly tested this reduced script.
# do_my_training() and train_func_config come from the full script.
def train_func(config):
    do_my_training()

def train_model():
    num_workers = int(ray.available_resources().get("GPU", 0))
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=True)
    run_config = RunConfig(name="some_name")
    trainer = ray.train.torch.TorchTrainer(
        train_func,
        train_loop_config=train_func_config,
        scaling_config=scaling_config,
        run_config=run_config,
    )
    trainer.fit()

ray.init()
train_model()
Issue Severity
Low: It annoys or frustrates me.
We should wrap the user code's SIGINT handler with our _set_task_cancelled, and afterwards restore only the user's SIGINT handler.
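A minimal sketch of that idea, with hypothetical names that do not reflect Ray's actual internals: instead of raising ValueError when user code sets a SIGINT handler inside the defer context, record the requested handler and install it once deferral ends.

```python
import signal

class DeferSigint:
    """Illustrative only: defer SIGINT handler installation instead of
    raising, then restore the user's requested handler on exit."""

    def __init__(self):
        self._pending_handler = None
        self._orig_signal = signal.signal

    def _patched_signal(self, signum, handler):
        if signum == signal.SIGINT:
            # Remember what the caller wanted and report the handler that
            # was (logically) in effect, rather than raising ValueError.
            previous = (self._pending_handler
                        if self._pending_handler is not None
                        else signal.getsignal(signal.SIGINT))
            self._pending_handler = handler
            return previous
        return self._orig_signal(signum, handler)

    def __enter__(self):
        signal.signal = self._patched_signal
        return self

    def __exit__(self, *exc):
        signal.signal = self._orig_signal
        # Restore the handler the user code asked for, if any.
        if self._pending_handler is not None:
            signal.signal(signal.SIGINT, self._pending_handler)
        return False
```

With this behavior, torch's `signal.signal(signal.SIGINT, signal.SIG_IGN)` call in the pool initializer would succeed instead of crashing the worker, and the handler would take effect after the defer context ends.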
Thanks @jjyao for your consideration. Can you say more about how I might be able to work around this? Because of some slight changes to how I am using Ray, this has gone from an annoying log to a blocker. In particular, I believe there is a race condition: when I call ray.init() earlier in my procedure, the delay before I start my training causes training to fail entirely.
@MortalHappiness PTAL on this issue. We can discuss it