[Core] ray.init() overrides the SIGINT handler and causes an error in torch.compile
What happened + What you expected to happen
Upon startup, I noticed a weird stack trace. The stack trace comes from a conflict between two things:
- When Ray starts up, it applies a "signal monkey patch" that prevents a SIGINT handler from being set while SIGINT is deferred (code).
- When torch initiates compilation, it tries to set its own no-op SIGINT handler to avoid annoying output logs (code).
One way to fix this is to turn off asynchronous torch compilation by setting TORCHINDUCTOR_COMPILE_THREADS to 1 (code). I verified this empirically, but forcing torch compilation to be synchronous doesn't seem like a good solution.
Is it possible to keep the torch compilation thread pool and avoid this exception when using Ray?
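For reference, the workaround described above can be sketched as follows. The environment variable must be set before inductor reads its config (i.e., before the first compilation, ideally before importing torch):

```python
import os

# Workaround from this report: TORCHINDUCTOR_COMPILE_THREADS=1 makes
# torch.compile compile synchronously, so inductor never spawns the
# worker pool whose initializer calls signal.signal(SIGINT, SIG_IGN)
# and trips Ray's signal monkey patch.
os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"
```

The downside, as noted, is that all compilation becomes serial.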
(RayTrainWorker pid=6243) Exception in initializer: [repeated 247x across cluster]
(RayTrainWorker pid=6243) Traceback (most recent call last): [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/concurrent/futures/process.py", line 240, in _process_worker [repeated 247x across cluster]
(RayTrainWorker pid=6243)     initializer(*initargs) [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2554, in _async_compile_initializer [repeated 247x across cluster]
(RayTrainWorker pid=6243)     signal.signal(signal.SIGINT, signal.SIG_IGN) [repeated 247x across cluster]
(RayTrainWorker pid=6243)   File "/opt/conda/lib/python3.11/site-packages/ray/_private/utils.py", line 1879, in _signal_monkey_patch [repeated 247x across cluster]
(RayTrainWorker pid=6243)     raise ValueError( [repeated 247x across cluster]
(RayTrainWorker pid=6243) ValueError: Can't set signal handler for SIGINT while SIGINT is being deferred within a DeferSigint context. [repeated 247x across cluster]
Versions / Dependencies
PyTorch 2.3.0, Ray 2.32.0
Reproduction script
import ray
import ray.train
import ray.train.torch
from ray.train import RunConfig, ScalingConfig

# WARNING: I have not directly tested this reduced script.
# do_my_training() and train_func_config come from the full script.
def train_func(config):
    do_my_training()

def train_model():
    num_workers = int(ray.available_resources().get("GPU", 0))
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=True)
    run_config = RunConfig(name="some_name")
    trainer = ray.train.torch.TorchTrainer(
        train_func,
        train_loop_config=train_func_config,
        scaling_config=scaling_config,
        run_config=run_config,
    )
    trainer.fit()

ray.init()
train_model()
Issue Severity
Low: It annoys or frustrates me.
We should wrap the user code's SIGINT handler with our _set_task_cancelled, and afterwards restore only the user's SIGINT handler.
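A minimal sketch of that idea, with hypothetical names that do not reflect Ray's actual internals: instead of raising ValueError when user code sets a SIGINT handler inside the defer context, record the requested handler and install it once deferral ends.

```python
import signal

class DeferSigint:
    """Illustrative only: defer SIGINT handler installation instead of
    raising, then restore the user's requested handler on exit."""

    def __init__(self):
        self._pending_handler = None
        self._orig_signal = signal.signal

    def _patched_signal(self, signum, handler):
        if signum == signal.SIGINT:
            # Remember what the caller wanted and report the handler that
            # was (logically) in effect, rather than raising ValueError.
            previous = (self._pending_handler
                        if self._pending_handler is not None
                        else signal.getsignal(signal.SIGINT))
            self._pending_handler = handler
            return previous
        return self._orig_signal(signum, handler)

    def __enter__(self):
        signal.signal = self._patched_signal
        return self

    def __exit__(self, *exc):
        signal.signal = self._orig_signal
        # Restore the handler the user code asked for, if any.
        if self._pending_handler is not None:
            signal.signal(signal.SIGINT, self._pending_handler)
        return False
```

With this behavior, torch's `signal.signal(signal.SIGINT, signal.SIG_IGN)` call in the pool initializer would succeed instead of crashing the worker, and the handler would take effect after the defer context ends.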
Thanks @jjyao for your consideration. Can you say more about how I might be able to work around this? Because of some slight changes to how I am using Ray, this has gone from an annoying log to a blocker. In particular, I believe there is a race condition: when I call ray.init() earlier in my procedure, the delay before I start my training causes training to fail entirely.
@MortalHappiness PTAL on this issue. We can discuss it