
SLURM Shebang Line Issues with slurm.py

Open · jasongallant opened this issue 11 months ago · 1 comment

Hi There-

I'm trying to run progressiveCactus, which depends on Toil, on our HPCC. I'm running into an issue that I've been able to diagnose from the logs. Our HPCC combines several different architectures, and it requires users to submit jobs with a specific shebang line:

#!/bin/bash --login

https://docs.icer.msu.edu/Frequently_Asked_Questions_FAQ_/#why-did-i-get-an-illegal-instruction-error

As I understand the code here: https://github.com/DataBiosphere/toil/blob/4cc1707dc5b90eef38021bcbe15ef1e2163c02bd/src/toil/batchSystems/slurm.py#L299-L313

individual jobs are submitted using sbatch's --wrap option, which generates a job script with a shebang line incompatible with our cluster's requirement.
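For illustration, here is a simplified sketch (not Toil's actual code; the function name and parameters are hypothetical) of the submission pattern the linked slurm.py code uses. Because the command goes through --wrap, SLURM generates the batch script itself with its own default shebang, so there is no place for a site-required line like #!/bin/bash --login:

```python
def prepare_sbatch(cpus: int, mem_mb: int, job_name: str, command: str) -> list[str]:
    """Hypothetical, simplified version of how Toil builds the sbatch
    command line: resource options plus a --wrap'd command. SLURM then
    writes the wrapper script (and its shebang) itself."""
    args = [
        "sbatch",
        "-J", job_name,
        f"--cpus-per-task={cpus}",
        f"--mem={mem_mb}",
    ]
    # The job command is handed to sbatch as a wrap string, not as a
    # user-authored script, so the user cannot control the shebang line.
    args.append(f"--wrap=exec {command}")
    return args
```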

Any ideas on how to fix here so that we can get toil working on our cluster?

Thanks in advance!

Best, Jason Gallant

Issue is synchronized with this Jira Story · Issue Number: TOIL-1754

jasongallant · May 20 '25 13:05

To make this really work in your environment, we might need to know more about what /bin/bash --login does to environment variables like PATH when it sets up the right binaries for each node's architecture. Even if we successfully add the right setup steps to each Toil job, if Toil then resets the environment and applies the PATH from the leader node before running your actual work, you'll probably still get the wrong binaries running.

Toil applies the environment from the leader, and prepends the leader's PATH to the PATH on each worker, here: https://github.com/DataBiosphere/toil/blob/4cc1707dc5b90eef38021bcbe15ef1e2163c02bd/src/toil/worker.py#L338-L371
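A rough sketch of that behavior (this is not Toil's exact logic from worker.py, just an illustration of the merge described above; the function name is hypothetical): leader variables overwrite the worker's, and for PATH the leader's entries are prepended, so any architecture-specific PATH set up by a login shell would be shadowed by the leader's entries.

```python
import os

def apply_leader_environment(leader_env: dict, worker_env: dict) -> dict:
    """Illustrative merge: leader variables win outright, except PATH,
    where the leader's entries are prepended to the worker's existing
    PATH. Leader entries therefore take precedence on lookup."""
    merged = dict(worker_env)
    for key, value in leader_env.items():
        if key == "PATH" and key in worker_env:
            # Leader PATH comes first, so its binaries shadow the
            # worker's architecture-specific ones.
            merged[key] = value + os.pathsep + worker_env[key]
        else:
            merged[key] = value
    return merged
```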

You also might have trouble with the Cactus binaries themselves; the Cactus workflow has a few different ways to get the Cactus binaries to actually run on the node, but I don't think any of them are multi-architecture-aware. So unless you have Cactus binaries that are built so they can run on all nodes, you could also still have trouble here. In particular I'm not sure you'd be able to support mixed ARM/amd64 architectures; I think the published Cactus binaries are built for a lowest-common-denominator amd64 architecture.

If all you need for your cluster is for Toil to run each job from a shell that was launched with bash --login, I think we would change [f"--wrap=exec {command}"] to [f"--wrap=exec bash --login -c {shlex.quote('exec ' + command)}"] (quoting the whole -c argument once, so the outer wrap shell doesn't expand anything in the command before the login shell sees it). But I'm not sure what the best approach would be to make that change for your cluster specifically, without changing all clusters over to launching an extra login shell. Maybe just add this as a particular thing Toil can do with a flag to turn it on?
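A flag-gated version of that change might look like the following sketch (the function name and login_shell parameter are hypothetical, not existing Toil API). The whole inner command is quoted once with shlex.quote so the outer wrap shell passes it to the login shell verbatim:

```python
import shlex

def build_sbatch_wrap(command: str, login_shell: bool = False) -> list[str]:
    """Build the sbatch --wrap argument, optionally re-executing the job
    under `bash --login -c ...` so the site's login-shell setup (e.g.
    architecture-specific PATH adjustments) runs before the job."""
    if login_shell:
        # shlex.quote protects the inner command from the outer wrap
        # shell; it is only parsed once, by the login shell's -c.
        wrap = f"exec bash --login -c {shlex.quote('exec ' + command)}"
    else:
        wrap = f"exec {command}"
    return [f"--wrap={wrap}"]
```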

But if that does something like set up a PATH pointing to the appropriate architecture, or set an environment variable to a value different from what is set on the leader, that's all going to be clobbered when Toil restores the environment for the job. We'd need to add whatever environment variable gets set to the list Toil ignores, and if it's PATH we'd need to figure out how the needed changes can be applied to the PATH while still working right with virtual environments and other PATH changes we need to take from the leader. (Does your system maybe have an API, or a particular way for programs to work out what the current architecture is and where binaries for that architecture live, without going through standard environment variables?)

You could also use TOIL_SLURM_ARGS or --slurmPartition to direct your jobs to nodes that have architectures that are the same as or compatible with the node you are running the Toil leader on.
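As a concrete sketch of that workaround (the constraint and partition names below are placeholders; substitute whatever your site defines):

```shell
# Pass extra sbatch options to every Toil-submitted job; here, a
# hypothetical node constraint matching the leader's architecture.
export TOIL_SLURM_ARGS="--constraint=intel18"

# Or name a single-architecture partition when launching the workflow
# (partition name is a placeholder for one defined on your cluster):
# ... --batchSystem slurm --slurmPartition general-long ...
```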

adamnovak · May 20 '25 15:05