Develop an improved understanding of interactions between `--ntasks`, `srun` and other related Slurm features
What would you like to see added?
Matt pointed out that `--ntasks` impacts the output of `nproc` on our system, which led to a discussion about when `--ntasks` might impact downstream processing. As far as we can tell, applications are free to ignore the "advice" that `--ntasks` gives to downstream processes; however, `--ntasks` has implications for how `srun` is used to manage tasks (a small sketch of what that "advice" looks like from inside a job follows the links below).
- Here is a source for much more information: https://slurm.schedmd.com/mc_support.html
Here are some tangentially related sources:
- https://stackoverflow.com/questions/75782584/multiprocessing-with-slurm-increasing-number-of-cpus-per-ask-works-but-not-incr/75800054#75800054
- https://unix.stackexchange.com/questions/427705/why-nproc-show-less-than-nproc-all
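As a starting point for what that "advice" actually looks like from inside a job, here is a minimal sketch (values are illustrative; the environment variables are standard Slurm ones) of what a process in the allocation can inspect:

```bash
# Minimal sketch: inside an interactive job, the "advice" from --ntasks and
# --cpus-per-task shows up as environment variables, while the enforced part
# shows up as the CPU affinity mask of the shell.
srun --ntasks=2 --pty /bin/bash              # request 2 tasks (1 core each by default)

echo "$SLURM_NTASKS"                         # number of tasks requested
echo "${SLURM_CPUS_PER_TASK:-unset}"         # only set when --cpus-per-task is given
echo "$SLURM_CPUS_ON_NODE"                   # cores allocated to the job on this node
grep Cpus_allowed_list /proc/self/status     # affinity mask of this shell (what nproc reports)
```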
As a couple of practical examples, both `nproc` and Python's `multiprocessing` module only see the cores assigned to the invoking process. Requesting `--ntasks=X` allocates resources for up to X user-launched processes, each assigned one core, since `--cpus-per-task` defaults to one. See the `nproc` example below:
[mdefende@login004 ~]$ srun --ntasks=2 --pty /bin/bash
[mdefende@c0115 ~]$ nproc
1
[mdefende@login004 ~]$ srun --ntasks=1 --cpus-per-task=2 --pty /bin/bash
[mdefende@c0115 ~]$ nproc
2
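The flip side is that both allocated cores are still usable, just through `srun`-launched tasks rather than a single multi-threaded process. A minimal sketch, assuming the typical `salloc` behavior of dropping you into a shell that holds the allocation and the core binding observed above:

```bash
# Minimal sketch: the two cores requested with --ntasks=2 are meant to be
# driven by two srun-launched tasks, not by one multi-threaded process.
salloc --ntasks=2                # same 2-task / 2-core allocation as above
srun bash -c 'echo "task $SLURM_PROCID sees $(nproc) core(s)"'
# -> task 0 sees 1 core(s)
# -> task 1 sees 1 core(s)
```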
If this only affected `nproc`, it wouldn't be very meaningful; however, it affects Python as well. See the example using a `multiprocessing` pool:
[mdefende@login004 ~]$ srun --time=12:00:00 --mem=8G --ntasks=4 --partition=amd-hdr100 --pty /bin/bash
[mdefende@c0235 ~]$ python # using Python 3.13 interpreter from a conda env (not shown)
>>> import multiprocessing
>>> pool = multiprocessing.Pool()
>>> pool
<multiprocessing.pool.Pool state=RUN pool_size=1>
[mdefende@login004 ~]$ srun --ntasks=1 --cpus-per-task=4 --pty /bin/bash
[mdefende@c0235 ~]$ python # using Python 3.13 interpreter from a conda env (not shown)
>>> import multiprocessing
>>> pool = multiprocessing.Pool()
>>> pool
<multiprocessing.pool.Pool state=RUN pool_size=4>
So there could be cases where users start a job with `--ntasks=4` and are then only able to use a single core instead of the 4 allocated cores.
It's important to make a note about Python here. For Python < 3.13, you can set the pool size as an argument when creating the pool; otherwise it defaults to the number of cores on the node, not the number allocated to the job. Python 3.13 changed this so that the default Pool size is the number of cores accessible to the Python process. That's why the example here uses Python 3.13: to get a better representation of the actual pool size. This can also be confirmed on any Python 3 version using os.sched_getaffinity(0), which returns the set of logical cores accessible to the Python process: `--ntasks=4` returns a set of size 1, while `--cpus-per-task=4` returns a set of size 4 (a short sketch of both checks is below).
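A minimal sketch of both checks, usable inside either kind of allocation above (any Python 3 on Linux; the conda environment itself isn't shown):

```bash
# Minimal sketch: check the usable-core count directly, and size the pool
# explicitly on Python < 3.13 instead of trusting the default.
python3 -c 'import os; print(sorted(os.sched_getaffinity(0)))'
#   --ntasks=4                    -> e.g. [0]           (1 usable core)
#   --ntasks=1 --cpus-per-task=4  -> e.g. [0, 1, 2, 3]  (4 usable cores)

# Explicitly sized pool, independent of the interpreter's default behavior:
python3 -c 'import os, multiprocessing as mp; p = mp.Pool(processes=len(os.sched_getaffinity(0))); print(p); p.close(); p.join()'
```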
Another fun `--ntasks` vs. `--cpus-per-task` behavior: if you're requesting a multi-GPU job, specifying `--ntasks` with `--cpus-per-task` can result in cores with affinity for only one GPU being assigned to the job. For instance, I'm creating a Dask job on the amperenodes with 2 GPUs and 64 cores. I thought it would be fine to create 2 total tasks with 32 CPUs per task (one task managing data I/O for each GPU). However, this resulted in cores from only socket 0 being allocated to the job, meaning GPU1 had no local cores for data I/O. See the example below:
[mdefende@login004 ~]$ srun --time=12:00:00 --mem=360G --ntasks=2 --cpus-per-task=32 --partition=amperenodes --gres=gpu:2 --reservation=rc-gpfs --pty nvidia-smi topo -m
|      | GPU0 | GPU1 | NIC0 | NIC1 | CPU Affinity | NUMA Affinity | GPU NUMA ID |
|------|------|------|------|------|--------------|---------------|-------------|
| GPU0 | X    | SYS  | NODE | SYS  | 0-63         | 0             | N/A         |
| GPU1 | SYS  | X    | SYS  | NODE |              | 1             | N/A         |
| NIC0 | NODE | SYS  | X    | SYS  |              |               |             |
| NIC1 | SYS  | NODE | SYS  | X    |              |               |             |
You can see the CPU Affinity column reports 64 cores for GPU0 and none for GPU1. However, if you only specify `--ntasks`, the cores are distributed evenly across both sockets:
[mdefende@login004 ~]$ srun --time=12:00:00 --mem=360G --ntasks=64 --partition=amperenodes --gres=gpu:2 --reservation=rc-gpfs --pty nvidia-smi topo -m
|      | GPU0 | GPU1 | NIC0 | NIC1 | CPU Affinity | NUMA Affinity | GPU NUMA ID |
|------|------|------|------|------|--------------|---------------|-------------|
| GPU0 | X    | SYS  | NODE | SYS  | 0-31         | 0             | N/A         |
| GPU1 | SYS  | X    | SYS  | NODE | 64-95        | 1             | N/A         |
| NIC0 | NODE | SYS  | X    | SYS  |              |               |             |
| NIC1 | SYS  | NODE | SYS  | X    |              |               |             |
Based on previous discussions, though, this isn't the "ideal" way we'd like to teach people to request resources. So can we still use `--cpus-per-task` while distributing the tasks evenly across sockets? Luckily, we can, using the `--ntasks-per-socket` option:
[mdefende@login004 ~]$ srun --time=12:00:00 --mem=360G --ntasks=2 --cpus-per-task=32 --ntasks-per-socket=1 --partition=amperenodes --gres=gpu:2 --reservation=rc-gpfs --pty nvidia-smi topo -m
|      | GPU0 | GPU1 | NIC0 | NIC1 | CPU Affinity | NUMA Affinity | GPU NUMA ID |
|------|------|------|------|------|--------------|---------------|-------------|
| GPU0 | X    | SYS  | NODE | SYS  | 0-31         | 0             | N/A         |
| GPU1 | SYS  | X    | SYS  | NODE | 64-95        | 1             | N/A         |
| NIC0 | NODE | SYS  | X    | SYS  |              |               |             |
| NIC1 | SYS  | NODE | SYS  | X    |              |               |             |
So for the edge cases where a person is requesting multiple GPUs, I think they would ideally request a number of tasks equal to the number of GPUs and set `--ntasks-per-socket=1`. A sketch of what that could look like as a batch script is below.
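A minimal sketch of that recommendation as a batch script (partition, memory, and time values are just carried over from the interactive examples above and would vary per job; the job name and worker command are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=two-gpu-sketch        # hypothetical name
#SBATCH --partition=amperenodes
#SBATCH --gres=gpu:2
#SBATCH --ntasks=2                       # one task per GPU
#SBATCH --ntasks-per-socket=1            # force the two tasks onto different sockets
#SBATCH --cpus-per-task=32
#SBATCH --mem=360G
#SBATCH --time=12:00:00

# Confirm the affinity layout (the batch step sees the whole allocation):
nvidia-smi topo -m

# Launch the real per-GPU workers, one per task:
# srun <per-GPU worker command>          # placeholder for the actual workload
```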
We may need to reconfigure gres.conf to get the affinity working, per this discussion: https://groups.google.com/g/slurm-users/c/-mMEauNGP9c
We will also need to work on the internals of OOD job creation.
There are also additional flags that look related, like `--cores-per-socket`. We will need to carefully build an understanding of how these pieces interact; a first guess at how that particular flag differs is sketched below.
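For `--cores-per-socket` specifically, my current (unverified) understanding from the srun/sbatch man pages is that it is a node selection constraint rather than a task binding option, so it probably doesn't replace `--ntasks-per-socket` for the GPU case. Something like the following would be worth testing:

```bash
# Hedged sketch, to be verified on our system: per the srun/sbatch man pages,
# --cores-per-socket only restricts node *selection* (use nodes with at least
# this many cores per socket); it is not a binding directive like
# --ntasks-per-socket, so it likely doesn't fix GPU/core affinity on its own.
srun --ntasks=2 --cpus-per-task=32 --cores-per-socket=32 \
     --partition=amperenodes --gres=gpu:2 --pty nvidia-smi topo -m
```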
Related to #505