Develop an improved understanding of interactions between `--ntasks`, `srun` and other related Slurm features
What would you like to see added?
Matt pointed out that `--ntasks` impacts the output of `nproc` on our system, which led to a discussion about when `--ntasks` might impact downstream processing. As far as we can tell, applications are free to ignore the "advice" that `--ntasks` gives to downstream processes; however, `--ntasks` has implications for how `srun` is used to manage tasks (a small sketch of what that "advice" looks like from inside a job follows the links below).
- Here is a source for much more information: https://slurm.schedmd.com/mc_support.html
Here are some tangentially related sources:
- https://stackoverflow.com/questions/75782584/multiprocessing-with-slurm-increasing-number-of-cpus-per-ask-works-but-not-incr/75800054#75800054
- https://unix.stackexchange.com/questions/427705/why-nproc-show-less-than-nproc-all
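As a starting point for what that "advice" actually looks like from inside a job, here is a minimal sketch (values are illustrative; the environment variables are standard Slurm ones) of what a process in the allocation can inspect:

```bash
# Minimal sketch: inside an interactive job, the "advice" from --ntasks and
# --cpus-per-task shows up as environment variables, while the enforced part
# shows up as the CPU affinity mask of the shell.
srun --ntasks=2 --pty /bin/bash              # request 2 tasks (1 core each by default)

echo "$SLURM_NTASKS"                         # number of tasks requested
echo "${SLURM_CPUS_PER_TASK:-unset}"         # only set when --cpus-per-task is given
echo "$SLURM_CPUS_ON_NODE"                   # cores allocated to the job on this node
grep Cpus_allowed_list /proc/self/status     # affinity mask of this shell (what nproc reports)
```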
As a couple of practical examples, both `nproc` and Python's `multiprocessing` module only see the cores assigned to the invoking process. Requesting `--ntasks=X` allocates resources for up to X user-launched processes, each assigned one core, since `--cpus-per-task` defaults to one. See the `nproc` example below:
[mdefende@login004 ~]$ srun --ntasks=2 --pty /bin/bash
[mdefende@c0115 ~]$ nproc
1
[mdefende@login004 ~]$ srun --ntasks=1 --cpus-per-task=2 --pty /bin/bash
[mdefende@c0115 ~]$ nproc
2
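The flip side is that both allocated cores are still usable, just through `srun`-launched tasks rather than a single multi-threaded process. A minimal sketch, assuming the typical `salloc` behavior of dropping you into a shell that holds the allocation and the core binding observed above:

```bash
# Minimal sketch: the two cores requested with --ntasks=2 are meant to be
# driven by two srun-launched tasks, not by one multi-threaded process.
salloc --ntasks=2                # same 2-task / 2-core allocation as above
srun bash -c 'echo "task $SLURM_PROCID sees $(nproc) core(s)"'
# -> task 0 sees 1 core(s)
# -> task 1 sees 1 core(s)
```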
If this only affected `nproc`, it wouldn't be very meaningful; however, it affects Python as well. See the example using a `multiprocessing` pool:
[mdefende@login004 ~]$ srun --time=12:00:00 --mem=8G --ntasks=4 --partition=amd-hdr100 --pty /bin/bash
[mdefende@c0235 ~]$ python # using Python 3.13 interpreter from a conda env (not shown)
>>> import multiprocessing
>>> pool = multiprocessing.Pool()
>>> pool
<multiprocessing.pool.Pool state=RUN pool_size=1>
[mdefende@login004 ~]$ srun --ntasks=1 --cpus-per-task=4 --pty /bin/bash
[mdefende@c0235 ~]$ python # using Python 3.13 interpreter from a conda env (not shown)
>>> import multiprocessing
>>> pool = multiprocessing.Pool()
>>> pool
<multiprocessing.pool.Pool state=RUN pool_size=4>
So there could be cases where users start a job with `--ntasks=4` and are then only able to use a single core instead of the 4 allocated cores.
It's important to make a note about Python here. For Python < 3.13, you can set the pool size as an argument when creating the pool; otherwise it defaults to the number of cores on the node, not the number allocated to the job. Python 3.13 changed this so that the default Pool size is the number of cores accessible to the Python process. That's why the example here uses Python 3.13: to get a better representation of the actual pool size. This can also be confirmed on any Python 3 version using os.sched_getaffinity(0), which returns the set of logical cores accessible to the Python process: `--ntasks=4` returns a set of size 1, while `--cpus-per-task=4` returns a set of size 4 (a short sketch of both checks is below).
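A minimal sketch of both checks, usable inside either kind of allocation above (any Python 3 on Linux; the conda environment itself isn't shown):

```bash
# Minimal sketch: check the usable-core count directly, and size the pool
# explicitly on Python < 3.13 instead of trusting the default.
python3 -c 'import os; print(sorted(os.sched_getaffinity(0)))'
#   --ntasks=4                    -> e.g. [0]           (1 usable core)
#   --ntasks=1 --cpus-per-task=4  -> e.g. [0, 1, 2, 3]  (4 usable cores)

# Explicitly sized pool, independent of the interpreter's default behavior:
python3 -c 'import os, multiprocessing as mp; p = mp.Pool(processes=len(os.sched_getaffinity(0))); print(p); p.close(); p.join()'
```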
Another fun `--ntasks` vs. `--cpus-per-task` behavior: if you're requesting a multi-GPU job, specifying `--ntasks` with `--cpus-per-task` can result in cores with affinity for only one GPU being assigned to the job. For instance, I'm creating a Dask job on the amperenodes with 2 GPUs and 64 cores. I thought it would be fine to create 2 total tasks with 32 CPUs per task (one task managing data I/O for each GPU). However, this resulted in cores from only socket 0 being allocated to the job, meaning GPU1 had no local cores for data I/O. See the example below:
[mdefende@login004 ~]$ srun --time=12:00:00 --mem=360G --ntasks=2 --cpus-per-task=32 --partition=amperenodes --gres=gpu:2 --reservation=rc-gpfs --pty nvidia-smi topo -m
|      | GPU0 | GPU1 | NIC0 | NIC1 | CPU Affinity | NUMA Affinity | GPU NUMA ID |
|------|------|------|------|------|--------------|---------------|-------------|
| GPU0 | X    | SYS  | NODE | SYS  | 0-63         | 0             | N/A         |
| GPU1 | SYS  | X    | SYS  | NODE |              | 1             | N/A         |
| NIC0 | NODE | SYS  | X    | SYS  |              |               |             |
| NIC1 | SYS  | NODE | SYS  | X    |              |               |             |
You can see the CPU Affinity column reports 64 cores for GPU0 and none for GPU1. However, if you only specify `--ntasks`, the cores are distributed evenly across both sockets:
[mdefende@login004 ~]$ srun --time=12:00:00 --mem=360G --ntasks=64 --partition=amperenodes --gres=gpu:2 --reservation=rc-gpfs --pty nvidia-smi topo -m
|      | GPU0 | GPU1 | NIC0 | NIC1 | CPU Affinity | NUMA Affinity | GPU NUMA ID |
|------|------|------|------|------|--------------|---------------|-------------|
| GPU0 | X    | SYS  | NODE | SYS  | 0-31         | 0             | N/A         |
| GPU1 | SYS  | X    | SYS  | NODE | 64-95        | 1             | N/A         |
| NIC0 | NODE | SYS  | X    | SYS  |              |               |             |
| NIC1 | SYS  | NODE | SYS  | X    |              |               |             |
Based on previous discussions, though, this isn't the "ideal" way we'd like to teach people to request resources. So can we still use `--cpus-per-task` while distributing the tasks evenly across sockets? Luckily, we can, using the `--ntasks-per-socket` option:
[mdefende@login004 ~]$ srun --time=12:00:00 --mem=360G --ntasks=2 --cpus-per-task=32 --ntasks-per-socket=1 --partition=amperenodes --gres=gpu:2 --reservation=rc-gpfs --pty nvidia-smi topo -m
|      | GPU0 | GPU1 | NIC0 | NIC1 | CPU Affinity | NUMA Affinity | GPU NUMA ID |
|------|------|------|------|------|--------------|---------------|-------------|
| GPU0 | X    | SYS  | NODE | SYS  | 0-31         | 0             | N/A         |
| GPU1 | SYS  | X    | SYS  | NODE | 64-95        | 1             | N/A         |
| NIC0 | NODE | SYS  | X    | SYS  |              |               |             |
| NIC1 | SYS  | NODE | SYS  | X    |              |               |             |
So for the edge cases where a person is requesting multiple GPUs, I think they would ideally request a number of tasks equal to the number of GPUs and set `--ntasks-per-socket=1`. A sketch of what that could look like as a batch script is below.
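A minimal sketch of that recommendation as a batch script (partition, memory, and time values are just carried over from the interactive examples above and would vary per job; the job name and worker command are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=two-gpu-sketch        # hypothetical name
#SBATCH --partition=amperenodes
#SBATCH --gres=gpu:2
#SBATCH --ntasks=2                       # one task per GPU
#SBATCH --ntasks-per-socket=1            # force the two tasks onto different sockets
#SBATCH --cpus-per-task=32
#SBATCH --mem=360G
#SBATCH --time=12:00:00

# Confirm the affinity layout (the batch step sees the whole allocation):
nvidia-smi topo -m

# Launch the real per-GPU workers, one per task:
# srun <per-GPU worker command>          # placeholder for the actual workload
```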
We may need to reconfigure gres.conf to get the affinity working, per this discussion: https://groups.google.com/g/slurm-users/c/-mMEauNGP9c
We will also need to work on the internals of OOD job creation.
There are also additional flags that look related, like `--cores-per-socket`. We will need to carefully build an understanding of how these pieces interact; a first guess at how that particular flag differs is sketched below.
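For `--cores-per-socket` specifically, my current (unverified) understanding from the srun/sbatch man pages is that it is a node selection constraint rather than a task binding option, so it probably doesn't replace `--ntasks-per-socket` for the GPU case. Something like the following would be worth testing:

```bash
# Hedged sketch, to be verified on our system: per the srun/sbatch man pages,
# --cores-per-socket only restricts node *selection* (use nodes with at least
# this many cores per socket); it is not a binding directive like
# --ntasks-per-socket, so it likely doesn't fix GPU/core affinity on its own.
srun --ntasks=2 --cpus-per-task=32 --cores-per-socket=32 \
     --partition=amperenodes --gres=gpu:2 --pty nvidia-smi topo -m
```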
Related to #505