SLURM cluster with multiple batch pools assigned to a single partition
Feature Request Description
I would like to be able to assign multiple batch pools (i.e., VM sizes) to a single SLURM partition, so that SLURM can do resource management using the `--mem` and `--cpus-per-task` flags. Currently, attempting to submit an sbatch/srun job using these flags with the unmodified slurm.conf generated by shipyard fails (e.g., `srun: error: Unable to allocate resources: Requested node configuration is not available`).
Currently, jobs can only be targeted to a specific batch pool via the `--partition` or `--constraint` flags, since the `NodeName=` lines in the generated slurm.conf don't contain resource specifications like `CoresPerSocket` or `RealMemory`.
I'd like to be able to use a configuration like this: two (or more) pre-created batch pools, `x32core64G` and `x8core16G` (VM sizes `STANDARD_F32s_v2` and `STANDARD_F8s_v2`), mapping to a single SLURM partition `mypartition` (trimmed example):
```yaml
slurm:
  slurm_options:
    elastic_partitions:
      mypartition:
        batch_pools:
          x32core64G:
            compute_node_type: dedicated
            max_compute_nodes: 2
            weight: 4
            reclaim_exclude_num_nodes: 0
          x8core16G:
            compute_node_type: low_priority
            max_compute_nodes: 4
            weight: 3
            reclaim_exclude_num_nodes: 0
        default: true
        max_runtime_limit: 7.00:00:00
```
I can manually add `CoresPerSocket` and `RealMemory` values to slurm.conf on the login and controller nodes, restart slurmctld, and then submit jobs using `--mem` and `--cpus-per-task`.
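For illustration, my manual edits look roughly like this (the node name prefixes and exact `RealMemory` values are my own choices; `RealMemory` is set somewhat below the nominal VM memory to leave headroom for the OS):

```
# STANDARD_F32s_v2 pool: 32 vCPUs, 64 GiB
NodeName=mypartition-x32core64G-[0-1] CPUs=32 RealMemory=63000 State=CLOUD
# STANDARD_F8s_v2 pool: 8 vCPUs, 16 GiB
NodeName=mypartition-x8core16G-[0-3] CPUs=8 RealMemory=15000 State=CLOUD
PartitionName=mypartition Nodes=mypartition-x32core64G-[0-1],mypartition-x8core16G-[0-3] Default=YES
```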
However, I find that with a default single partition (`mypartition`) mapping to multiple batch pools, only the final batch pool (`x8core16G` in this case) ever receives jobs and autoscales. I believe this is because the `shipyardslurm` Table only holds a single `BatchPoolId` per partition (the final one defined in slurm.yaml/slurm.conf), so only a single batch pool ever autoscales in this case?
I'd like to be able to use a configuration like this with a single default partition and multiple batch pools, and have SLURM/shipyard automatically assign jobs to the correct node type / batch pool based on the `--cpus-per-task` and `--mem` flags.
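For example (hypothetical invocations), both of these should be satisfiable from the single default partition, with SLURM picking the appropriately sized pool:

```
# should only fit on an x32core64G (STANDARD_F32s_v2) node
srun --cpus-per-task=32 --mem=60G ./my_job
# should fit on an x8core16G (STANDARD_F8s_v2) node
srun --cpus-per-task=4 --mem=8G ./my_job
```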
Describe Preferred Solution
- Shipyard should query VM specifications for each batch pool and add `CoresPerSocket` and `RealMemory` (or similar) values to each `NodeName` line in the generated `slurm.conf`.
- Make the autoscaling / powersaving scripts (e.g. `/var/batch-shipyard/slurm.py`, the `shipyardslurm` Table partition to batch pool mappings) work when a partition maps to multiple batch pools. I'm unsure of exactly the changes required to make this part work.
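As a rough sketch of the first item (the VM spec table, node naming scheme, and memory headroom factor below are all my own assumptions, not Shipyard's actual internals; in Shipyard the specs could instead come from the Azure compute SDK):

```python
# Sketch: generate slurm.conf NodeName lines with resource specs per batch pool.
# The VM spec table and naming scheme are illustrative assumptions.

# vCPU count and memory (MiB) per VM size.
VM_SPECS = {
    'STANDARD_F32s_v2': {'vcpus': 32, 'memory_mib': 65536},
    'STANDARD_F8s_v2': {'vcpus': 8, 'memory_mib': 16384},
}

# Reserve a fraction of memory for the OS so the RealMemory promised in
# slurm.conf stays below what slurmd actually reports on the node.
MEM_HEADROOM = 0.95


def node_name_line(partition, pool, vm_size, max_nodes):
    """Build one slurm.conf NodeName= line for a batch pool."""
    spec = VM_SPECS[vm_size]
    real_memory = int(spec['memory_mib'] * MEM_HEADROOM)
    return (
        'NodeName={part}-{pool}-[0-{last}] CPUs={cpus} '
        'RealMemory={mem} State=CLOUD'.format(
            part=partition, pool=pool, last=max_nodes - 1,
            cpus=spec['vcpus'], mem=real_memory)
    )


def partition_lines(partition, pools):
    """Emit NodeName lines for every pool plus one PartitionName line."""
    lines = [node_name_line(partition, p, vm, n) for p, vm, n in pools]
    node_ranges = ','.join(
        '{}-{}-[0-{}]'.format(partition, p, n - 1) for p, vm, n in pools)
    lines.append(
        'PartitionName={} Nodes={} Default=YES'.format(partition, node_ranges))
    return lines


for line in partition_lines('mypartition', [
        ('x32core64G', 'STANDARD_F32s_v2', 2),
        ('x8core16G', 'STANDARD_F8s_v2', 4)]):
    print(line)
```

With per-pool `NodeName` lines like these, SLURM's scheduler can select the right node type from `--cpus-per-task`/`--mem` on its own; the remaining work would be in the autoscaling side, which needs the partition-to-pool mapping to be one-to-many.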