Submitting a task with requiredSlots greater than an autoscaling pool's taskSlotsPerNode causes the pool to scale up and hang in an idle state
Problem Description
It's possible to submit a task that requires more task slots than the number of task slots per node configured at the pool level. If the pool uses an autoscale formula based on the number of pending tasks, the pool will scale up and remain scaled up indefinitely, without scheduling the task to a node.
Steps to Reproduce
- Create a new pool using the `Standard_D4_v3` VM size.
  - It doesn't matter whether it's set for container-based tasks or not.
- Enable autoscaling with the following formula:
  (Example 2 from the documentation on pool autoscaling, tweaked slightly for a reduced sampling interval.)

  ```
  // Get pending tasks for the past 5 minutes.
  $samples = $PendingTasks.GetSamplePercent(TimeInterval_Minute * 5);
  // If we have fewer than 70 percent data points, we use the last sample point,
  // otherwise we use the maximum of last sample point and the history average.
  $tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) :
      max($PendingTasks.GetSample(1), avg($PendingTasks.GetSample(TimeInterval_Minute * 5)));
  // If number of pending tasks is not 0, set targetVM to pending tasks, otherwise
  // half of current dedicated.
  $targetVMs = $tasks > 0 ? $tasks : max(0, $TargetDedicatedNodes / 2);
  // The pool size is capped at 20, if target VM value is more than that, set it
  // to 20. This value should be adjusted according to your use case.
  $TargetDedicatedNodes = max(0, min($targetVMs, 20));
  // Set node deallocation mode - let running tasks finish before removing a node
  $NodeDeallocationOption = taskcompletion;
  ```

- Set the autoscale formula evaluation interval to 5 minutes.
- Set "Task slots per node" to 4.
- Create a new job in the pool you just created.
- Create a new task in the job you just created.
  - The command line doesn't matter, because the task will never get run.
- Set "Required slots" (
requiredSlots) to 8. - The task must be created via the API or Batch Explorer, as the Batch portal (rightly) won't let you set "Required slots" greater than "Task slots per node".
- Wait ~5 minutes. (A scripted version of these steps is sketched below.)
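For convenience, here is a rough scripted version of the steps above using the azure-batch Python SDK. This is a sketch, not a verified repro script: the account endpoint, credentials, and VM image details are placeholders, and parameter names such as `task_slots_per_node` and `required_slots` assume a recent SDK version (azure-batch >= 10.x).

```python
from datetime import timedelta

from azure.batch import BatchServiceClient
from azure.batch import models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder credentials/endpoint -- substitute your own account details.
credentials = SharedKeyCredentials("myaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://myaccount.westus2.batch.azure.com"
)

# The autoscale formula from the repro steps (comments stripped for brevity).
autoscale_formula = """
$samples = $PendingTasks.GetSamplePercent(TimeInterval_Minute * 5);
$tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) :
    max($PendingTasks.GetSample(1), avg($PendingTasks.GetSample(TimeInterval_Minute * 5)));
$targetVMs = $tasks > 0 ? $tasks : max(0, $TargetDedicatedNodes / 2);
$TargetDedicatedNodes = max(0, min($targetVMs, 20));
$NodeDeallocationOption = taskcompletion;
"""

# Autoscaling pool with 4 task slots per node. The image reference is just an
# illustrative Ubuntu image; any supported image works.
client.pool.add(batchmodels.PoolAddParameter(
    id="repro-pool",
    vm_size="Standard_D4_v3",
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical",
            offer="0001-com-ubuntu-server-focal",
            sku="20_04-lts",
            version="latest",
        ),
        node_agent_sku_id="batch.node.ubuntu 20.04",
    ),
    task_slots_per_node=4,
    enable_auto_scale=True,
    auto_scale_formula=autoscale_formula,
    auto_scale_evaluation_interval=timedelta(minutes=5),
))

# A job bound to that pool, plus a task requiring 8 slots -- more than any
# single node in the pool can ever offer, so it can never be scheduled.
client.job.add(batchmodels.JobAddParameter(
    id="repro-job",
    pool_info=batchmodels.PoolInformation(pool_id="repro-pool"),
))
client.task.add("repro-job", batchmodels.TaskAddParameter(
    id="repro-task",
    command_line="/bin/bash -c 'echo this never runs'",
    required_slots=8,
))
```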
Expected Results
The pool does not scale up, because doing so would be pointless: an 8-slot task can never be scheduled onto a 4-slot node.
Actual Results
The pool scales up to 1 node, which remains idle indefinitely.
Additional Comments
This behavior can also be demonstrated using the autoscaling formula simulation API, but it's more fun to try it for real and watch the pool burn money before your eyes 😄
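For anyone preferring the cheaper route, the dry run looks roughly like this with the Python SDK. This is a sketch that reuses the `client` and `autoscale_formula` from the snippet above and assumes the pool already has autoscaling enabled, which the evaluate-autoscale operation requires.

```python
# Dry-run the autoscale formula against the existing pool instead of waiting
# for a real evaluation; no nodes are allocated by this call.
result = client.pool.evaluate_auto_scale("repro-pool", autoscale_formula)

# The evaluation result is a semicolon-separated list of variable assignments;
# it reports a nonzero $TargetDedicatedNodes even though the pending task can
# never fit on any node in the pool.
print(result.results)
```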
I think the core issue here isn't actually with autoscaling, but that seems to be where the behavior is most visibly problematic. Since the taskSlotsPerNode pool setting is immutable, a (non-multi-instance) task with requiredSlots greater than its pool's taskSlotsPerNode is unschedulable and will never be schedulable.
I think the Task_Add and Task_AddCollection operations should simply fail if requiredSlots is > taskSlotsPerNode. As mentioned above, the Batch portal already recognizes this implicit constraint and makes it explicit.
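As a stopgap, callers can approximate that check on the client side before submitting. Here's a minimal sketch against the Python SDK; the `add_task_checked` helper is my own invention rather than an existing API, and it only works once the job is bound to a pool, which is exactly the caveat the documentation raises below.

```python
def add_task_checked(client, job_id, task):
    """Refuse to submit a task whose required_slots exceeds the pool's task slots per node.

    Hypothetical client-side helper; the Batch service itself performs no such check today.
    """
    job = client.job.get(job_id)
    pool = client.pool.get(job.execution_info.pool_id)
    slots_per_node = pool.task_slots_per_node or 1
    if (task.required_slots or 1) > slots_per_node:
        raise ValueError(
            f"Task {task.id!r} requires {task.required_slots} slots, but pool "
            f"{pool.id!r} only offers {slots_per_node} slots per node; "
            "it would never be scheduled."
        )
    client.task.add(job_id, task)
```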
I found a section in the documentation that advises against setting `requiredSlots` > `taskSlotsPerNode` and gives a rationale for not enforcing this at submission time:
> Be sure you don't specify a task's `requiredSlots` to be greater than the pool's `taskSlotsPerNode`. This will result in the task never being able to run. The Batch Service doesn't currently validate this conflict when you submit tasks because a job may not have a pool bound at submission time, or it could be changed to a different pool by disabling/re-enabling.
This rationale is sensible; I had not considered that a task could be reassigned to a different pool. However, barring some other solution I haven't considered, I believe the current behavior's conflict with autoscaling is severe enough to necessitate some way of preventing the state where Batch scales up a node that remains idle forever.
Two possible ideas:
- Add a flag to the `Task_Add` and `Task_AddCollection` operations that, when enabled, causes the operation to fail if `requiredSlots` is greater than `taskSlotsPerNode`. I would opine that this flag should be enabled by default, but either way works.
- Task metrics in autoscale formulas (`$PendingTasks`, `$ActiveTasks`, etc.) could completely ignore tasks where `requiredSlots` is greater than `taskSlotsPerNode`. This seems like a rather clunky solution, but it would solve the main problem with autoscaling.
  - If the feature request in #118 is implemented, the proposed task-slot-wise metrics should have this same behavior.