batch-shipyard

Pool resize with NC24rs_v3 fails to find PKEYS during nodeprep

Open · themorey opened this issue 5 years ago · 1 comment

Problem Description

Creating a multi-instance pool with NC24rs_v3 fails during node prep because shipyard_nodeprep.sh looks for the mlx5_0 device (lines 1609-1612):

export_ib_pkey()
{
    key0=$(cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/0)
    key1=$(cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/1)

The NC24rs_v3 has a ConnectX-3 card, which is identified as mlx4_0, not mlx5_0. Manually modifying shipyard_nodeprep.sh each time a pool is created works around the issue.
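One possible direction for a fix is to discover the device name from sysfs instead of hardcoding it. The following is only a sketch, not the actual Batch Shipyard code: `detect_ib_device` and `read_ib_pkeys` are hypothetical helper names, and the sketch assumes the first `mlx*` entry under `/sys/class/infiniband` is the device to use and that port 1 is the right port, as in the original function.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of shipyard_nodeprep.sh): print the name of
# the first mlx* device under the given sysfs root, e.g. mlx4_0 or mlx5_0.
detect_ib_device() {
    local ib_root="${1:-/sys/class/infiniband}"
    local dev
    for dev in "$ib_root"/mlx*; do
        # The glob stays literal when nothing matches, so test the path.
        if [ -d "$dev" ]; then
            basename "$dev"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: read pkeys 0 and 1 from port 1 of the detected
# device and print them separated by a space.
read_ib_pkeys() {
    local ib_root="${1:-/sys/class/infiniband}"
    local dev key0 key1
    dev=$(detect_ib_device "$ib_root") || return 1
    key0=$(cat "$ib_root/$dev/ports/1/pkeys/0") || return 1
    key1=$(cat "$ib_root/$dev/ports/1/pkeys/1") || return 1
    echo "$key0 $key1"
}
```

The sysfs root is a parameter only so the helpers can be exercised against a fake directory tree; on a real node they would be called with no arguments.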

Batch Shipyard Version

3.9.1 (Mac)

Steps to Reproduce

Resize a multi-instance pool containing NC24rs_v3 and wait for it to fail.

Expected Results

Node finds the PKEYS and boots normally without intervention.

Actual Results

Manual intervention is required each time a pool is created or modified.

Redacted Configuration

pool_specification:
  id: arvinas-relion-pool-NCv3
  vm_configuration:
    platform_image:
      offer: CentOS-HPC
      publisher: OpenLogic
      sku: '7.7'
      version: '7.7.2020062600'
  vm_count:
    dedicated: 0
    low_priority: 0
  vm_size: STANDARD_NC24rs_v3
  autoscale:
    evaluation_interval: 00:05:00
    scenario:
      name: active_tasks
      maximum_vm_count:
        dedicated: 4
        low_priority: 4
      maximum_vm_increment_per_evaluation:
        dedicated: -1
        low_priority: -1
      bias_node_type: low_priority
  inter_node_communication_enabled: true
  virtual_network:
    arm_subnet_id: /subscriptions/{sub}/resourceGroups/{RG}/providers/Microsoft.Network/virtualNetworks/{Vnet}/subnets/{sn}
  ssh:
    username: shipyard

themorey, Nov 04 '20 18:11

It looks like the environment variable SHIPYARD_USER_CMD in the file .shipyard.envlist also hardcodes UCX_NET_DEVICES=mlx5_0:1. This causes multi-node MPI jobs to fail on Gen1 VMs, which have mlx4 devices.
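As a rough illustration of what a non-hardcoded value could look like, the sketch below builds the `device:port` string from whatever `mlx*` device is present. The function name `ucx_net_devices` is hypothetical, and the sketch assumes the first matching device and port 1 are the ones MPI should use. It does not show how Batch Shipyard itself populates SHIPYARD_USER_CMD.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a UCX_NET_DEVICES value such as mlx4_0:1 or
# mlx5_0:1, based on the first mlx* device found under the sysfs root.
ucx_net_devices() {
    local ib_root="${1:-/sys/class/infiniband}"
    local port="${2:-1}"
    local dev
    for dev in "$ib_root"/mlx*; do
        # Skip the literal, unmatched glob when no device exists.
        if [ -d "$dev" ]; then
            echo "$(basename "$dev"):$port"
            return 0
        fi
    done
    return 1
}

# Export the variable only when a device was actually detected.
if devs=$(ucx_net_devices); then
    export UCX_NET_DEVICES="$devs"
fi
```

Leaving the variable unset when no device is found lets UCX fall back to its own device selection rather than pointing it at a device that does not exist.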

themorey, Nov 05 '20 15:11