Add a kernel environment variable to specify local rank

Open adrysn opened this issue 3 years ago • 0 comments

Main idea

Each session container embeds useful environment variables for cluster sessions:

BACKENDAI_CLUSTER_HOST="main1"
BACKENDAI_CLUSTER_HOSTS="sub1,main1"
BACKENDAI_CLUSTER_IDX="1"
BACKENDAI_CLUSTER_REPLICAS="main:1,sub:1"
BACKENDAI_CLUSTER_ROLE="main"
BACKENDAI_CLUSTER_SIZE="2"

However, there is no variable that can be used as the local index of the current container. It would be helpful to add such a variable for multi-node training scripts to auto-detect the current container's local rank.

Alternative ideas

Add a new variable like BACKENDAI_CLUSTER_LOCAL_RANK. Its value would be 0 for the main1 container, 1 for the sub1 container, 2 for the sub2 container, and so on.
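
Until such a variable exists, a training script could approximate it from the variables already provided. The following is only a rough sketch: the helper name `derive_local_rank` is hypothetical, and it assumes host names always have the form role plus number (`main1`, `sub1`, `sub2`, ...) and that `main` hosts should be ranked first. It reads the proposed BACKENDAI_CLUSTER_LOCAL_RANK if it is set and otherwise falls back to ordering BACKENDAI_CLUSTER_HOSTS.

```python
import os
import re


def derive_local_rank(env=None):
    """Return 0 for main1, then 1, 2, ... for sub1, sub2, ... (assumed ordering)."""
    env = os.environ if env is None else env

    # Prefer the proposed variable if it is set (it does not exist yet).
    proposed = env.get("BACKENDAI_CLUSTER_LOCAL_RANK")
    if proposed is not None:
        return int(proposed)

    def sort_key(host):
        # Split e.g. "sub12" into the role "sub" and the index 12.
        match = re.fullmatch(r"([a-z]+)(\d+)", host)
        if match is None:
            raise ValueError(f"unexpected host name: {host!r}")
        role, index = match.group(1), int(match.group(2))
        # Assumption: "main" hosts are ranked before every other role.
        return (0 if role == "main" else 1, role, index)

    # BACKENDAI_CLUSTER_HOSTS is not necessarily main-first (e.g. "sub1,main1"),
    # so sort it explicitly before looking up the current host.
    hosts = sorted(env["BACKENDAI_CLUSTER_HOSTS"].split(","), key=sort_key)
    return hosts.index(env["BACKENDAI_CLUSTER_HOST"])
```

A script could call `derive_local_rank()` at startup and export the result under whatever name its training launcher expects. A built-in variable would still be preferable, since this fallback depends on the host naming convention staying stable.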

Anything else?

No response

adrysn • Oct 19 '22 14:10