backend.ai
Add a kernel environment variable to specify local rank
Main idea
Each session container exposes environment variables that describe the cluster session it belongs to:
BACKENDAI_CLUSTER_HOST="main1"
BACKENDAI_CLUSTER_HOSTS="sub1,main1"
BACKENDAI_CLUSTER_IDX="1"
BACKENDAI_CLUSTER_REPLICAS="main:1,sub:1"
BACKENDAI_CLUSTER_ROLE="main"
BACKENDAI_CLUSTER_SIZE="2"
However, none of these variables gives the local index of the current container directly. Adding one would let multi-node training scripts auto-detect the current container's local rank.
Alternative ideas
Add a new variable like BACKENDAI_CLUSTER_LOCAL_RANK. Its value would be 0 for the main1 container, 1 for the sub1 container, 2 for the sub2 container, etc.
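Until such a variable exists, a training script could approximate it from the variables listed above. The following is only a sketch under assumptions not confirmed in this issue: that BACKENDAI_CLUSTER_IDX is 1-based within its role (as the main1 example suggests), and that roles appear in BACKENDAI_CLUSTER_REPLICAS in rank order with main first. The function name derive_local_rank is hypothetical.

```python
import os


def derive_local_rank(env=None):
    """Approximate the proposed BACKENDAI_CLUSTER_LOCAL_RANK value.

    Assumptions (not confirmed by the issue):
    - BACKENDAI_CLUSTER_IDX is 1-based within the container's role.
    - BACKENDAI_CLUSTER_REPLICAS lists roles in rank order, main first.
    """
    env = os.environ if env is None else env
    role = env["BACKENDAI_CLUSTER_ROLE"]
    idx = int(env["BACKENDAI_CLUSTER_IDX"])

    # Count the containers of every role listed before ours, then add
    # our zero-based position within our own role.
    offset = 0
    for entry in env["BACKENDAI_CLUSTER_REPLICAS"].split(","):
        name, count = entry.split(":")
        if name == role:
            return offset + idx - 1
        offset += int(count)
    raise ValueError(f"role {role!r} not found in BACKENDAI_CLUSTER_REPLICAS")
```

Under these assumptions, a script would call derive_local_rank() once at startup and pass the result to its distributed launcher. A dedicated variable set by the platform would still be preferable, since it would not depend on the ordering assumption.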
Anything else?
No response