Code fixes for local-storage-only environment
In certain virtualized environments there is no shared storage: both the source code and the data are stored (replicated) on each worker node's local storage. The code sections linked below load data or build/load binaries only from a local directory on the master, and the other nodes cannot see the master's file path (a simplified sketch follows the links).
https://github.com/NVIDIA/Megatron-LM/blob/cd2537d444792b487b1ab5a6fa685e09c9957409/megatron/data/gpt_dataset.py#L310-L317
(same for dataset builders other than GPT)
https://github.com/NVIDIA/Megatron-LM/blob/cd2537d444792b487b1ab5a6fa685e09c9957409/megatron/initialize.py#L92-L100
https://github.com/NVIDIA/Megatron-LM/blob/cd2537d444792b487b1ab5a6fa685e09c9957409/megatron/initialize.py#L126-L130
https://github.com/NVIDIA/Megatron-LM/blob/cd2537d444792b487b1ab5a6fa685e09c9957409/megatron/initialize.py#L140-L143
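For reference, a simplified sketch (not verbatim from the repo) of the rank-0-only pattern those lines follow; `_build_index_files` is a hypothetical stand-in for the actual index-building code:

```python
import numpy as np
import torch

def _build_index_files(prefix):
    # hypothetical stand-in for Megatron's index-building code:
    # writes the cache files on whatever disk is local to this process
    np.save(prefix + '_doc_idx.npy', np.arange(10))

def load_or_build_index(prefix):
    # only global rank 0 builds the cache files ...
    if torch.distributed.get_rank() == 0:
        _build_index_files(prefix)
    torch.distributed.barrier()
    # ... but every rank then reads them; without shared storage the files
    # exist only on the master node's disk, so other nodes hit FileNotFoundError
    return np.load(prefix + '_doc_idx.npy', allow_pickle=True, mmap_mode='r')
```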
Changing `torch.distributed.get_rank()` to `torch.distributed.get_rank() % torch.cuda.device_count()` in these checks fixes the problem by having one process on each node access its local disk. Of course, this assumes every node has the same number of devices; otherwise `os.environ['LOCAL_RANK']` needs to be passed from the main script to identify the local rank.
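A minimal sketch of that guard, assuming `torch.distributed` is already initialized and ranks are assigned contiguously per node (the default for `torchrun`); the `LOCAL_RANK` fallback covers heterogeneous nodes:

```python
import os
import numpy as np
import torch

def is_node_local_rank0():
    # one process per node should touch the node's local disk
    if torch.cuda.device_count() > 0:
        # works when every node exposes the same number of devices
        return torch.distributed.get_rank() % torch.cuda.device_count() == 0
    # otherwise rely on the local rank exported by the launcher (e.g. torchrun)
    return int(os.environ.get('LOCAL_RANK', 0)) == 0

def load_or_build_index(prefix):
    # same flow as before, but the first process on every node builds its own copy
    if is_node_local_rank0():
        _build_index_files(prefix)  # hypothetical helper from the sketch above
    torch.distributed.barrier()
    return np.load(prefix + '_doc_idx.npy', allow_pickle=True, mmap_mode='r')
```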
Marking as stale. No activity in 60 days.
I have a similar problem. My cluster has a relatively slow shared storage system, so I want to copy the dataset to each compute node's temporary storage. However, I found that Megatron builds the data index cache only on the rank 0 GPU, so the other nodes cannot access the cache file and raise a FileNotFoundError.
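In case it helps, a possible workaround sketch for staging the dataset onto node-local scratch before training, assuming a launcher that sets `LOCAL_RANK` and a writable local directory such as `/tmp` (both assumptions, adjust to your cluster); combined with the per-node guard above, the index cache is then built next to the local copy on every node:

```python
import os
import shutil
import torch

def stage_dataset_to_local(src_dir, dst_dir='/tmp/megatron_data'):
    # copy the dataset from slow shared storage onto this node's local scratch;
    # only the first process on each node copies, the rest wait at the barrier
    if int(os.environ.get('LOCAL_RANK', 0)) == 0 and not os.path.isdir(dst_dir):
        shutil.copytree(src_dir, dst_dir)
    torch.distributed.barrier()
    # point --data-path at the local copy so all reads (and, with the per-node
    # guard above, the index cache) stay on local disk
    return dst_dir
```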
@jindajia I have the same issue. Could you please advise on how to resolve it?