Code fixes for local-storage-only environment

Open learning-chip opened this issue 2 years ago • 2 comments

In certain virtualized environments there is no shared storage. Both source code and data are stored (replicated) on each worker node's local storage. The code sections below load data or build/load binaries only in a local directory, and other nodes cannot see the master node's file path.

https://github.com/NVIDIA/Megatron-LM/blob/cd2537d444792b487b1ab5a6fa685e09c9957409/megatron/data/gpt_dataset.py#L310-L317

(The same applies to dataset builders other than GPT.)

https://github.com/NVIDIA/Megatron-LM/blob/cd2537d444792b487b1ab5a6fa685e09c9957409/megatron/initialize.py#L92-L100

https://github.com/NVIDIA/Megatron-LM/blob/cd2537d444792b487b1ab5a6fa685e09c9957409/megatron/initialize.py#L126-L130

https://github.com/NVIDIA/Megatron-LM/blob/cd2537d444792b487b1ab5a6fa685e09c9957409/megatron/initialize.py#L140-L143

Changing torch.distributed.get_rank() to torch.distributed.get_rank() % torch.cuda.device_count() fixes the problem by having one process on each node access its local disk. This assumes that each node has the same number of devices. Otherwise, os.environ['LOCAL_RANK'] needs to be passed from the main script to identify the local rank.
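
The rank selection described above could be sketched roughly as follows. This is only an illustration, not code from Megatron-LM: the helper names resolve_local_rank and is_node_leader are hypothetical. The modulo fallback assumes every node has the same number of devices and that global ranks are assigned contiguously per node.

```python
import os


def resolve_local_rank(global_rank, devices_per_node, env=None):
    """Return this process's rank within its own node (hypothetical helper).

    Prefers the LOCAL_RANK environment variable when the launcher sets it;
    otherwise falls back to global_rank % devices_per_node, which is only
    valid if all nodes have the same number of devices (assumption).
    """
    env = os.environ if env is None else env
    local_rank = env.get("LOCAL_RANK")
    if local_rank is not None:
        return int(local_rank)
    return global_rank % devices_per_node


def is_node_leader(global_rank, devices_per_node, env=None):
    """True for exactly one process per node: the one with local rank 0."""
    return resolve_local_rank(global_rank, devices_per_node, env) == 0
```

In a training script the arguments would presumably come from torch.distributed.get_rank() and torch.cuda.device_count(), and the existing rank-0 checks would be replaced by is_node_leader(...).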

learning-chip avatar May 30 '23 04:05 learning-chip

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jul 29 '23 18:07 github-actions[bot]

I have a similar problem. My cluster has a relatively slow shared storage system, so I want to copy the dataset to each compute node's temporary local storage. However, I found that Megatron first builds the data index cache only on the rank 0 GPU, so the other nodes cannot access this cache file and fail with FileNotFoundError.

jindajia avatar Apr 11 '24 02:04 jindajia

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jun 10 '24 18:06 github-actions[bot]

@jindajia I have the same issue. Could you please advise on how to resolve it?

MAxx8371 avatar Jul 13 '24 08:07 MAxx8371
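
One possible workaround, following the local-rank idea from the original post, is to build the cache once per node instead of once globally. It is sketched below as an untested assumption rather than a confirmed fix. The function build_cache_per_node and its parameters are hypothetical and not part of Megatron-LM; build_fn stands in for whatever code writes the index cache.

```python
import os


def build_cache_per_node(cache_path, build_fn, local_rank, barrier_fn):
    """Build a cache file once per node, then let every process use it.

    All names here are hypothetical; nothing below is Megatron-LM API.
    """
    # Only the first process on each node writes to that node's local disk.
    if local_rank == 0 and not os.path.exists(cache_path):
        build_fn(cache_path)
    # Every process waits here until the node leaders have finished building.
    barrier_fn()
    # After the barrier, the file should exist on every node's local storage.
    if not os.path.exists(cache_path):
        raise FileNotFoundError(cache_path)
    return cache_path
```

In a distributed job, barrier_fn would presumably be torch.distributed.barrier and local_rank would come from int(os.environ['LOCAL_RANK']), assuming the launcher sets that variable.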