Redundant CPU memory usage during data loading
Is your feature request related to a problem? Please describe.
In LLM CPT/SFT distributed training, each rank independently loads the data into CPU memory, which leads to 8x CPU memory usage on a node with 8 GPUs.
Related code is here: https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/llm/gpt/data/core.py#L718
This code runs during GPTSFTPackedDataset initialization on every GPU rank. The data should be loaded into CPU memory only once, with each GPU rank sharing access to that memory.
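For illustration, here is a rough single-node sketch of one way this could work, using POSIX shared memory so that only local rank 0 reads the file. The function and names (`load_once_per_node`, `shm_name`) are hypothetical, not existing NeMo API, and the sketch assumes `torch.distributed` is already initialized:

```python
import numpy as np
import torch.distributed as dist
from multiprocessing.shared_memory import SharedMemory

def load_once_per_node(path: str, local_rank: int,
                       shm_name: str = "packed_data"):
    """Load `path` on local rank 0 only; other ranks attach to the
    same shared-memory buffer instead of re-loading the file."""
    if local_rank == 0:
        arr = np.load(path)
        shm = SharedMemory(name=shm_name, create=True, size=arr.nbytes)
        shared = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
        shared[:] = arr  # single copy into shared memory
        meta = [arr.shape, str(arr.dtype)]
    else:
        meta = [None, None]
    # The collective call doubles as a barrier: non-zero ranks block
    # here until rank 0 has created and filled the shared buffer.
    dist.broadcast_object_list(meta, src=0)
    if local_rank != 0:
        shm = SharedMemory(name=shm_name)
        shared = np.ndarray(tuple(meta[0]), dtype=np.dtype(meta[1]),
                            buffer=shm.buf)
    # Caller must keep `shm` alive while `shared` is in use and
    # unlink it on rank 0 at shutdown.
    return shared, shm
```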
Hi @yspMing, if CPU memory is causing an issue for you, you can try memory-mapping the file using the `mmap_mode` option of `numpy.load` (see ref). Let me know if that works; if so, we can add a config option to enable mmap.
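For reference, a minimal sketch of the memory-mapped load. The file name is a placeholder, and this assumes the packed data is a plain (non-object-dtype) `.npy` array:

```python
import numpy as np

# Placeholder path standing in for the packed-dataset file that
# GPTSFTPackedDataset loads during initialization.
path = "packed_sft_data.npy"

# Eager load: every rank materializes its own full copy in RAM.
# data = np.load(path)

# Memory-mapped load: the array is backed by the file, and the OS page
# cache is shared across processes, so 8 ranks on one node pay the
# memory cost roughly once instead of 8x.
data = np.load(path, mmap_mode="r")

# Indexing works as usual; only the touched pages get paged in.
sample = data[0]
```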