🚀 Feature: Scaling simulation with ZeRO-Infinity
🔖 Feature description
Support DeepSpeed ZeRO-Infinity in the multiprocessing job executor.
🎤 Pitch
The actor system can already support large-scale worker modeling, but the ML Job Executor is still limited by available GPU memory when spawning new processes. To scale up to larger FL simulations, we need to use CPU and NVMe memory alongside the GPU computation. ZeRO-Infinity provides an offloading engine that can train a trillion-parameter model on a single NVIDIA DGX-2 node.
This would lift the GPU memory limitation and let us support larger parallel FL workloads.
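As a rough sketch of what the executor could pass to DeepSpeed, the following hypothetical Python fragment builds a ZeRO-Infinity-style config (ZeRO stage 3 with optimizer and parameter offload to NVMe). The keys follow DeepSpeed's `zero_optimization` config schema, but the `nvme_path` and batch-size values are placeholders, not a tested configuration:

```python
# Hypothetical ZeRO-Infinity config sketch for the job executor.
# Keys follow DeepSpeed's "zero_optimization" schema; values are placeholders.
def build_zero_infinity_config(nvme_path: str = "/local_nvme") -> dict:
    return {
        "train_micro_batch_size_per_gpu": 1,  # placeholder
        "zero_optimization": {
            "stage": 3,  # partition parameters, gradients, and optimizer states
            "offload_optimizer": {
                "device": "nvme",       # spill optimizer states to NVMe
                "nvme_path": nvme_path,
            },
            "offload_param": {
                "device": "nvme",       # spill model parameters to NVMe
                "nvme_path": nvme_path,
            },
        },
    }

# The executor could then hand this dict to deepspeed.initialize(...)
# when spawning each worker process.
```

This is only meant to illustrate the shape of the integration; the actual wiring into the job executor's process-spawning path would need its own design.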
📖 Additional Content
Abstract of the Paper
> In this paper, we present ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring. At the same time, it achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters. It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible. In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs (40% of peak), while also demonstrating super linear scalability. An open source implementation of ZeRO-Infinity is available through DeepSpeed.
👀 Have you spent some time to check if this issue has been raised before?
- [X] I checked and didn't find a similar issue
🏢 Have you read the Code of Conduct?
- [X] I have read the Code of Conduct