[RFC]: Tensor Initialization on Different Devices
Describe the feature
Currently, Colossal-AI requires at least PyTorch 1.8, as this is the lowest version that provides the full set of communication operations we rely on. However, PyTorch 1.8 does not support initializing tensors directly on GPUs.
```python
import torch
import torch.nn as nn

# allowed
layer = nn.Linear(128, 128)

# NOT allowed: the `device` keyword is not supported in PyTorch 1.8
layer = nn.Linear(128, 128, device=torch.cuda.current_device())
```
The current implementation forces the tensor-parallel (TP) layers to be initialized on GPUs by default. This is because for TP layers, we may want them to be initialized to different values on different ranks, which is achieved by forking into different CUDA RNG states. An example is shown below:
```python
with seed(ParallelMode.TENSOR):
    layer = colossalai.nn.Linear(128, 128)  # layer is created on GPUs
```
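A minimal CPU-side sketch of the idea: the real seed manager forks CUDA RNG states per parallel mode, but here a per-rank `torch.Generator` with an assumed seed-derivation rule stands in for it, just to illustrate why different ranks end up with different initial weights.

```python
import torch

# Hypothetical stand-in for the TP seed manager: each rank derives its own
# seed, so the same layer gets different initial weights on different ranks.
def init_weight_on_rank(rank: int, shape=(4, 4), base_seed: int = 1024):
    gen = torch.Generator()            # per-rank RNG, isolated from the global state
    gen.manual_seed(base_seed + rank)  # assumed seed-derivation rule, for illustration
    return torch.empty(shape).normal_(generator=gen)

w0 = init_weight_on_rank(0)
w1 = init_weight_on_rank(1)
```

Each rank draws from its own forked RNG state, so `w0` and `w1` differ while remaining reproducible per rank.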
For the sake of API consistency, we should initialize the TP layers on CPUs by default if no device is given. This brings two benefits:
- `colossalai.nn` behaves the same as `torch.nn`, as tensors are initialized on CPU by default. TP layers will only be initialized on GPUs when `device='cuda'` is given.
- This reduces the possibility of CUDA OOM when running pipeline partitioning. When partitioning the model into pipeline stages, one strategy is to count the number of parameters per layer and split accordingly. If all layers are initialized on GPUs, large models may cause OOM right away.
The problem with CPU initialization is that we cannot initialize layers differently on different ranks. This is because we currently control initialization through the CUDA RNG, not the CPU RNG. One solution is to add CPU seeds to the seed manager as well.
I recommend using an init context to solve the problem rather than changing the `colossalai.nn` functionality. The ZeRO init context provides a `target_device` argument to designate the device on which parameter tensors are initialized. I suggest that such a context not be used only for ZeRO, but for all parallel strategies, including TP.
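To make the proposal concrete, here is a hedged sketch of what a generic device-targeting init context could look like. This is not the actual `ZeroInitContext` implementation; `init_on_device` is a hypothetical name, and the mechanism shown (temporarily patching `nn.Module.register_parameter` so every parameter created inside the context lands on the target device) is just one way such a context could work.

```python
import contextlib
import torch
import torch.nn as nn

# Hypothetical generic init context: every parameter registered while the
# context is active is moved to `target_device` at creation time.
@contextlib.contextmanager
def init_on_device(target_device: torch.device):
    original = nn.Module.register_parameter

    def wrapper(module, name, param):
        if param is not None:
            param = nn.Parameter(param.data.to(target_device), param.requires_grad)
        original(module, name, param)

    nn.Module.register_parameter = wrapper
    try:
        yield
    finally:
        nn.Module.register_parameter = original  # always restore the original

with init_on_device(torch.device("cpu")):
    layer = nn.Linear(8, 8)  # parameters created on the target device
```

Because the context is independent of any particular parallel strategy, the same mechanism could serve TP, ZeRO, and pipeline partitioning alike.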
I think a context will be good, but I wonder whether it can achieve the desired randomness. For example, if a tensor is initialized under both `with seed(ParallelMode.TENSOR)` and `with ZeroInitContext()`, will it be initialized to different values on different ranks if the device is CPU?
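The concern can be demonstrated directly: CUDA seeding does not drive CPU tensor initialization, so a CUDA-only seed fork leaves CPU-initialized tensors identical across ranks. Below, plain `torch.cuda.manual_seed` / `torch.manual_seed` calls stand in for the seed context, simulating two ranks on one process.

```python
import torch

# Simulate two "ranks" that fork only their CUDA RNG state: the CPU RNG is
# untouched by CUDA seeding, so CPU-initialized tensors come out identical.
def cpu_init_under_cuda_seed(cuda_seed: int):
    if torch.cuda.is_available():
        torch.cuda.manual_seed(cuda_seed)  # per-rank CUDA seed (stand-in for the fork)
    torch.manual_seed(0)                   # shared CPU RNG state on every rank
    return torch.empty(4, 4).normal_()     # CPU tensor: driven by CPU RNG only

a = cpu_init_under_cuda_seed(1)
b = cpu_init_under_cuda_seed(2)
```

Here `a` and `b` are equal despite the different CUDA seeds, which is why CPU-side seeds (or per-rank CPU generators) would be needed inside the context.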
We have updated a lot. This issue was closed due to inactivity. Thanks. https://github.com/hpcaitech/ColossalAI/discussions/3124