[RFC]: Tensor Initialization on Different Devices
Describe the feature
Currently, Colossal-AI requires at least PyTorch 1.8, as this is the lowest version that provides the full set of communication operations we rely on. However, PyTorch 1.8 does not support initializing tensors directly on GPUs.
```python
import torch
import torch.nn as nn

# allowed
layer = nn.Linear(128, 128)

# NOT allowed: the `device` keyword is not supported in PyTorch 1.8
layer = nn.Linear(128, 128, device=torch.cuda.current_device())
```
The current implementation forces the tensor-parallel (TP) layers to be initialized on GPUs by default. This is because for TP layers, we may want them to be initialized to different values on different ranks, which is achieved by forking into different CUDA RNG states. An example is shown below:
```python
with seed(ParallelMode.TENSOR):
    layer = colossalai.nn.Linear(128, 128)  # layer is created on GPUs
```
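A minimal CPU-side sketch of the idea: the real seed manager forks CUDA RNG states per parallel mode, but here a per-rank `torch.Generator` with an assumed seed-derivation rule stands in for it, just to illustrate why different ranks end up with different initial weights.

```python
import torch

# Hypothetical stand-in for the TP seed manager: each rank derives its own
# seed, so the same layer gets different initial weights on different ranks.
def init_weight_on_rank(rank: int, shape=(4, 4), base_seed: int = 1024):
    gen = torch.Generator()            # per-rank RNG, isolated from the global state
    gen.manual_seed(base_seed + rank)  # assumed seed-derivation rule, for illustration
    return torch.empty(shape).normal_(generator=gen)

w0 = init_weight_on_rank(0)
w1 = init_weight_on_rank(1)
```

Each rank draws from its own forked RNG state, so `w0` and `w1` differ while remaining reproducible per rank.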
For the sake of API consistency, we should initialize the TP layers on CPUs by default if no device is given. This brings two benefits:
- `colossalai.nn` behaves the same as `torch.nn`, as tensors are initialized on CPU by default. TP layers will only be initialized on GPUs when `device='cuda'` is given.
- This reduces the possibility of CUDA OOM when running pipeline partitioning. When partitioning the model into pipeline stages, one strategy is to count the number of parameters per layer and split accordingly. If all layers are initialized on GPUs, large models may cause OOM right away.
The problem with CPU initialization is that we cannot initialize layers differently on different ranks. This is because we currently control initialization through the CUDA RNG, not the CPU RNG. One solution is to add CPU seeds to the seed manager as well.
I recommend using an init context to solve the problem rather than changing the `colossalai.nn` functionality. The ZeRO init context provides a `target_device` argument to designate the device on which parameter tensors are initialized. I suggest that such a context not be used only for ZeRO, but for all parallel strategies, including TP.
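To make the proposal concrete, here is a hedged sketch of what a generic device-targeting init context could look like. This is not the actual `ZeroInitContext` implementation; `init_on_device` is a hypothetical name, and the mechanism shown (temporarily patching `nn.Module.register_parameter` so every parameter created inside the context lands on the target device) is just one way such a context could work.

```python
import contextlib
import torch
import torch.nn as nn

# Hypothetical generic init context: every parameter registered while the
# context is active is moved to `target_device` at creation time.
@contextlib.contextmanager
def init_on_device(target_device: torch.device):
    original = nn.Module.register_parameter

    def wrapper(module, name, param):
        if param is not None:
            param = nn.Parameter(param.data.to(target_device), param.requires_grad)
        original(module, name, param)

    nn.Module.register_parameter = wrapper
    try:
        yield
    finally:
        nn.Module.register_parameter = original  # always restore the original

with init_on_device(torch.device("cpu")):
    layer = nn.Linear(8, 8)  # parameters created on the target device
```

Because the context is independent of any particular parallel strategy, the same mechanism could serve TP, ZeRO, and pipeline partitioning alike.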
I think a context will be good, but I wonder whether it can achieve the desired randomness. For example, if a tensor is initialized under both `with seed(ParallelMode.TENSOR)` and `with ZeroInitContext()`, will it be initialized to different values on different ranks if the device is CPU?
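The concern can be demonstrated directly: CUDA seeding does not drive CPU tensor initialization, so a CUDA-only seed fork leaves CPU-initialized tensors identical across ranks. Below, plain `torch.cuda.manual_seed` / `torch.manual_seed` calls stand in for the seed context, simulating two ranks on one process.

```python
import torch

# Simulate two "ranks" that fork only their CUDA RNG state: the CPU RNG is
# untouched by CUDA seeding, so CPU-initialized tensors come out identical.
def cpu_init_under_cuda_seed(cuda_seed: int):
    if torch.cuda.is_available():
        torch.cuda.manual_seed(cuda_seed)  # per-rank CUDA seed (stand-in for the fork)
    torch.manual_seed(0)                   # shared CPU RNG state on every rank
    return torch.empty(4, 4).normal_()     # CPU tensor: driven by CPU RNG only

a = cpu_init_under_cuda_seed(1)
b = cpu_init_under_cuda_seed(2)
```

Here `a` and `b` are equal despite the different CUDA seeds, which is why CPU-side seeds (or per-rank CPU generators) would be needed inside the context.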
We have updated a lot. This issue was closed due to inactivity. Thanks. https://github.com/hpcaitech/ColossalAI/discussions/3124