
[BUG]: ColossalMoE Train: AssertionError: Parameters are expected to have the same dtype `torch.bfloat16`, but got `torch.float32`

Camille7777 opened this issue 1 year ago · 1 comment

🐛 Describe the bug

At the stage of booster initialization, some parameters end up with the wrong dtype of torch.float32 even though the precision is set to "bf16", so the optimizer initialization inside the booster cannot pass the sanity check on parameter dtypes.

The detailed error traceback is in the screenshot attached to the original issue (2024-04-26).

The bug can be traced through the following call chain: self.plugin.configure -> HybridParallelZeroOptimizer -> LowLevelZeroOptimizer -> _sanity_checks.

There may be some bugs in HybridParallelModule or MixtralModelPolicy.
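To narrow down which parameters are affected, a quick diagnostic (a minimal sketch, not ColossalAI API; `model` here stands for whatever module the plugin returns, e.g. the HybridParallelModule wrapper, and report_non_bf16_params is a hypothetical helper) is to walk the named parameters right before the optimizer is built and print any that are still torch.float32:

import torch
import torch.nn as nn

def report_non_bf16_params(model: nn.Module) -> None:
    # Collect every parameter whose dtype differs from bf16.
    mismatched = [
        (name, param.dtype)
        for name, param in model.named_parameters()
        if param.dtype != torch.bfloat16
    ]
    for name, dtype in mismatched:
        print(f"{name}: {dtype}")
    print(f"{len(mismatched)} parameter(s) are not bf16")

Running this right before the optimizer is configured should show exactly which modules (possibly the MoE gate or expert weights) are left in float32.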

My test shell script:

NUM_GPU=2
MODEL="path to Mixtral-tiny model"
SEQ_LENGTH=2048
BATCH_SIZE=1
LR=0.00001

# hybrid
# torchrun --standalone --nproc_per_node $NUM_GPU \
colossalai run --nproc_per_node $NUM_GPU --hostfile "hostfile" \
    train.py \
    --num_epoch 1 \
    --model_name $MODEL \
    --plugin "hybrid" \
    --batch_size $BATCH_SIZE \
    --lr $LR \
    --zero_stage 1 \
    --pp_size 1 \
    --dp_size 1 \
    --ep_size 2 \
    --max_length $SEQ_LENGTH 
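As a temporary workaround sketch (assuming train.py builds the model with transformers and then calls booster.boost; I have not verified this against the actual ColossalMoE training script, and boost_in_bf16 is a hypothetical helper), forcing the whole model to bf16 before boosting sidesteps the sanity check, though it only hides whichever conversion the plugin is missing:

import torch
from colossalai.booster import Booster

def boost_in_bf16(booster: Booster, model, optimizer, dataloader):
    # Workaround: cast every floating-point parameter/buffer to bf16
    # before the plugin's dtype sanity check runs inside booster.boost().
    model = model.to(torch.bfloat16)
    model, optimizer, _, dataloader, _ = booster.boost(
        model, optimizer, dataloader=dataloader
    )
    return model, optimizer, dataloader

This is only a stopgap; the real fix presumably belongs in HybridParallelModule / MixtralModelPolicy so that the bf16 cast covers the MoE parameters as well.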

Environment

CUDA 12.1
torch 2.1.0
Python 3.10.14
colossalai 0.3.6 (main)
colossal-moe 1.0.0
transformers 4.36.2

Camille7777 · Apr 26 '24 14:04

Both @ver217 and I have seen this bug, which appears when pipeline parallelism (pp) is off. We will dig more into it.

Edenzzzz · Apr 27 '24 06:04