ColossalAI
[BUG]: ColossalMoE Train: AssertionError: Parameters are expected to have the same dtype `torch.bfloat16`, but got `torch.float32`
🐛 Describe the bug
At the stage of booster initialization, some parameters still have the wrong dtype `torch.float32` even though the precision is set to "bf16", so the optimizer initialization inside the booster cannot pass the sanity check on parameter dtypes.
Here is the detailed error info:
The failure can be traced as: self.plugin.configure -> HybridParallelZeroOptimizer -> LowLevelZeroOptimizer -> _sanity_checks
There may be some bugs in HybridParallelModule or MixtralModelPolicy.
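For reference, the kind of check that `_sanity_checks` performs can be sketched in plain PyTorch. This is a simplified illustration, not the actual ColossalAI code: `sanity_check_dtypes` is a hypothetical helper, and the fp32 parameter is added manually to simulate one that was missed by the bf16 cast.

```python
import torch
import torch.nn as nn

def sanity_check_dtypes(params, expected=torch.bfloat16):
    # Mimics the optimizer's check: every trainable parameter
    # must share the master dtype.
    mismatched = [p.dtype for p in params if p.dtype != expected]
    assert not mismatched, (
        f"Parameters are expected to have the same dtype "
        f"`{expected}`, but got `{mismatched[0]}`"
    )

model = nn.Linear(4, 4).to(torch.bfloat16)
# A parameter left in fp32 (e.g. one missed during the bf16 cast)
# triggers the same AssertionError reported above.
model.extra = nn.Parameter(torch.zeros(4, dtype=torch.float32))
try:
    sanity_check_dtypes(model.parameters())
except AssertionError as e:
    print(e)
```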
My test shell:
NUM_GPU=2
MODEL="path to Mixtral-tiny model"
SEQ_LENGTH=2048
BATCH_SIZE=1
LR=0.00001
# hybrid
# torchrun --standalone --nproc_per_node $NUM_GPU \
colossalai run --nproc_per_node $NUM_GPU --hostfile "hostfile" \
train.py \
--num_epoch 1 \
--model_name $MODEL \
--plugin "hybrid" \
--batch_size $BATCH_SIZE \
--lr $LR \
--zero_stage 1 \
--pp_size 1 \
--dp_size 1 \
--ep_size 2 \
--max_length $SEQ_LENGTH
Environment
CUDA 12.1, torch 2.1.0, Python 3.10.14, colossalai 0.3.6 (main), colossal-moe 1.0.0, transformers 4.36.2
Both @ver217 and I have seen this bug; it appears when pipeline parallelism (pp) is off. We will dig more into it.
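To narrow down which module is left in fp32 (e.g. a MoE gate or router skipped by the precision cast), a small diagnostic like the following can be run on the wrapped model before `booster.boost`. This is a hypothetical helper on a toy model, not taken from the ColossalAI codebase; the second layer is cast back to fp32 to simulate the offending submodule.

```python
import torch
import torch.nn as nn

def find_offending_params(model, expected=torch.bfloat16):
    # Return (name, dtype) for every parameter whose dtype
    # deviates from the expected master dtype.
    return [
        (name, p.dtype)
        for name, p in model.named_parameters()
        if p.dtype != expected
    ]

# Toy model standing in for the Mixtral model after the bf16 cast.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8)).to(torch.bfloat16)
model[1].float()  # simulate a submodule that missed the bf16 cast
print(find_offending_params(model))
```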