[BUG]: GeminiDDP: "ZERO DDP error: the synchronization of gradients doesn't exit properly.", 'The most possible reason is that the model is not compatible with ZeroDDP
🐛 Describe the bug
I got an error when training BERT-large with GeminiDDP.

Error location: `self.optimizer.backward(loss)`

Error message:

```
RuntimeError: ("ZERO DDP error: the synchronization of gradients doesn't exit properly.", 'The most possible reason is that the model is not compatible with ZeroDDP.\n', 'Reduction failed at followed parameters:\n\tbert.embeddings.word_embeddings.weight\n\tbert.embeddings.position_embeddings.weight\n\tbert.embeddings.token_type_embeddings.weight\n\tbert.embeddings.LayerNorm.weight\n\tbert.embeddings.LayerNorm.bias\n\tbert.encoder.layer.0.attention.self.query.weight\n\tbert.encoder.layer.0.attention.self.query.bias\n\tbert.encoder.layer.0.attention.self.key.weight\n\tbert.encoder.layer.0.attention.self.key.bias\n\tbert.encoder.layer.0.attention.self.value.weight\n\tbert.encoder.layer.0.attention.self.value.bias\n.......
```
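From the message, this error fires when some parameters never receive a gradient during the backward pass, so Gemini's gradient reduction never completes for them. One way to narrow it down is to run a single forward/backward on the plain model, before wrapping it in GeminiDDP, and list every parameter whose `.grad` is still `None`. This is only a minimal sketch; the model and loss construction are assumed:

```python
import torch

def find_unused_parameters(model: torch.nn.Module, loss: torch.Tensor):
    """List parameters that received no gradient in one backward pass.

    GeminiDDP expects every registered parameter to produce a gradient
    each step, so any name printed here is a candidate for the
    "Reduction failed at followed parameters" list above.
    """
    model.zero_grad(set_to_none=True)
    loss.backward()  # `loss` must come from a fresh forward pass
    unused = [n for n, p in model.named_parameters()
              if p.requires_grad and p.grad is None]
    for n in unused:
        print(f"no gradient: {n}")
    return unused
```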
Code that may be involved:

```python
with ColoInitContext(device=init_dev):
    model_config = BertConfig.from_pretrained(args.model_name_or_path, pool_type=args.pool_type)
    model = DenseModel.from_pretrained(args.model_name_or_path, config=model_config)

# enable gradient checkpointing
model.gradient_checkpointing_enable()

PLACEMENT_POLICY = 'cpu'
cai_version = colossalai.__version__
model = GeminiDDP(model, device=get_current_device(), placement_policy=PLACEMENT_POLICY, pin_memory=True)
```
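For reference, newer ColossalAI releases drive Gemini through the Booster API instead of constructing GeminiDDP directly. Roughly, the setup above would become the sketch below; the keyword names follow recent releases and may differ in your installed version:

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

# Sketch only: mirrors the GeminiDDP arguments used above; exact
# keyword names vary between ColossalAI releases, so check your version.
plugin = GeminiPlugin(placement_policy='cpu', pin_memory=True)
booster = Booster(plugin=plugin)
model, optimizer, _, _, lr_scheduler = booster.boost(
    model, optimizer, lr_scheduler=lr_scheduler
)
```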
and the optimizer setup:

```python
def get_optimizer(model, args):
    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0},
    ]
    optimizer = HybridAdam(optimizer_grouped_parameters, lr=args.learning_rate)
    optimizer = ZeroOptimizer(optimizer, model, initial_scale=2**14)
    lr_scheduler = get_scheduler(
        name=args.lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=args.train_steps * args.warmup,
        num_training_steps=args.train_steps,
    )
    return optimizer, lr_scheduler
```
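With this setup, the training step goes through ZeroOptimizer rather than calling `loss.backward()` directly, roughly as below; the dataloader and loss computation are assumed for illustration:

```python
# One training step with GeminiDDP + ZeroOptimizer (sketch).
optimizer.zero_grad()
loss = model(**batch).loss  # assumes a HF-style model output
optimizer.backward(loss)    # not loss.backward()
optimizer.step()
lr_scheduler.step()
```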
How should I modify my code to avoid this? Thanks.
Environment
No response
I met the same error, did you solve it?
> I met the same error, did you solve it?

No, not yet.
I met the same error with Mixtral-8x7B-v0.1:

```
  File "pretrain.py", line 223, in main
    booster.backward(loss, optimizer)
  File ".local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 167, in backward
    optimizer.backward(loss)
  File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_optimizer.py", line 291, in backward
    self.module.backward(loss)
  File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 331, in backward
    self._post_backward()
  File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 314, in _post_backward
    raise RuntimeError(
RuntimeError: ("ZERO DDP error: the synchronization of gradients doesn't exit properly.", 'The most possible reason is that the model is not compatible with GeminiDDP.\n', 'Reduction failed at followed parameters:\n\tmodel.layers.22.block_sparse_moe.experts.2.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.2.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.2.w3.weight\n\tmodel.layers.22.block_sparse_moe.experts.3.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.3.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.3.w3.weight\n\tmodel.layers.22.block_sparse_moe.experts.4.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.4.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.4.w3.weight\n\tmodel.layers.22.block_sparse_moe.experts.5.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.5.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.5.w3.weight\n\tmodel.layers.22.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.22.block_sparse_moe.experts.7.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.7.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.7.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.2.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.2.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.2.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.3.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.3.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.3.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.4.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.4.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.4.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.5.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.5.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.5.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.7.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.7.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.7.w3.weight\n\tmodel.layers.24.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.24.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.24.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.2.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.2.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.2.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.3.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.3.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.3.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.4.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.4.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.4.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.5.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.5.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.5.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.7.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.7.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.7.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.2.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.2.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.2.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.3.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.3.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.3.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.4.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.4.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.4.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.5.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.5.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.5.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.7.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.7.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.7.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.2.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.2.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.2.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.3.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.3.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.3.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.4.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.4.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.4.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.5.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.5.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.5.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.7.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.7.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.7.w3.weight')
```
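The failing parameters here are all expert weights, which suggests the root cause: with Mixtral's top-2 routing, only a subset of experts receives tokens (and therefore gradients) in a given step, while Gemini's reduction expects a gradient for every parameter. You can see this on the unwrapped model; the model and loss construction are assumed, as in the earlier sketch:

```python
# After one forward/backward on the plain (un-wrapped) model, experts
# that routed zero tokens in this batch keep grad == None -- the same
# names that appear in the error above.
loss.backward()
unused_experts = [n for n, p in model.named_parameters()
                  if 'block_sparse_moe.experts' in n and p.grad is None]
print(f"{len(unused_experts)} expert weight tensors received no gradient")
```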