[BUG]: RuntimeError: Parameter model.lm.parameters failed at the gradient reduction. Some unsupported torch function is operated upon this parameter.
🐛 Describe the bug
With strategy `colossal_gemini`, my debug output shows `chunk.tensors_info[p].state` is `TensorState.HOLD` rather than `TensorState.HOLD_AFTER_BWD`, which raises: `RuntimeError: Parameter model.lm.parameters failed at the gradient reduction. Some unsupported torch function is operated upon this parameter.` I don't know how to resolve this. What's wrong here?
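For readers unfamiliar with the error, the following is a minimal illustrative sketch (not ColossalAI's actual implementation) of the invariant the message describes: at reduction time, Gemini expects each parameter's chunk state to be `HOLD_AFTER_BWD`, i.e. the backward pass for that parameter has completed and nothing else has touched it. The `TensorState` enum and `check_reducible` helper below are hypothetical names used only for this sketch.

```python
from enum import Enum, auto


class TensorState(Enum):
    """Simplified stand-in for the chunk tensor states seen in the logs."""
    HOLD = auto()            # parameter is held, backward not yet finished
    HOLD_AFTER_BWD = auto()  # backward finished; gradient may be reduced


def check_reducible(param_name: str, state: TensorState) -> None:
    # Gradient reduction is only valid once backward has produced the
    # gradient for this parameter. Any other state suggests some torch
    # operation acted on the parameter outside the tracked autograd path.
    if state is not TensorState.HOLD_AFTER_BWD:
        raise RuntimeError(
            f"Parameter {param_name} failed at the gradient reduction. "
            "Some unsupported torch function is operated upon this parameter."
        )


# A parameter still in HOLD at reduction time triggers the error:
try:
    check_reducible("model.lm.parameters", TensorState.HOLD)
except RuntimeError as e:
    print("raised:", e)

# A parameter in HOLD_AFTER_BWD passes silently:
check_reducible("model.lm.parameters", TensorState.HOLD_AFTER_BWD)
```

In other words, seeing `HOLD` where `HOLD_AFTER_BWD` was expected usually means some layer or operation in the model bypassed the hooks Gemini uses to track the backward pass.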
Environment
No response
Hi @Youly172 Could you please provide more details to help us reproduce it? e.g., the example script, command, environment, and any changes you made?
I also encountered the same error. Did you manage to resolve it later on?
I'm using colossalai version 0.3.0 and hit the same error: `RuntimeError: Parameter "tor_bond_conv.batch_norm.bias" failed at the gradient reduction. Some unsupported torch function is operated upon this parameter.` In `gemini_plugin.py` I found a comment noting that zero support in colossalai is currently not optimal, along with the commented-out line `model = nn.SyncBatchNorm.convert_sync_batchnorm(model, None)`. I suspected the Batch Normalization layers in my model were the cause, but even after uncommenting that line the error persists unchanged.
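For anyone wanting to try the SyncBatchNorm workaround mentioned above in isolation, here is a small self-contained sketch of what `nn.SyncBatchNorm.convert_sync_batchnorm` does: it recursively replaces every `BatchNorm*d` module with a `SyncBatchNorm` (the `Net` model below is a made-up example, not the reporter's model). The conversion itself runs without a distributed process group; only a training-mode forward pass requires one.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Toy model containing a BatchNorm layer, for demonstration only."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.batch_norm = nn.BatchNorm2d(8)

    def forward(self, x):
        return self.batch_norm(self.conv(x))


model = Net()
# Replace every BatchNorm*d layer with SyncBatchNorm. Passing None as the
# process_group means the default (world) group is used at runtime.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model, None)
print(type(model.batch_norm).__name__)  # SyncBatchNorm
```

If the error persists even after this conversion, the BatchNorm layers may not be the culprit; it is worth checking for other operations on parameters outside the module's `forward` (e.g. manual in-place updates), which Gemini's bookkeeping cannot track.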