[BUG]: RuntimeError: Parameter model.lm.parameters failed at the gradient reduction. Some unsupported torch function is operated upon this parameter.
🐛 Describe the bug
With strategy `colossal_gemini`, my debug output shows `chunk.tensors_info[p].state` is `TensorState.HOLD` rather than `TensorState.HOLD_AFTER_BWD`, which raises: `RuntimeError: Parameter model.lm.parameters failed at the gradient reduction. Some unsupported torch function is operated upon this parameter.` I don't know how to resolve this. What's wrong here?
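For readers unfamiliar with the error, the following is a minimal illustrative sketch (not ColossalAI's actual implementation) of the invariant the message describes: at reduction time, Gemini expects each parameter's chunk state to be `HOLD_AFTER_BWD`, i.e. the backward pass for that parameter has completed and nothing else has touched it. The `TensorState` enum and `check_reducible` helper below are hypothetical names used only for this sketch.

```python
from enum import Enum, auto


class TensorState(Enum):
    """Simplified stand-in for the chunk tensor states seen in the logs."""
    HOLD = auto()            # parameter is held, backward not yet finished
    HOLD_AFTER_BWD = auto()  # backward finished; gradient may be reduced


def check_reducible(param_name: str, state: TensorState) -> None:
    # Gradient reduction is only valid once backward has produced the
    # gradient for this parameter. Any other state suggests some torch
    # operation acted on the parameter outside the tracked autograd path.
    if state is not TensorState.HOLD_AFTER_BWD:
        raise RuntimeError(
            f"Parameter {param_name} failed at the gradient reduction. "
            "Some unsupported torch function is operated upon this parameter."
        )


# A parameter still in HOLD at reduction time triggers the error:
try:
    check_reducible("model.lm.parameters", TensorState.HOLD)
except RuntimeError as e:
    print("raised:", e)

# A parameter in HOLD_AFTER_BWD passes silently:
check_reducible("model.lm.parameters", TensorState.HOLD_AFTER_BWD)
```

In other words, seeing `HOLD` where `HOLD_AFTER_BWD` was expected usually means some layer or operation in the model bypassed the hooks Gemini uses to track the backward pass.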
Environment
No response
Hi @Youly172 Could you please provide more details to help us reproduce it? e.g., the example script, command, environment, and any changes you made?
I also encountered the same error. Did you manage to resolve it later on?
I'm using colossalai version 0.3.0 and hit the same error: `RuntimeError: Parameter "tor_bond_conv.batch_norm.bias" failed at the gradient reduction. Some unsupported torch function is operated upon this parameter.` In `gemini_plugin.py` I found a comment noting that zero support in colossalai is currently not optimal, along with the commented-out line `model = nn.SyncBatchNorm.convert_sync_batchnorm(model, None)`. I suspected the Batch Normalization layers in my model were the cause, but even after uncommenting that line the error persists unchanged.
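For anyone wanting to try the SyncBatchNorm workaround mentioned above in isolation, here is a small self-contained sketch of what `nn.SyncBatchNorm.convert_sync_batchnorm` does: it recursively replaces every `BatchNorm*d` module with a `SyncBatchNorm` (the `Net` model below is a made-up example, not the reporter's model). The conversion itself runs without a distributed process group; only a training-mode forward pass requires one.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Toy model containing a BatchNorm layer, for demonstration only."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.batch_norm = nn.BatchNorm2d(8)

    def forward(self, x):
        return self.batch_norm(self.conv(x))


model = Net()
# Replace every BatchNorm*d layer with SyncBatchNorm. Passing None as the
# process_group means the default (world) group is used at runtime.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model, None)
print(type(model.batch_norm).__name__)  # SyncBatchNorm
```

If the error persists even after this conversion, the BatchNorm layers may not be the culprit; it is worth checking for other operations on parameters outside the module's `forward` (e.g. manual in-place updates), which Gemini's bookkeeping cannot track.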