顾孙炎 comments

Results: 4 comments by 顾孙炎

CPT pretrain problem

> I'm sure that I use transformers to load roberta_zh, but the model I downloaded has wrong parameter names. Can you give me a link to download the correct roberta_zh?

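> This is expected. The code currently sets `device_map="auto"`, which automatically triggers model parallelism when multiple GPUs are available: the model's layers are distributed across different GPUs, which saves GPU memory and allows a larger batch size. In my own tests it was faster than data parallel.
>
> If you want to switch to data parallel, change `device_map="auto"` to `device_map={'':torch.cuda.current_device()}`.

Hi, I'm training with model parallel, and training hung after 29 steps, with GPU utilization at 0% and CPU utilization at 100%. Have you run into this?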
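As a rough illustration of the two settings quoted above, the choice can be wrapped in a small helper. This is only a sketch: the function name `choose_device_map` and its `parallelism` argument are hypothetical and not part of the repository's code; the returned values are the two `device_map` settings mentioned in the quote.

```python
def choose_device_map(parallelism: str, device_index: int = 0):
    """Return a value for the `device_map` argument of `from_pretrained`.

    Hypothetical helper for illustration only.
    - "model": "auto" lets the layers be spread over all visible GPUs.
    - "data":  {"": device_index} places the whole model on one GPU, so each
               process can hold its own full copy.
    """
    if parallelism == "model":
        return "auto"
    if parallelism == "data":
        return {"": device_index}
    raise ValueError(f"unknown parallelism mode: {parallelism!r}")


# Usage sketch (assumes torch and transformers are installed and that
# `model_name` points at a real checkpoint; not executed here):
#
#   import torch
#   from transformers import AutoModelForCausalLM
#   model = AutoModelForCausalLM.from_pretrained(
#       model_name,
#       device_map=choose_device_map("data", torch.cuda.current_device()),
#   )
```

With the data-parallel setting, each process needs enough GPU memory for the full model, which is the trade-off the quoted reply describes.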

Multi-GPU training doesn't seem to run concurrently?

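> I haven't run into it so far. Is there any error message?

No error message. For now it looks like the problem is in the data_collator: chatglm runs fine, while Baichuan, which uses DataCollatorForLanguageModeling, hangs. I'm using V100 GPUs.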

Multi-GPU training doesn't seem to run concurrently?

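> > > This is expected. The code currently sets `device_map="auto"`, which automatically triggers model parallelism when multiple GPUs are available: the model's layers are distributed across different GPUs, which saves GPU memory and allows a larger batch size. In my own tests it was faster than data parallel.
> > >
> > > If you want to switch to data parallel, change `device_map="auto"` to `device_map={'':torch.cuda.current_device()}`.
> >
> > Hi, I'm training with model parallel, and training hung after 29 steps, with GPU utilization at 0% and CPU utilization at 100%. Have you run into this?
>
> ...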