Silentssss

Results: 13 comments of Silentssss

I had written the parameters wrong; here is the command I actually ran: `torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 train.py --depth=16 --bs=384 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1 --afuse=False`

However, during training it also errors out when it goes through xformers' memory_efficient_attention: q and k have dtype float32 while v is float16. ![image](https://github.com/FoundationVision/VAR/assets/108161275/4f972ac2-1c03-448a-b4f7-92b8090faad7)
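A workaround often suggested for this kind of mixed-dtype failure is to cast q, k, and v to a common dtype before calling memory_efficient_attention. The snippet below is only a sketch of that idea; the wrapper function and tensor names are hypothetical and not taken from the VAR code.

```python
import torch
from xformers.ops import memory_efficient_attention


def attention_with_matching_dtypes(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # xformers' memory_efficient_attention requires q, k, and v to share one dtype;
    # under fp16 training parts of the graph may still be float32, so align them here.
    common_dtype = v.dtype if v.dtype in (torch.float16, torch.bfloat16) else q.dtype
    q = q.to(common_dtype)
    k = k.to(common_dtype)
    v = v.to(common_dtype)
    return memory_efficient_attention(q, k, v)
```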

> I ran into the same problem and found a relevant explanation in the DeepSpeed issues: https://github.com/microsoft/DeepSpeed/issues/3234. ZeRO stage 3 supports zero.init while stages 1 and 2 do not; changing the stage in deepspeed.json to 3 fixed it for me. ![image](https://github.com/baichuan-inc/Baichuan-7B/assets/108161275/00667787-3e62-4d38-a3cb-58028ab740ea)

After modifying it as you described, I get a new error. Did you run into this as well?
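For reference, the change described in the quoted comment amounts to switching the ZeRO stage in the DeepSpeed config to 3. Below is only a minimal sketch of such a config passed as a dict to deepspeed.initialize; the other keys and values are placeholders, not the actual deepspeed.json from that project.

```python
import deepspeed
import torch.nn as nn

# Sketch only: a minimal DeepSpeed config with ZeRO stage 3 enabled, which is what
# the linked issue says is required for zero.init; other values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},     # stages 1 and 2 do not support zero.init
}


def init_engine(model: nn.Module):
    # Build a DeepSpeed engine from the stage-3 config above.
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )
    return engine, optimizer
```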

Even with a large logging interval, the loss shouldn't swing this much. I see the default parameters you provide are for 2p, and I changed it to 8p, so the learning rate and batch size probably also need to be adjusted. I'll modify them and try training again.
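One common rule of thumb when moving from 2 devices to 8 is the linear scaling rule: keep the per-device batch size and scale the base learning rate by the ratio of global batch sizes. This is only a generic sketch, not the project's own recipe, and the values and variable names are made up for illustration.

```python
def scale_hyperparams(base_lr: float, per_gpu_bs: int, base_gpus: int = 2, new_gpus: int = 8):
    """Linear scaling rule: the learning rate grows with the effective (global) batch size."""
    base_global_bs = per_gpu_bs * base_gpus
    new_global_bs = per_gpu_bs * new_gpus
    scaled_lr = base_lr * (new_global_bs / base_global_bs)
    return scaled_lr, new_global_bs


# Example: a base lr tuned for 2 GPUs, reused on 8 GPUs (placeholder numbers).
lr, global_bs = scale_hyperparams(base_lr=1e-4, per_gpu_bs=48)
print(f"scaled lr = {lr:.2e}, global batch size = {global_bs}")
```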

Are these results from training with 2p? On my side, with 8p the loss has already jumped above 1.

About this parameter: I have a single A100 node with 8 devices, while the parameters you provide default to training on 2 devices. ![image](https://github.com/NUS-HPC-AI-Lab/OpenDiT/assets/108161275/64c11101-cd6a-4bc3-bddf-779f4e8eb62d)

Then are there parameters for a single machine, e.g. for single-node 2-GPU or single-node 8-GPU training? I currently only have one machine.

OK, thank you very much.

An error occurred when I used the LibriSpeech dataset for training. Here are my error and command: ![image](https://user-images.githubusercontent.com/108161275/203016230-93544b7b-e85f-4bc7-b997-4e5562661698.png) ![image](https://user-images.githubusercontent.com/108161275/203015064-e6bae0a9-b504-4a55-8da0-91d9ed363262.png) ![image](https://user-images.githubusercontent.com/108161275/203015823-7f876702-08e6-45da-8712-2e150d35d13c.png)

> Hi, I believe that pynd was reorganized sometime in the past few years, and pynd.ndutils likely no longer exists. You might need to search for the function you want...
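If it helps, one way to "search for the function you want" is to list what the installed package currently exposes. The sketch below uses only standard-library introspection and assumes nothing about pynd's current layout; the search term is a placeholder.

```python
import importlib
import pkgutil

# List the submodules the installed `pynd` package still provides, then scan
# their attributes for a name you remember from the old ndutils module.
pkg = importlib.import_module("pynd")
search_term = "contour"  # placeholder: substring of the function name you are looking for
for mod_info in pkgutil.iter_modules(pkg.__path__):
    module = importlib.import_module(f"pynd.{mod_info.name}")
    matches = [name for name in dir(module) if search_term in name.lower()]
    if matches:
        print(mod_info.name, matches)
```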