Silentssss

Results: 13 comments of Silentssss

I had written the parameters wrong; here is the command I actually ran: `torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 train.py --depth=16 --bs=384 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1 --afuse=False`

However, during training it also errors out when it goes through xformers' memory_efficient_attention: q and k have dtype float32 while v is float16. ![image](https://github.com/FoundationVision/VAR/assets/108161275/4f972ac2-1c03-448a-b4f7-92b8090faad7)
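A workaround often suggested for this kind of mixed-dtype failure is to cast q, k, and v to a common dtype before calling memory_efficient_attention. The snippet below is only a sketch of that idea; the wrapper function and tensor names are hypothetical and not taken from the VAR code.

```python
import torch
from xformers.ops import memory_efficient_attention


def attention_with_matching_dtypes(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # xformers' memory_efficient_attention requires q, k, and v to share one dtype;
    # under fp16 training parts of the graph may still be float32, so align them here.
    common_dtype = v.dtype if v.dtype in (torch.float16, torch.bfloat16) else q.dtype
    q = q.to(common_dtype)
    k = k.to(common_dtype)
    v = v.to(common_dtype)
    return memory_efficient_attention(q, k, v)
```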

> I ran into the same problem and found a relevant explanation in the DeepSpeed issues: https://github.com/microsoft/DeepSpeed/issues/3234. ZeRO stage 3 supports zero.init while stages 1 and 2 do not; changing the stage in deepspeed.json to 3 fixed it for me. ![image](https://github.com/baichuan-inc/Baichuan-7B/assets/108161275/00667787-3e62-4d38-a3cb-58028ab740ea)

After modifying it as you described, I get a new error. Did you run into this as well?
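For reference, the change described in the quoted comment amounts to switching the ZeRO stage in the DeepSpeed config to 3. Below is only a minimal sketch of such a config passed as a dict to deepspeed.initialize; the other keys and values are placeholders, not the actual deepspeed.json from that project.

```python
import deepspeed
import torch.nn as nn

# Sketch only: a minimal DeepSpeed config with ZeRO stage 3 enabled, which is what
# the linked issue says is required for zero.init; other values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},     # stages 1 and 2 do not support zero.init
}


def init_engine(model: nn.Module):
    # Build a DeepSpeed engine from the stage-3 config above.
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )
    return engine, optimizer
```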

Even with a large logging interval, the loss shouldn't swing this much. I see the default parameters you provide are for 2p, and I changed it to 8p, so the learning rate and batch size probably also need to be adjusted. I'll modify them and try training again.
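One common rule of thumb when moving from 2 devices to 8 is the linear scaling rule: keep the per-device batch size and scale the base learning rate by the ratio of global batch sizes. This is only a generic sketch, not the project's own recipe, and the values and variable names are made up for illustration.

```python
def scale_hyperparams(base_lr: float, per_gpu_bs: int, base_gpus: int = 2, new_gpus: int = 8):
    """Linear scaling rule: the learning rate grows with the effective (global) batch size."""
    base_global_bs = per_gpu_bs * base_gpus
    new_global_bs = per_gpu_bs * new_gpus
    scaled_lr = base_lr * (new_global_bs / base_global_bs)
    return scaled_lr, new_global_bs


# Example: a base lr tuned for 2 GPUs, reused on 8 GPUs (placeholder numbers).
lr, global_bs = scale_hyperparams(base_lr=1e-4, per_gpu_bs=48)
print(f"scaled lr = {lr:.2e}, global batch size = {global_bs}")
```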

Are these results from training with 2p? On my side, with 8p the loss has already jumped above 1.

About this parameter: I have a single A100 node with 8 devices, while the parameters you provide default to training on 2 devices. ![image](https://github.com/NUS-HPC-AI-Lab/OpenDiT/assets/108161275/64c11101-cd6a-4bc3-bddf-779f4e8eb62d)

Then are there parameters for a single machine, e.g. for single-node 2-GPU or single-node 8-GPU training? I currently only have one machine.

OK, thank you very much.

An error occurred when I used the LibriSpeech dataset for training. Here are my error and command: ![image](https://user-images.githubusercontent.com/108161275/203016230-93544b7b-e85f-4bc7-b997-4e5562661698.png) ![image](https://user-images.githubusercontent.com/108161275/203015064-e6bae0a9-b504-4a55-8da0-91d9ed363262.png) ![image](https://user-images.githubusercontent.com/108161275/203015823-7f876702-08e6-45da-8712-2e150d35d13c.png)

> Hi, I believe that pynd was reorganized sometime in the past few years, and pynd.ndutils likely no longer exists. You might need to search for the function you want...
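If it helps, one way to "search for the function you want" is to list what the installed package currently exposes. The sketch below uses only standard-library introspection and assumes nothing about pynd's current layout; the search term is a placeholder.

```python
import importlib
import pkgutil

# List the submodules the installed `pynd` package still provides, then scan
# their attributes for a name you remember from the old ndutils module.
pkg = importlib.import_module("pynd")
search_term = "contour"  # placeholder: substring of the function name you are looking for
for mod_info in pkgutil.iter_modules(pkg.__path__):
    module = importlib.import_module(f"pynd.{mod_info.name}")
    matches = [name for name in dir(module) if search_term in name.lower()]
    if matches:
        print(mod_info.name, matches)
```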