Jeremy-lf
This is really disappointing.
Hi, thanks for your reply. The issue above has been solved! I have now designed a new network and am training it on the ImageNet dataset, but I have encountered a new...
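> Hello, please check the Ascend error logs under /usr/local/Ascend and see what error messages they contain.

The current situation is that training alone, without evaluation, works fine, but if evaluation runs partway through training, this error is reported. It looks like a problem with switching between training and evaluation. I could not find the corresponding log in the directory you mentioned.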
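> > > Hello, please check the Ascend error logs under /usr/local/Ascend and see what error messages they contain.
> >
> > The current situation is that training alone, without evaluation, works fine, but if evaluation runs partway through training, this error is reported. It looks like a problem with switching between training and evaluation. I could not find the corresponding log in the directory you mentioned.
>
> I gave the wrong directory just now; the correct one is /root/ascend/log/debug/plog/. You can delete the earlier plog files and test whether running evaluation on its own reports an error. If the same error appears, run `cd /root/ascend/log/debug/plog/ && grep ERROR * -C 20` and check the related error messages.

Evaluation on its own does not report an error; the error only occurs when evaluating during training, and it is the one shown in the image.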
> > When I use zero3_offload.json, it reports an error:
> > ```
> > [2024-01-12 06:48:37,198] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1201
> > [2024-01-12 06:48:37,199] [ERROR] [launch.py:321:sigkill_handler] ['/usr/local/bin/python', '-u', 'llava/train/train.py',...