Chudong Tian

Results: 17 comments by Chudong Tian

If I finish reading a batch of data and then train for a longer time, it fails with the error: **"request failed with error Transferred a partial file"**. Why is this...

@solomon-ma @Amanda-Barbara , I ran into the same problem. I changed the batch_size to 4 (the same as my number of GPUs), but it still stopped at "Creating test net (#0) specified by...

> I just tested the model directly with pipeline; the model outputs the entire input context as the answer. Good grief.

Bro, did you solve it?... I also want to use pipeline @basketballandlearn

Also, after finetuning gpt3 on 4 GPUs, inference on 4 GPUs hangs at: Loading extension module scaled_softmax_cuda... Detected CUDA files, patching ldflags Emitting ninja build file /opt/conda/lib/python3.7/site-packages/megatron_util/fused_kernels/build/build.ninja... Building extension module fused_mix_prec_layer_norm_cuda... Allowing ninja to set a default number of workers... (overridable by...

> > Also, after finetuning gpt3 on 4 GPUs, inference on 4 GPUs hangs at: Loading extension module scaled_softmax_cuda... Detected CUDA files, patching ldflags Emitting ninja build file /opt/conda/lib/python3.7/site-packages/megatron_util/fused_kernels/build/build.ninja... Building extension module fused_mix_prec_layer_norm_cuda... Allowing ninja to set a default number of workers......

> The GPU memory requirement for training GPT3: GPT3 1.3B only needs a V100-32G; 16G also works, but the batch size has to be very low
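The "16G works but batch size must be very low" claim can be roughly quantified. A back-of-the-envelope sketch using the standard per-parameter accounting for mixed-precision Adam (2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32 master weights and moments); the exact split varies by framework, and activation memory comes on top:

```python
# Rough static memory estimate for finetuning a 1.3B-parameter model
# with Adam in mixed precision. Activation memory grows with batch size
# and sits on top of this -- which is why a 16G card only fits a tiny bs
# without optimizations like ZeRO or gradient checkpointing.
params = 1.3e9
bytes_per_param = 2 + 2 + 12          # fp16 weights + grads + fp32 optimizer states
static_gb = params * bytes_per_param / 1024**3
print(round(static_gb, 1))            # ~19.4 GB before any activations
```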

> After training on my side, the loss is around 1; my bs is 4

I'm using the version where bs can only be 1. What else was changed in the version that allows bs greater than 1?

Also, about the --fp16 flag: if I add it, it throws a half-precision error; if I remove it, training succeeds. Could this be the reason for loss=0? Traceback (most recent call last): File "finetune.py", line 93, in main() File "finetune.py", line 85, in main trainer.train() File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train return inner_training_loop( File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py",...

> Could it be the GPU? With a V100 I get the half-precision error, but with a 3090 it's fine. Really strange; I suspect it's a compute-capability issue. What GPU model are you using?

Mine is also a V100
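The V100-vs-3090 difference above lines up with CUDA compute capability: the 3090 is an Ampere card (capability 8.6) that supports bf16, while the V100 is Volta (7.0) and only offers fp16. A minimal sketch of choosing a precision from that tuple (the helper name and the fp32 fallback policy are mine; in PyTorch the tuple comes from `torch.cuda.get_device_capability()`):

```python
def pick_precision(major: int, minor: int) -> str:
    """Map a CUDA compute capability to a mixed-precision mode.

    e.g. V100 is (7, 0), RTX 3090 is (8, 6).
    """
    if major >= 8:
        return "bf16"   # Ampere+: bf16 has fp32's range, avoiding fp16 overflow
    if major >= 7:
        return "fp16"   # Volta/Turing: fp16 tensor cores, but numerically fragile
    return "fp32"       # older cards: stay in full precision

print(pick_precision(7, 0))  # V100  -> fp16
print(pick_precision(8, 6))  # 3090  -> bf16
```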