Training killed with SIGKILL when restarting from a checkpoint with an FSDP model
❓ Questions and Help
What is your question?
I've pretrained a 13B GPT-3 model with FSDP following this guide. However, whenever I try to fine-tune it, providing that checkpoint as the starting point, the job is killed with the following error:
2021-09-28 20:03:28 | INFO | fairseq.trainer | Preparing to load checkpoint /workspaceblobstore/shubham/experiments/bigger/gptx.13B.dawn/checkpoint_last.pt
Traceback (most recent call last):
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/workspaceblobstore/shubham/fairseq/fairseq_cli/train.py", line 507, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/workspaceblobstore/shubham/fairseq/fairseq/distributed/utils.py", line 344, in call_main
    torch.multiprocessing.spawn(
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGKILL
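From what I can tell, a SIGKILL reported by torch.multiprocessing.spawn is delivered by the host OS rather than raised by CUDA, and on Linux that is most often the kernel OOM killer. As a first check I load the checkpoint on CPU in a standalone script and total how much host RAM it occupies, since, as far as I can tell, every spawned rank deserializes the full file on restore. This is only a rough sketch of that check; the path is my own checkpoint location and would need to be adjusted.

import torch

# Adjust to your own checkpoint location.
CKPT = "/workspaceblobstore/shubham/experiments/bigger/gptx.13B.dawn/checkpoint_last.pt"

# Load entirely on the CPU so GPU memory is not a factor in the measurement.
state = torch.load(CKPT, map_location="cpu")

def tensor_bytes(obj):
    """Recursively sum the storage size of every tensor in the checkpoint."""
    if torch.is_tensor(obj):
        return obj.numel() * obj.element_size()
    if isinstance(obj, dict):
        return sum(tensor_bytes(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return sum(tensor_bytes(v) for v in obj)
    return 0

gib = tensor_bytes(state) / 1024 ** 3
print(f"checkpoint holds ~{gib:.1f} GiB of tensors on the host")
# Multiply by the number of ranks loading it at once (16 here) to estimate
# the peak host RAM needed during the restore.

If that figure times the number of local ranks exceeds the machine's RAM, dmesg on the host should show matching oom-killer entries for the killed workers.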
Code
export OMP_NUM_THREADS=20
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15"
fairseq-train $DATADIR \
--arch transformer_lm_gpt3_13 \
--restore-file /workspaceblobstore/shubham/experiments/bigger/gptx.13B.dawn/checkpoint_last.pt \
--task language_modeling --tokens-per-sample 2048 --batch-size 8 \
--ddp-backend fully_sharded \
--fp16 --fp16-init-scale 4 \
--cpu-offload --checkpoint-activations \
--optimizer cpu_adam --adam-betas "(0.9,0.98)" \
--lr 0.00009 --lr-scheduler polynomial_decay --warmup-updates 5 --total-num-update 10 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-update 10 --no-save --log-format json --log-interval 1 \
--dropout 0.1 --relu-dropout 0.1 --attention-dropout 0.1 \
--layernorm-embedding
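For context, here is my back-of-envelope estimate of why the restore could exhaust host memory, assuming the consolidated checkpoint carries fp16 model weights plus the fp32 Adam state kept by cpu_adam (I have not verified the exact layout fairseq writes, so treat this as an upper-bound sketch rather than a measurement):

# Rough estimate only; the real checkpoint layout may differ.
params = 13e9                   # 13B-parameter model
fp16_weights = params * 2       # model weights stored in fp16
fp32_copy = params * 4          # fp32 master copy kept for mixed-precision training
adam_moments = params * 4 * 2   # exp_avg and exp_avg_sq in fp32

total_bytes = fp16_weights + fp32_copy + adam_moments
print(f"~{total_bytes / 1024 ** 3:.0f} GiB of state in a full, unsharded checkpoint")
# Sixteen ranks on one host each deserializing a checkpoint of that size
# at the same time would need far more host RAM than the machine has.

Even if the real file is much smaller than this, sixteen concurrent loads of a multi-GB checkpoint is the first thing I would rule out.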
What's your environment?
- fairseq Version: 1.0.0a0+c1624b2
- PyTorch Version: 1.9.0+cu111
- OS (e.g., Linux): Ubuntu
- How you installed fairseq (pip, source): source
- Python version: 3.8.5
- CUDA/cuDNN version: 11.0
- GPU models and configuration: Tesla V100 32GB
Hello, have you solved this problem? If so, what was the solution?
I still have the same problem. Has it been solved?