Training killed with SIGKILL when restarting from a checkpoint with an FSDP model
❓ Questions and Help
What is your question?
I've pretrained a 13B GPT-3 model with FSDP following this guide. However, whenever I try to fine-tune it, providing that checkpoint as the starting point, the job is killed with the following error:
2021-09-28 20:03:28 | INFO | fairseq.trainer | Preparing to load checkpoint /workspaceblobstore/shubham/experiments/bigger/gptx.13B.dawn/checkpoint_last.pt
Traceback (most recent call last):
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/workspaceblobstore/shubham/fairseq/fairseq_cli/train.py", line 507, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/workspaceblobstore/shubham/fairseq/fairseq/distributed/utils.py", line 344, in call_main
    torch.multiprocessing.spawn(
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/schandel/.pyenv/versions/anaconda3-2020.11/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGKILL
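From what I can tell, a SIGKILL reported by torch.multiprocessing.spawn is delivered by the host OS rather than raised by CUDA, and on Linux that is most often the kernel OOM killer. As a first check I load the checkpoint on CPU in a standalone script and total how much host RAM it occupies, since, as far as I can tell, every spawned rank deserializes the full file on restore. This is only a rough sketch of that check; the path is my own checkpoint location and would need to be adjusted.

import torch

# Adjust to your own checkpoint location.
CKPT = "/workspaceblobstore/shubham/experiments/bigger/gptx.13B.dawn/checkpoint_last.pt"

# Load entirely on the CPU so GPU memory is not a factor in the measurement.
state = torch.load(CKPT, map_location="cpu")

def tensor_bytes(obj):
    """Recursively sum the storage size of every tensor in the checkpoint."""
    if torch.is_tensor(obj):
        return obj.numel() * obj.element_size()
    if isinstance(obj, dict):
        return sum(tensor_bytes(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return sum(tensor_bytes(v) for v in obj)
    return 0

gib = tensor_bytes(state) / 1024 ** 3
print(f"checkpoint holds ~{gib:.1f} GiB of tensors on the host")
# Multiply by the number of ranks loading it at once (16 here) to estimate
# the peak host RAM needed during the restore.

If that figure times the number of local ranks exceeds the machine's RAM, dmesg on the host should show matching oom-killer entries for the killed workers.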
Code
export OMP_NUM_THREADS=20
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15"
fairseq-train $DATADIR \
--arch transformer_lm_gpt3_13 \
--restore-file /workspaceblobstore/shubham/experiments/bigger/gptx.13B.dawn/checkpoint_last.pt \
--task language_modeling --tokens-per-sample 2048 --batch-size 8 \
--ddp-backend fully_sharded \
--fp16 --fp16-init-scale 4 \
--cpu-offload --checkpoint-activations \
--optimizer cpu_adam --adam-betas "(0.9,0.98)" \
--lr 0.00009 --lr-scheduler polynomial_decay --warmup-updates 5 --total-num-update 10 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-update 10 --no-save --log-format json --log-interval 1 \
--dropout 0.1 --relu-dropout 0.1 --attention-dropout 0.1 \
--layernorm-embedding
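For context, here is my back-of-envelope estimate of why the restore could exhaust host memory, assuming the consolidated checkpoint carries fp16 model weights plus the fp32 Adam state kept by cpu_adam (I have not verified the exact layout fairseq writes, so treat this as an upper-bound sketch rather than a measurement):

# Rough estimate only; the real checkpoint layout may differ.
params = 13e9                   # 13B-parameter model
fp16_weights = params * 2       # model weights stored in fp16
fp32_copy = params * 4          # fp32 master copy kept for mixed-precision training
adam_moments = params * 4 * 2   # exp_avg and exp_avg_sq in fp32

total_bytes = fp16_weights + fp32_copy + adam_moments
print(f"~{total_bytes / 1024 ** 3:.0f} GiB of state in a full, unsharded checkpoint")
# Sixteen ranks on one host each deserializing a checkpoint of that size
# at the same time would need far more host RAM than the machine has.

Even if the real file is much smaller than this, sixteen concurrent loads of a multi-GB checkpoint is the first thing I would rule out.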
What's your environment?
- fairseq Version: 1.0.0a0+c1624b2
- PyTorch Version: 1.9.0+cu111
- OS (e.g., Linux): Ubuntu
- How you installed fairseq (pip, source): source
- Python version: 3.8.5
- CUDA/cuDNN version: 11.0
- GPU models and configuration: Tesla V100 32GB
Hello, have you solved this problem? If so, what was the solution?
I still have the same problem. Has it been solved?