
Multi-GPU training doesn't work when `--user-dir` specified

Open • st-vincent1 opened this issue 3 years ago • 1 comment

🐛 Bug

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Run training for any config, e.g.
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train fairseq/data-bin/mydatabin \
    --max-update 1000000 \
    --patience 5 \
    --ddp-backend=legacy_ddp \
    --task translation \
    --arch transformer \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --label-smoothing 0.1 --criterion label_smoothed_cross_entropy \
    --dropout 0.3 --weight-decay 0.0001 \
    --no-epoch-checkpoints \
    --save-dir fairseq/checkpoints/x \
    --max-tokens 5000 --update-freq 2 \
    --memory-efficient-fp16 \
    --seed 1 \
    --clip-norm 5.0 \
    --lr 0.0003

Training starts and proceeds correctly.

Now add --user-dir:

CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train fairseq/data-bin/mydatabin \
    --user-dir myuserdir \
    --max-update 1000000 \
    --patience 5 \
    --ddp-backend=legacy_ddp \
    ...

I get the following output:

2022-11-21 10:46:37 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <module>
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    exitcode = _main(fd, parent_sentinel)
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    exitcode = _main(fd, parent_sentinel)
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
    prepare(preparation_data)
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    prepare(preparation_data)
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 269, in run_path
    main_content = runpy.run_path(main_path,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 269, in run_path
    main_content = runpy.run_path(main_path,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 269, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 96, in _run_module_code
    return _run_module_code(code, init_globals, run_name,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 96, in _run_module_code
    return _run_module_code(code, init_globals, run_name,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 86, in _run_code
    _run_code(code, mod_globals, init_globals,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 86, in _run_code
    _run_code(code, mod_globals, init_globals,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
    exec(code, run_globals)
  File "/home/ph/miniconda3/envs/dcs_a100/bin/fairseq-train", line 5, in <module>
  File "/home/ph/miniconda3/envs/dcs_a100/bin/fairseq-train", line 5, in <module>
    exec(code, run_globals)
  File "/home/ph/miniconda3/envs/dcs_a100/bin/fairseq-train", line 5, in <module>
    from fairseq_cli.train import cli_main
    from fairseq_cli.train import cli_main
    from fairseq_cli.train import cli_main
  File "/mnt/ph/ph/ph/fairseq/fairseq_cli/train.py", line 30, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq_cli/train.py", line 30, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq_cli/train.py", line 30, in <module>
Traceback (most recent call last):
    from fairseq import checkpoint_utils, options, quantization_utils, tasks, utils
    from fairseq import checkpoint_utils, options, quantization_utils, tasks, utils
  File "/mnt/ph/ph/ph/fairseq/fairseq/checkpoint_utils.py", line 29, in <module>
    from fairseq import checkpoint_utils, options, quantization_utils, tasks, utils
  File "/mnt/ph/ph/ph/fairseq/fairseq/checkpoint_utils.py", line 29, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq/checkpoint_utils.py", line 29, in <module>
  File "<string>", line 1, in <module>
    from fairseq.models import FairseqDecoder, FairseqEncoder
    from fairseq.models import FairseqDecoder, FairseqEncoder
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/__init__.py", line 235, in <module>
    from fairseq.models import FairseqDecoder, FairseqEncoder
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/__init__.py", line 235, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/__init__.py", line 235, in <module>
    import_models(models_dir, "fairseq.models")
    import_models(models_dir, "fairseq.models")
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/__init__.py", line 217, in import_models
    import_models(models_dir, "fairseq.models")
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/__init__.py", line 217, in import_models
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/__init__.py", line 217, in import_models
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    importlib.import_module(namespace + "." + model_name)
    importlib.import_module(namespace + "." + model_name)
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/importlib/__init__.py", line 126, in import_module
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/importlib/__init__.py", line 126, in import_module
    importlib.import_module(namespace + "." + model_name)
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/hubert/__init__.py", line 6, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/hubert/__init__.py", line 6, in <module>
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/hubert/__init__.py", line 6, in <module>
    from .hubert import *  # noqa
    from .hubert import *  # noqa
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/hubert/hubert.py", line 20, in <module>
    from .hubert import *  # noqa
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/hubert/hubert.py", line 20, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/hubert/hubert.py", line 20, in <module>
    from fairseq.models.wav2vec.wav2vec2 import (
    from fairseq.models.wav2vec.wav2vec2 import (
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/wav2vec/__init__.py", line 6, in <module>
    from fairseq.models.wav2vec.wav2vec2 import (
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/wav2vec/__init__.py", line 6, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/wav2vec/__init__.py", line 6, in <module>
    from .wav2vec import *  # noqa
    from .wav2vec import *  # noqa
    from .wav2vec import *  # noqa
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/wav2vec/wav2vec.py", line 25, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/wav2vec/wav2vec.py", line 25, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/wav2vec/wav2vec.py", line 25, in <module>
    from fairseq.tasks import FairseqTask
    from fairseq.tasks import FairseqTask
  File "/mnt/ph/ph/ph/fairseq/fairseq/tasks/__init__.py", line 15, in <module>
    from fairseq.tasks import FairseqTask
  File "/mnt/ph/ph/ph/fairseq/fairseq/tasks/__init__.py", line 15, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq/tasks/__init__.py", line 15, in <module>
    exitcode = _main(fd, parent_sentinel)
    from .fairseq_task import FairseqTask, LegacyFairseqTask  # noqa
    from .fairseq_task import FairseqTask, LegacyFairseqTask  # noqa
  File "/mnt/ph/ph/ph/fairseq/fairseq/tasks/fairseq_task.py", line 13, in <module>
    from .fairseq_task import FairseqTask, LegacyFairseqTask  # noqa
  File "/mnt/ph/ph/ph/fairseq/fairseq/tasks/fairseq_task.py", line 13, in <module>
  File "/mnt/ph/ph/ph/fairseq/fairseq/tasks/fairseq_task.py", line 13, in <module>
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    from fairseq import metrics, search, tokenizer, utils
    from fairseq import metrics, search, tokenizer, utils
    from fairseq import metrics, search, tokenizer, utils
ImportError: cannot import name 'metrics' from 'fairseq' (unknown location)
ImportError: cannot import name 'metrics' from 'fairseq' (unknown location)
ImportError: cannot import name 'metrics' from 'fairseq' (unknown location)
    prepare(preparation_data)
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 269, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ph/miniconda3/envs/dcs_a100/bin/fairseq-train", line 5, in <module>
    from fairseq_cli.train import cli_main
  File "/mnt/ph/ph/ph/fairseq/fairseq_cli/train.py", line 30, in <module>
    from fairseq import checkpoint_utils, options, quantization_utils, tasks, utils
  File "/mnt/ph/ph/ph/fairseq/fairseq/checkpoint_utils.py", line 29, in <module>
    from fairseq.models import FairseqDecoder, FairseqEncoder
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/__init__.py", line 235, in <module>
    import_models(models_dir, "fairseq.models")
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/__init__.py", line 217, in import_models
    importlib.import_module(namespace + "." + model_name)
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/hubert/__init__.py", line 6, in <module>
    from .hubert import *  # noqa
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/hubert/hubert.py", line 20, in <module>
    from fairseq.models.wav2vec.wav2vec2 import (
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/wav2vec/__init__.py", line 6, in <module>
    from .wav2vec import *  # noqa
  File "/mnt/ph/ph/ph/fairseq/fairseq/models/wav2vec/wav2vec.py", line 25, in <module>
    from fairseq.tasks import FairseqTask
  File "/mnt/ph/ph/ph/fairseq/fairseq/tasks/__init__.py", line 15, in <module>
    from .fairseq_task import FairseqTask, LegacyFairseqTask  # noqa
  File "/mnt/ph/ph/ph/fairseq/fairseq/tasks/fairseq_task.py", line 13, in <module>
    from fairseq import metrics, search, tokenizer, utils
ImportError: cannot import name 'metrics' from 'fairseq' (unknown location)
Traceback (most recent call last):
  File "/home/ph/miniconda3/envs/dcs_a100/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/mnt/ph/ph/ph/fairseq/fairseq_cli/train.py", line 557, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/mnt/ph/ph/ph/fairseq/fairseq/distributed/utils.py", line 344, in call_main
    torch.multiprocessing.spawn(
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/ph/miniconda3/envs/dcs_a100/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 149, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
srun: error: gpu-node005: task 0: Exited with exit code 1

Expected behavior

I'd expect multi-GPU training to behave the same when --user-dir is specified, i.e. when running with custom code. Note that passing --user-dir triggers this error even if training does not use any module from that directory.
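
Since the error appears even when nothing from the user directory is used, it may be worth checking how the name `fairseq` itself resolves. The `(unknown location)` suffix in the ImportError is what Python typically prints when the package being imported from has no file path, for example a namespace package (a directory without `__init__.py` picked up from `sys.path`). Whether that is what happens here is not confirmed by this report. The sketch below is a hypothetical diagnostic helper (`describe_package` is not part of fairseq) that uses only the standard library:

```python
import importlib.util


def describe_package(name):
    """Return how the top-level package `name` resolves on the current sys.path."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"name": name, "found": False, "origin": None,
                "locations": [], "namespace": False}
    locations = list(spec.submodule_search_locations or [])
    # A namespace package has search locations but no origin file
    # (older Pythons report the origin as the string "namespace").
    namespace = bool(locations) and spec.origin in (None, "namespace")
    return {"name": name, "found": True, "origin": spec.origin,
            "locations": locations, "namespace": namespace}


if __name__ == "__main__":
    # "fairseq" is the package named in the traceback; run this from the same
    # working directory and environment as the failing fairseq-train command.
    print(describe_package("fairseq"))
```

This only inspects the interpreter it runs in. The workers in the traceback are started through `multiprocessing` spawn and may see a different `sys.path`, so a clean result here does not rule out a path problem in the child processes.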

Environment

  • fairseq Version (e.g., 1.0 or main): main
  • PyTorch Version (e.g., 1.0): 1.12
  • OS (e.g., Linux): CentOS 7
  • How you installed fairseq (pip, source): --editable
  • Build command you used (if compiling from source): pip install --editable .
  • Python version: 3.10
  • CUDA/cuDNN version: 11.6
  • GPU models and configuration: A100 x4
  • Any other relevant information:

st-vincent1, Nov 21 '22 10:11

Update: what seems to have worked is moving my user directory myuserdir into fairseq/examples, or symlinking it there. For some reason fairseq with DDP does not like it when the user dir is outside the fairseq directory itself. Any chance this could be fixed?
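
A minimal sketch of the symlink variant of this workaround, assuming the link is created under the `examples` directory of the fairseq checkout and that `--user-dir` is then pointed at the link. The helper name `link_user_dir` and the example paths are illustrative only, not fairseq functionality:

```python
import os
from pathlib import Path


def link_user_dir(user_dir, fairseq_repo):
    """Symlink `user_dir` into <fairseq_repo>/examples and return the link path."""
    target = Path(user_dir).resolve()
    link = Path(fairseq_repo) / "examples" / target.name
    # Leave anything that already exists at the link path untouched.
    if link.is_symlink() or link.exists():
        return link
    os.symlink(target, link, target_is_directory=True)
    return link


if __name__ == "__main__":
    # Paths follow the report: the checkout lives in ./fairseq and the
    # custom code in ./myuserdir. Only attempt the link if both exist.
    repo, user_dir = Path("fairseq"), Path("myuserdir")
    if (repo / "examples").is_dir() and user_dir.is_dir():
        print(link_user_dir(user_dir, repo))
```

The printed path (for the report's layout, `fairseq/examples/myuserdir`) would then be the value given to `--user-dir`. This only sidesteps the failure and says nothing about why a user directory outside the checkout breaks the spawned workers.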

st-vincent1, Nov 21 '22 23:11