[BUG]: timed out when using 64 GPUs.
🐛 Describe the bug
I am experimenting with Gemini. The code runs fine when using 16 GPUs or fewer on a single machine, but with 64 GPUs it fails with a timeout error.
This is the error:
```
Traceback (most recent call last):
  File "train.py", line 209, in
--local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6452) of binary: /data/miniconda3/envs/env-3.8.8/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 305.97832345962524 seconds
Traceback (most recent call last):
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 906, in _exit_barrier
    store_util.barrier(
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-01-19_02:38:07
  host      : ee39c596-a0e2-439b-969a-c3cd3b647981
  rank      : 28 (local_rank: 0)
  exitcode  : 1 (pid: 6452)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
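As an aside, the deprecation warning near the top of the log asks scripts to stop relying on an `--local_rank` command-line argument and read the rank from the environment instead. A minimal sketch of that change (the helper name is hypothetical, not from the ColossalAI example):

```python
import os


def get_local_rank() -> int:
    """Hypothetical helper: torchrun / torch.distributed.elastic exports
    LOCAL_RANK in each worker's environment; fall back to 0 when the
    script is run as a single process."""
    return int(os.environ.get("LOCAL_RANK", 0))
```

This replaces an `argparse` `--local_rank` flag, which newer launchers no longer pass.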
Environment
No response
Hi. Are you using the example code, or are you running your own project? If it's the latter, could you please provide more details, such as your launch code?
@gouchangjiang I use this script https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/gpt/gemini/run_gemini.sh with my own DataLoader.
The gpc config is:

```
BATCH_SIZE = 4
WARMUP_STEPS = 1000
TOTAL_STEPS = 2e+8
SEQ_LEN = 1024
HIDDEN_SIZE = 5120
VOCAB_SIZE = 35693
NUM_LAYERS = 40
NUM_ATTENTION_HEADS = 32
```
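For scale, a rough parameter count for this config, using the standard ~12·L·H² approximation for transformer blocks plus V·H for token embeddings (the helper function is illustrative, not part of the example code):

```python
def approx_gpt_params(num_layers: int, hidden: int, vocab: int) -> int:
    # ~12 * L * H^2 parameters in the transformer blocks (attention + MLP),
    # plus V * H for the token embedding table; ignores biases and norms.
    return 12 * num_layers * hidden ** 2 + vocab * hidden


# NUM_LAYERS=40, HIDDEN_SIZE=5120, VOCAB_SIZE=35693 -> roughly 12.8B parameters
print(approx_gpt_params(40, 5120, 35693))
```

So this is a ~12.8B-parameter model, which is consistent with needing Gemini's memory management across many GPUs.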
As far as I know, that script is for running Gemini on a single node; that's what `--standalone` means. Did you modify it to adapt it to multiple nodes?
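For reference, a multi-node launch typically replaces `--standalone` with explicit rendezvous settings. A hedged sketch with `torchrun` (the node counts, `NODE_RANK`, `MASTER_ADDR`, and port are placeholders, not values from the example script):

```shell
# Run this on each of 4 nodes, 16 GPUs per node (64 GPUs total).
# NODE_RANK is 0..3 (unique per node); MASTER_ADDR is node 0's address.
torchrun \
  --nnodes=4 \
  --nproc_per_node=16 \
  --node_rank=$NODE_RANK \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:29500 \
  train.py
```

With `--standalone`, `torchrun` instead sets up a single-node rendezvous on localhost, so the script as published cannot coordinate 64 GPUs across machines.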
Hi, may I know your start command?
I met the same issue when running examples/language/gpt/gemini/run_gemini.sh. Have you solved this? @bestbzw
The codebase has been updated substantially since then. This issue was closed due to inactivity. If you encounter similar bugs, please open a new issue. Thanks.