Distributed evaluation of Spatial457
I am trying to evaluate the llava_v1.5_7b model on the Spatial457 benchmark on an 8-GPU system, but the command below only uses 1 GPU:
python run.py --data Spatial457 --model llava_v1.5_7b
How can I run a multi-GPU evaluation?
Hi, you can run the following command:
bash scripts/run.sh --data Spatial457 --model llava_v1.5_7b
It will detect all 8 GPUs and run one model instance on each, so the evaluation will be roughly 8x faster than with the plain `python` launcher.
@kennymckormick The above command did not work for the Spatial457 dataset, but it worked for MME.
Essentially, the following works for MME:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -1)
export GPU=$gpu_count
cd /VLMEvalKit
torchrun --nproc-per-node=$gpu_count run.py --data MME --model llava_v1.5_7b
Spatial457 works after setting `NCCL_P2P_DISABLE` (ping @nahidalam):
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -1)
export GPU=$gpu_count
export NCCL_P2P_DISABLE=1
cd /VLMEvalKit
torchrun --nproc-per-node=$gpu_count run.py --data Spatial457 --model llava_v1.5_7b
Without it, Spatial457 throws the following error:
RANK: 1, LOCAL_RANK: 1, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 1
RANK: 0, LOCAL_RANK: 0, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 0
RANK: 2, LOCAL_RANK: 2, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 2
RANK: 3, LOCAL_RANK: 3, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 3
[2025-07-09 02:46:28] WARNING - RUN - run.py: main - 216: --reuse is not set, will not reuse previous (before one day) temporary files
True
True
True
True
Traceback (most recent call last):
File "/VLMEvalKit/run.py", line 515, in <module>
main()
File "/VLMEvalKit/run.py", line 271, in main
dist.barrier()
File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
Traceback (most recent call last):
File "/VLMEvalKit/run.py", line 515, in <module>
main()
File "/VLMEvalKit/run.py", line 271, in main
dist.barrier()
File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
Traceback (most recent call last):
File "/VLMEvalKit/run.py", line 515, in <module>
main()
File "/VLMEvalKit/run.py", line 271, in main
dist.barrier()
File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
Traceback (most recent call last):
File "/VLMEvalKit/run.py", line 515, in <module>
main()
File "/VLMEvalKit/run.py", line 271, in main
dist.barrier()
File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
[2025-07-09 02:46:35,906] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 31) of binary: /usr/local/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 10, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run.py FAILED
------------------------------------------------------------
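The "peer access is not supported between these two devices" failure can be checked directly from PyTorch before launching. A minimal sketch (my own diagnostic, not part of VLMEvalKit) that prints the peer-access capability for every GPU pair, assuming `torch` with CUDA support is installed:

```python
import torch

def p2p_matrix():
    """Return (i, j, ok) for every ordered pair of visible GPUs, where ok
    says whether device i can directly access the memory of device j."""
    n = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return [(i, j, torch.cuda.can_device_access_peer(i, j))
            for i in range(n) for j in range(n) if i != j]

if __name__ == "__main__":
    pairs = p2p_matrix()
    if not pairs:
        print("fewer than two visible GPUs")
    for i, j, ok in pairs:
        print(f"GPU {i} -> GPU {j}: peer access {'supported' if ok else 'UNSUPPORTED'}")
```

If any pair reports UNSUPPORTED, NCCL will hit exactly this CUDA error on its first collective unless `NCCL_P2P_DISABLE=1` is set.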
That sounds odd; using different datasets should not cause that difference. Did you run the two benchmark evaluations in exactly the same environment?
@kennymckormick
Sorry for the confusion. It turns out I tried Spatial457 on a different type of GPU due to an OOM error.
In the multi-GPU setting:
| GPU | Notes |
|---|---|
| T4 | MME works; Spatial457 OOM |
| A10G | Both work with P2P transport disabled |
| A100 | Both work as is |
Is disabling NCCL P2P not recommended?
@KranthiGV I just tested on A100 GPUs; it works without any additional setup:
torchrun --nproc-per-node=8 run.py --data Spatial457 --model llava_v1.5_7b
It will take about 24 h to finish on 8x A100 80GB :)
Okay, I did not set `NCCL_P2P_DISABLE` and it failed after 20 hours 😿
Infer llava_v1.5_7b/Spatial457, Rank 4/8: 78%|███████▊ | 2328/2969 [17:20:03<3:34:59, 20.12s/it][E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600646 milliseconds before timing out.
[2025-07-10 20:08:06] ERROR - RUN - run.py: main - 505: Model llava_v1.5_7b x Dataset Spatial457 combination failed: , skipping this combination.
Traceback (most recent call last):
File "/home/ubuntu/VLMEvalKit/run.py", line 373, in main
model = infer_data_job(
File "/home/ubuntu/VLMEvalKit/vlmeval/inference.py", line 216, in infer_data_job
assert x in data_all
AssertionError
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600646 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600646 milliseconds before timing out.
Infer llava_v1.5_7b/Spatial457, Rank 4/8: 78%|███████▊ | 2330/2969 [17:21:06<4:08:13, 23.31s/it][2025-07-10 20:09:17,978] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5051 closing signal SIGTERM
[2025-07-10 20:09:17,978] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5052 closing signal SIGTERM
[2025-07-10 20:09:17,978] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5053 closing signal SIGTERM
[2025-07-10 20:09:17,979] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5054 closing signal SIGTERM
[2025-07-10 20:09:17,979] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5055 closing signal SIGTERM
[2025-07-10 20:09:17,979] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5057 closing signal SIGTERM
[2025-07-10 20:09:19,084] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 6 (pid: 5056) of binary: /opt/conda/envs/vlmevalkit/bin/python3.10
Traceback (most recent call last):
File "/opt/conda/envs/vlmevalkit/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
run.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-07-10_20:09:17
host : ip-172-31-21-134.ec2.internal
rank : 6 (local_rank: 6)
exitcode : -6 (pid: 5056)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 5056
=====================================================
I guess this happened because I did not set `export NCCL_P2P_DISABLE=1`.
This looks like a different error. We hit the following timeout:
if WORLD_SIZE > 1:
    import torch.distributed as dist
    dist.init_process_group(
        backend='nccl',
        timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
    )
I'm guessing one of the GPUs was stuck, so the ranks could not synchronize within the timeout. Since VLMEvalKit appears to use files to actually share data between ranks, it should be safe to disable its P2P communication.
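Given that `run.py` reads the timeout from the `DIST_TIMEOUT` environment variable (default 3600 s), another possible workaround, instead of or in addition to disabling P2P, is to export a larger value before `torchrun`. A small sketch of the arithmetic, with 86400 s (24 h) as an illustrative value:

```python
import datetime
import os

# run.py builds the NCCL process-group timeout from the DIST_TIMEOUT
# env var (default 3600 s = 1 h). Exporting a larger value before
# launching, e.g. `export DIST_TIMEOUT=86400`, raises it to 24 h.
os.environ["DIST_TIMEOUT"] = "86400"
timeout = datetime.timedelta(seconds=int(os.environ.get("DIST_TIMEOUT", 3600)))
print(timeout)  # 1 day, 0:00:00
```

This only papers over a stuck rank, though; the collective still blocks until whichever rank is slowest arrives at the barrier.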
After setting `export NCCL_P2P_DISABLE=1`, it seems to fail even faster.
All the steps below:
export NCCL_P2P_DISABLE=1
To run it in the background:
nohup torchrun --nproc-per-node=8 run.py --data Spatial457 --model llava_v1.5_7b > ncclrun_spatial.log 2>&1 &
disown %1
This seems to fail faster than before:
/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/huggingface_hub/file_download.py:943: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 6.01it/s]
/home/ubuntu/VLMEvalKit/vlmeval/vlm/llava/llava.py:77: UserWarning: Following kwargs received: {'do_sample': False, 'temperature': 0, 'max_new_tokens': 2048, 'top_p': None, 'num_beams': 1, 'use_cache': True}, will use as generation config.
warnings.warn(
Infer llava_v1.5_7b/Spatial457, Rank 4/8: 0%| | 0/639 [00:00<?, ?it/s]/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:397: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `None` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
Infer llava_v1.5_7b/Spatial457, Rank 4/8: 29%|██▊ | 183/639 [59:31<58:29, 7.70s/it] [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600018 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600041 milliseconds before timing out.
[2025-07-10 21:46:31] ERROR - RUN - run.py: main - 505: Model llava_v1.5_7b x Dataset Spatial457 combination failed: , skipping this combination.
Traceback (most recent call last):
File "/home/ubuntu/VLMEvalKit/run.py", line 373, in main
model = infer_data_job(
File "/home/ubuntu/VLMEvalKit/vlmeval/inference.py", line 216, in infer_data_job
assert x in data_all
AssertionError
[E ProcessGroupNCCL.cpp:475] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600365 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600426 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600594 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600426 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600426 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600365 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600365 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600758 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600018 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600018 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600041 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600041 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600802 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600802 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600802 milliseconds before timing out.
Spatial457
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600758 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600758 milliseconds before timing out.
Spatial457
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600594 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600594 milliseconds before timing out.
[2025-07-10 21:46:37,711] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32052 closing signal SIGTERM
[2025-07-10 21:46:37,711] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32054 closing signal SIGTERM
[2025-07-10 21:46:37,711] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32056 closing signal SIGTERM
[2025-07-10 21:46:39,778] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 32050) of binary: /opt/conda/envs/vlmevalkit/bin/python3.10
Traceback (most recent call last):
File "/opt/conda/envs/vlmevalkit/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
run.py FAILED
------------------------------------------------------
Failures:
[1]:
time : 2025-07-10_21:46:37
host : ip-172-31-21-134.ec2.internal
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 32051)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 32051
[2]:
time : 2025-07-10_21:46:37
host : ip-172-31-21-134.ec2.internal
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 32053)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 32053
[3]:
time : 2025-07-10_21:46:37
host : ip-172-31-21-134.ec2.internal
rank : 5 (local_rank: 5)
exitcode : -6 (pid: 32055)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 32055
[4]:
time : 2025-07-10_21:46:37
host : ip-172-31-21-134.ec2.internal
rank : 7 (local_rank: 7)
exitcode : -6 (pid: 32057)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 32057
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-07-10_21:46:37
host : ip-172-31-21-134.ec2.internal
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 32050)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 32050
======================================================
Hi, you can run the following command:
bash scripts/run.sh --data Spatial457 --model llava_v1.5_7b
It will detect 8 GPUs and run one model instance on each GPU, thus the evaluation would be 8x faster than the python launcher.
Hi @kennymckormick, why is it that after I run the command, only model inference is parallelized, while the evaluation stage is not parallelized or accelerated?
Hi @zghhui, did you solve this problem? I encountered it as well.