
Distributed evaluation of Spatial457

Open nahidalam opened this issue 7 months ago • 11 comments

I am trying to evaluate the llava_v1.5_7b model on the Spatial457 benchmark on an 8-GPU system, but it is only using 1 GPU. The command:

python run.py --data Spatial457 --model llava_v1.5_7b

How can I run a multi-GPU evaluation?

nahidalam avatar Jul 07 '25 01:07 nahidalam

Hi, you can run the following command:

bash scripts/run.sh --data Spatial457 --model llava_v1.5_7b

It will detect all 8 GPUs and run one model instance on each, so the evaluation should be roughly 8x faster than with the python launcher.
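For reference, a minimal sketch of what such a launcher does (illustrative only; the actual logic in scripts/run.sh may differ):

```shell
# Illustrative sketch, not the real scripts/run.sh: count visible GPUs,
# fall back to 1 if nvidia-smi is unavailable, then launch one worker
# process per GPU via torchrun.
gpu_count=$(nvidia-smi --list-gpus 2>/dev/null | wc -l)
if [ "$gpu_count" -eq 0 ]; then gpu_count=1; fi
echo "launching $gpu_count worker(s)"
# torchrun --nproc-per-node="$gpu_count" run.py --data Spatial457 --model llava_v1.5_7b
```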

kennymckormick avatar Jul 07 '25 06:07 kennymckormick

@kennymckormick The above command did not work for the Spatial457 dataset, but it worked for MME.

Essentially, the following works for MME:

gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -1)
export GPU=$gpu_count

cd /VLMEvalKit
torchrun --nproc-per-node=$gpu_count run.py --data MME --model llava_v1.5_7b

It works with Spatial457 after setting NCCL_P2P_DISABLE (ping @nahidalam):

gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -1)
export GPU=$gpu_count
export NCCL_P2P_DISABLE=1

cd /VLMEvalKit
torchrun --nproc-per-node=$gpu_count run.py --data Spatial457 --model llava_v1.5_7b

Without that, Spatial457 throws an error:

RANK: 1, LOCAL_RANK: 1, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 1
RANK: 0, LOCAL_RANK: 0, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 0
RANK: 2, LOCAL_RANK: 2, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 2
RANK: 3, LOCAL_RANK: 3, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 3
[2025-07-09 02:46:28] WARNING - RUN - run.py: main - 216: --reuse is not set, will not reuse previous (before one day) temporary files
True
True
True
True
Traceback (most recent call last):
  File "/VLMEvalKit/run.py", line 515, in <module>
    main()
  File "/VLMEvalKit/run.py", line 271, in main
    dist.barrier()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
Traceback (most recent call last):
  File "/VLMEvalKit/run.py", line 515, in <module>
    main()
  File "/VLMEvalKit/run.py", line 271, in main
    dist.barrier()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
Traceback (most recent call last):
  File "/VLMEvalKit/run.py", line 515, in <module>
    main()
  File "/VLMEvalKit/run.py", line 271, in main
    dist.barrier()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
Traceback (most recent call last):
  File "/VLMEvalKit/run.py", line 515, in <module>
    main()
  File "/VLMEvalKit/run.py", line 271, in main
    dist.barrier()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
[2025-07-09 02:46:35,906] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 31) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run.py FAILED
------------------------------------------------------------

KranthiGV avatar Jul 09 '25 11:07 KranthiGV

> @kennymckormick The above command did not work for Spatial457 dataset but worked for MME. […] It works with Spatial457 after setting NCCL_P2P_DISABLE […] Without that, for Spatial457, it throws an error […]

That sounds odd; switching datasets alone should not cause that difference. Did you run the two benchmark evaluations in exactly the same environment?

kennymckormick avatar Jul 09 '25 13:07 kennymckormick

@kennymckormick

Sorry for the confusion. It turns out I ran Spatial457 on a different GPU type after hitting an OOM error.

In a multi-GPU setting:

| GPU  | Notes                                 |
| ---- | ------------------------------------- |
| T4   | MME works; Spatial457 hits OOM        |
| A10G | Both work with P2P transport disabled |
| A100 | Both work as-is                       |

Is disabling NCCL P2P not recommended?
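(Editorial note, hedged: disabling P2P is generally safe for correctness. It only makes NCCL route traffic through shared memory or host staging instead of direct GPU-to-GPU peer access over PCIe/NVLink, at some bandwidth cost; since this workload uses the process group mainly for barriers rather than bulk tensor transfer, the impact should be small. A sketch of the environment setup, with NCCL_DEBUG added as an optional diagnostic:)

```shell
# Force NCCL to avoid GPU peer-to-peer transport; correctness is unaffected,
# only collective bandwidth drops.
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO   # optional: log which transports NCCL selects
```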

KranthiGV avatar Jul 10 '25 01:07 KranthiGV

@KranthiGV I just tested on an A100 GPU. It works without any additional setup:

torchrun --nproc-per-node=8 run.py --data Spatial457 --model llava_v1.5_7b

It will take about 24 h to finish on 8x A100 80GB :)

nahidalam avatar Jul 10 '25 02:07 nahidalam

Okay, I did not set NCCL_P2P_DISABLE and it failed after 20 hours 😿

Infer llava_v1.5_7b/Spatial457, Rank 4/8:  78%|███████▊  | 2328/2969 [17:20:03<3:34:59, 20.12s/it][E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600646 milliseconds before timing out.
[2025-07-10 20:08:06] ERROR - RUN - run.py: main - 505: Model llava_v1.5_7b x Dataset Spatial457 combination failed: , skipping this combination.
Traceback (most recent call last):
  File "/home/ubuntu/VLMEvalKit/run.py", line 373, in main
    model = infer_data_job(
  File "/home/ubuntu/VLMEvalKit/vlmeval/inference.py", line 216, in infer_data_job
    assert x in data_all
AssertionError
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600646 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600646 milliseconds before timing out.
Infer llava_v1.5_7b/Spatial457, Rank 4/8:  78%|███████▊  | 2330/2969 [17:21:06<4:08:13, 23.31s/it][2025-07-10 20:09:17,978] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5051 closing signal SIGTERM
[2025-07-10 20:09:17,978] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5052 closing signal SIGTERM
[2025-07-10 20:09:17,978] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5053 closing signal SIGTERM
[2025-07-10 20:09:17,979] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5054 closing signal SIGTERM
[2025-07-10 20:09:17,979] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5055 closing signal SIGTERM
[2025-07-10 20:09:17,979] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5057 closing signal SIGTERM
[2025-07-10 20:09:19,084] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 6 (pid: 5056) of binary: /opt/conda/envs/vlmevalkit/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/vlmevalkit/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
run.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-10_20:09:17
  host      : ip-172-31-21-134.ec2.internal
  rank      : 6 (local_rank: 6)
  exitcode  : -6 (pid: 5056)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 5056
=====================================================

nahidalam avatar Jul 10 '25 20:07 nahidalam

I guess this happened because I did not run export NCCL_P2P_DISABLE=1.

nahidalam avatar Jul 10 '25 20:07 nahidalam

This looks like a different error. We hit the following timeout:

if WORLD_SIZE > 1:
    import torch.distributed as dist
    dist.init_process_group(
        backend='nccl',
        timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
    )
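(Editorial note: given that the snippet reads DIST_TIMEOUT in seconds with a 3600 s default, one workaround for long runs is simply to raise it before launching. A minimal sketch of the same environment-to-timeout logic:)

```python
import datetime
import os

# Raise the process-group timeout from the 1 h default to 4 h before launch,
# e.g. for a long Spatial457 run. The value is read in seconds.
os.environ["DIST_TIMEOUT"] = "14400"
timeout = datetime.timedelta(seconds=int(os.environ.get("DIST_TIMEOUT", 3600)))
print(int(timeout.total_seconds()))  # -> 14400
```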

I'm guessing one of the GPUs got stuck, so the ranks couldn't synchronize at the barrier. But since VLMEvalKit appears to use files to actually "synchronize data" between ranks, I think it should be safe to disable P2P communication.
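(Editorial note: a minimal sketch of that file-based sharding and merge pattern. All names here are illustrative, not VLMEvalKit's actual API: each rank writes its shard of predictions to its own file, and the files are merged afterwards, so NCCL is only needed for barriers.)

```python
import json
import os
import tempfile

def shard_indices(n, rank, world_size):
    # Rank-strided split: rank r handles samples r, r + W, r + 2W, ...
    return list(range(rank, n, world_size))

def run_rank(rank, world_size, data, out_dir):
    # Each rank writes only its own shard of "predictions" to its own file.
    preds = {i: data[i].upper() for i in shard_indices(len(data), rank, world_size)}
    with open(os.path.join(out_dir, f"rank{rank}.json"), "w") as f:
        json.dump(preds, f)

def merge(world_size, out_dir):
    # The shard files are merged on disk; no NCCL data transfer is involved.
    merged = {}
    for r in range(world_size):
        with open(os.path.join(out_dir, f"rank{r}.json")) as f:
            merged.update({int(k): v for k, v in json.load(f).items()})
    return [merged[i] for i in sorted(merged)]

data = ["a", "b", "c", "d", "e"]
with tempfile.TemporaryDirectory() as tmp:
    for r in range(2):
        run_rank(r, 2, data, tmp)
    print(merge(2, tmp))  # -> ['A', 'B', 'C', 'D', 'E']
```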

KranthiGV avatar Jul 10 '25 23:07 KranthiGV

After setting export NCCL_P2P_DISABLE=1, it seems to fail even faster.

All the steps below:

export NCCL_P2P_DISABLE=1

To run it in the background:

nohup torchrun --nproc-per-node=8 run.py --data Spatial457 --model llava_v1.5_7b > ncclrun_spatial.log 2>&1 &
disown %1

This run fails much sooner than the previous one:

/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/huggingface_hub/file_download.py:943: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.01it/s]
/home/ubuntu/VLMEvalKit/vlmeval/vlm/llava/llava.py:77: UserWarning: Following kwargs received: {'do_sample': False, 'temperature': 0, 'max_new_tokens': 2048, 'top_p': None, 'num_beams': 1, 'use_cache': True}, will use as generation config.
  warnings.warn(
Infer llava_v1.5_7b/Spatial457, Rank 4/8:   0%|          | 0/639 [00:00<?, ?it/s]/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:397: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `None` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
Infer llava_v1.5_7b/Spatial457, Rank 4/8:  29%|██▊       | 183/639 [59:31<58:29,  7.70s/it]  [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600018 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600041 milliseconds before timing out.
[2025-07-10 21:46:31] ERROR - RUN - run.py: main - 505: Model llava_v1.5_7b x Dataset Spatial457 combination failed: , skipping this combination.
Traceback (most recent call last):
  File "/home/ubuntu/VLMEvalKit/run.py", line 373, in main
    model = infer_data_job(
  File "/home/ubuntu/VLMEvalKit/vlmeval/inference.py", line 216, in infer_data_job
    assert x in data_all
AssertionError
[E ProcessGroupNCCL.cpp:475] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600365 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600426 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600594 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600426 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600426 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600365 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600365 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600758 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600018 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600018 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600041 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600041 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600802 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600802 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600802 milliseconds before timing out.
Spatial457
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600758 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600758 milliseconds before timing out.
Spatial457
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600594 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600594 milliseconds before timing out.
[2025-07-10 21:46:37,711] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32052 closing signal SIGTERM
[2025-07-10 21:46:37,711] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32054 closing signal SIGTERM
[2025-07-10 21:46:37,711] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32056 closing signal SIGTERM
[2025-07-10 21:46:39,778] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 32050) of binary: /opt/conda/envs/vlmevalkit/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/vlmevalkit/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
run.py FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2025-07-10_21:46:37
  host      : ip-172-31-21-134.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 32051)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 32051
[2]:
  time      : 2025-07-10_21:46:37
  host      : ip-172-31-21-134.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 32053)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 32053
[3]:
  time      : 2025-07-10_21:46:37
  host      : ip-172-31-21-134.ec2.internal
  rank      : 5 (local_rank: 5)
  exitcode  : -6 (pid: 32055)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 32055
[4]:
  time      : 2025-07-10_21:46:37
  host      : ip-172-31-21-134.ec2.internal
  rank      : 7 (local_rank: 7)
  exitcode  : -6 (pid: 32057)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 32057
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-10_21:46:37
  host      : ip-172-31-21-134.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 32050)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 32050
======================================================

nahidalam avatar Jul 11 '25 02:07 nahidalam

> Hi, you can run the following command: `bash scripts/run.sh --data Spatial457 --model llava_v1.5_7b` […]

Hi @kennymckormick, why is it that after I run this command, only the model inference is parallelized, while the evaluation stage is not parallelized or accelerated?

(screenshot attached)

zghhui avatar Jul 29 '25 01:07 zghhui

> Why is it that after I run the command, only the model is parallelized, but the evaluation is not parallelized and accelerated?

Hi @zghhui, did you solve this problem? I also encountered it.

jeffreylin122 avatar Aug 28 '25 10:08 jeffreylin122