
Distributed evaluation of Spatial457

Open nahidalam opened this issue 7 months ago • 11 comments

I am trying to evaluate the llava_v1.5_7b model on the Spatial457 benchmark on an 8-GPU system, but it is only using 1 GPU. The command:

python run.py --data Spatial457 --model llava_v1.5_7b

How can I run a multi-GPU evaluation?

nahidalam avatar Jul 07 '25 01:07 nahidalam

Hi, you can run the following command:

bash scripts/run.sh --data Spatial457 --model llava_v1.5_7b

It will detect all 8 GPUs and run one model instance on each, so the evaluation should be roughly 8x faster than with the python launcher.
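For reference, a minimal sketch of what such a launcher does (illustrative only; the actual logic in scripts/run.sh may differ):

```shell
# Illustrative sketch, not the real scripts/run.sh: count visible GPUs,
# fall back to 1 if nvidia-smi is unavailable, then launch one worker
# process per GPU via torchrun.
gpu_count=$(nvidia-smi --list-gpus 2>/dev/null | wc -l)
if [ "$gpu_count" -eq 0 ]; then gpu_count=1; fi
echo "launching $gpu_count worker(s)"
# torchrun --nproc-per-node="$gpu_count" run.py --data Spatial457 --model llava_v1.5_7b
```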

kennymckormick avatar Jul 07 '25 06:07 kennymckormick

@kennymckormick The above command did not work for the Spatial457 dataset, but it worked for MME.

Essentially, the following works for MME:

gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -1)
export GPU=$gpu_count

cd /VLMEvalKit
torchrun --nproc-per-node=$gpu_count run.py --data MME --model llava_v1.5_7b

It works with Spatial457 after setting NCCL_P2P_DISABLE (ping @nahidalam):

gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -1)
export GPU=$gpu_count
export NCCL_P2P_DISABLE=1

cd /VLMEvalKit
torchrun --nproc-per-node=$gpu_count run.py --data Spatial457 --model llava_v1.5_7b

Without that, Spatial457 throws an error:

RANK: 1, LOCAL_RANK: 1, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 1
RANK: 0, LOCAL_RANK: 0, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 0
RANK: 2, LOCAL_RANK: 2, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 2
RANK: 3, LOCAL_RANK: 3, WORLD_SIZE: 4,LOCAL_WORLD_SIZE: 4, CUDA_VISIBLE_DEVICES: 3
[2025-07-09 02:46:28] WARNING - RUN - run.py: main - 216: --reuse is not set, will not reuse previous (before one day) temporary files
True
True
True
True
Traceback (most recent call last):
  File "/VLMEvalKit/run.py", line 515, in <module>
    main()
  File "/VLMEvalKit/run.py", line 271, in main
    dist.barrier()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
Traceback (most recent call last):
  File "/VLMEvalKit/run.py", line 515, in <module>
    main()
  File "/VLMEvalKit/run.py", line 271, in main
    dist.barrier()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
Traceback (most recent call last):
  File "/VLMEvalKit/run.py", line 515, in <module>
    main()
  File "/VLMEvalKit/run.py", line 271, in main
    dist.barrier()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
Traceback (most recent call last):
  File "/VLMEvalKit/run.py", line 515, in <module>
    main()
  File "/VLMEvalKit/run.py", line 271, in main
    dist.barrier()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'peer access is not supported between these two devices'
[2025-07-09 02:46:35,906] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 31) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run.py FAILED
------------------------------------------------------------

KranthiGV avatar Jul 09 '25 11:07 KranthiGV

> @kennymckormick The above command did not work for Spatial457 dataset but worked for MME. […] It works with Spatial457 after setting NCCL_P2P_DISABLE […] Without that, for Spatial457, it throws an error […]

That sounds odd; switching datasets alone should not cause that difference. Did you run the two benchmark evaluations in exactly the same environment?

kennymckormick avatar Jul 09 '25 13:07 kennymckormick

@kennymckormick

Sorry for the confusion. It turns out I ran Spatial457 on a different GPU type after hitting an OOM error.

In a multi-GPU setting:

| GPU  | Notes                                 |
| ---- | ------------------------------------- |
| T4   | MME works; Spatial457 hits OOM        |
| A10G | Both work with P2P transport disabled |
| A100 | Both work as-is                       |

Is disabling NCCL P2P not recommended?
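(Editorial note, hedged: disabling P2P is generally safe for correctness. It only makes NCCL route traffic through shared memory or host staging instead of direct GPU-to-GPU peer access over PCIe/NVLink, at some bandwidth cost; since this workload uses the process group mainly for barriers rather than bulk tensor transfer, the impact should be small. A sketch of the environment setup, with NCCL_DEBUG added as an optional diagnostic:)

```shell
# Force NCCL to avoid GPU peer-to-peer transport; correctness is unaffected,
# only collective bandwidth drops.
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO   # optional: log which transports NCCL selects
```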

KranthiGV avatar Jul 10 '25 01:07 KranthiGV

@KranthiGV I just tested on an A100 GPU. It works without any additional setup:

torchrun --nproc-per-node=8 run.py --data Spatial457 --model llava_v1.5_7b

It will take about 24 h to finish on 8x A100 80GB :)

nahidalam avatar Jul 10 '25 02:07 nahidalam

Okay, I did not set NCCL_P2P_DISABLE and it failed after 20 hours 😿

Infer llava_v1.5_7b/Spatial457, Rank 4/8:  78%|███████▊  | 2328/2969 [17:20:03<3:34:59, 20.12s/it][E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600646 milliseconds before timing out.
[2025-07-10 20:08:06] ERROR - RUN - run.py: main - 505: Model llava_v1.5_7b x Dataset Spatial457 combination failed: , skipping this combination.
Traceback (most recent call last):
  File "/home/ubuntu/VLMEvalKit/run.py", line 373, in main
    model = infer_data_job(
  File "/home/ubuntu/VLMEvalKit/vlmeval/inference.py", line 216, in infer_data_job
    assert x in data_all
AssertionError
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600646 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600646 milliseconds before timing out.
Infer llava_v1.5_7b/Spatial457, Rank 4/8:  78%|███████▊  | 2330/2969 [17:21:06<4:08:13, 23.31s/it][2025-07-10 20:09:17,978] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5051 closing signal SIGTERM
[2025-07-10 20:09:17,978] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5052 closing signal SIGTERM
[2025-07-10 20:09:17,978] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5053 closing signal SIGTERM
[2025-07-10 20:09:17,979] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5054 closing signal SIGTERM
[2025-07-10 20:09:17,979] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5055 closing signal SIGTERM
[2025-07-10 20:09:17,979] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5057 closing signal SIGTERM
[2025-07-10 20:09:19,084] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 6 (pid: 5056) of binary: /opt/conda/envs/vlmevalkit/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/vlmevalkit/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
run.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-10_20:09:17
  host      : ip-172-31-21-134.ec2.internal
  rank      : 6 (local_rank: 6)
  exitcode  : -6 (pid: 5056)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 5056
=====================================================

nahidalam avatar Jul 10 '25 20:07 nahidalam

I guess this happened because I did not run export NCCL_P2P_DISABLE=1.

nahidalam avatar Jul 10 '25 20:07 nahidalam

This looks like a different error. We hit the following timeout:

if WORLD_SIZE > 1:
    import torch.distributed as dist
    dist.init_process_group(
        backend='nccl',
        timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
    )
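(Editorial note: given that the snippet reads DIST_TIMEOUT in seconds with a 3600 s default, one workaround for long runs is simply to raise it before launching. A minimal sketch of the same environment-to-timeout logic:)

```python
import datetime
import os

# Raise the process-group timeout from the 1 h default to 4 h before launch,
# e.g. for a long Spatial457 run. The value is read in seconds.
os.environ["DIST_TIMEOUT"] = "14400"
timeout = datetime.timedelta(seconds=int(os.environ.get("DIST_TIMEOUT", 3600)))
print(int(timeout.total_seconds()))  # -> 14400
```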

I'm guessing one of the GPUs got stuck, so the ranks couldn't synchronize at the barrier. But since VLMEvalKit appears to use files to actually "synchronize data" between ranks, I think it should be safe to disable P2P communication.
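(Editorial note: a minimal sketch of that file-based sharding and merge pattern. All names here are illustrative, not VLMEvalKit's actual API: each rank writes its shard of predictions to its own file, and the files are merged afterwards, so NCCL is only needed for barriers.)

```python
import json
import os
import tempfile

def shard_indices(n, rank, world_size):
    # Rank-strided split: rank r handles samples r, r + W, r + 2W, ...
    return list(range(rank, n, world_size))

def run_rank(rank, world_size, data, out_dir):
    # Each rank writes only its own shard of "predictions" to its own file.
    preds = {i: data[i].upper() for i in shard_indices(len(data), rank, world_size)}
    with open(os.path.join(out_dir, f"rank{rank}.json"), "w") as f:
        json.dump(preds, f)

def merge(world_size, out_dir):
    # The shard files are merged on disk; no NCCL data transfer is involved.
    merged = {}
    for r in range(world_size):
        with open(os.path.join(out_dir, f"rank{r}.json")) as f:
            merged.update({int(k): v for k, v in json.load(f).items()})
    return [merged[i] for i in sorted(merged)]

data = ["a", "b", "c", "d", "e"]
with tempfile.TemporaryDirectory() as tmp:
    for r in range(2):
        run_rank(r, 2, data, tmp)
    print(merge(2, tmp))  # -> ['A', 'B', 'C', 'D', 'E']
```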

KranthiGV avatar Jul 10 '25 23:07 KranthiGV

After setting export NCCL_P2P_DISABLE=1, it seems to fail even faster.

All the steps below:

export NCCL_P2P_DISABLE=1

To run it in the background:

nohup torchrun --nproc-per-node=8 run.py --data Spatial457 --model llava_v1.5_7b > ncclrun_spatial.log 2>&1 &
disown %1

This run fails much sooner than the previous one:

/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/huggingface_hub/file_download.py:943: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.01it/s]
/home/ubuntu/VLMEvalKit/vlmeval/vlm/llava/llava.py:77: UserWarning: Following kwargs received: {'do_sample': False, 'temperature': 0, 'max_new_tokens': 2048, 'top_p': None, 'num_beams': 1, 'use_cache': True}, will use as generation config.
  warnings.warn(
Infer llava_v1.5_7b/Spatial457, Rank 4/8:   0%|          | 0/639 [00:00<?, ?it/s]/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:397: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `None` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
Infer llava_v1.5_7b/Spatial457, Rank 4/8:  29%|██▊       | 183/639 [59:31<58:29,  7.70s/it]  [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600018 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600041 milliseconds before timing out.
[2025-07-10 21:46:31] ERROR - RUN - run.py: main - 505: Model llava_v1.5_7b x Dataset Spatial457 combination failed: , skipping this combination.
Traceback (most recent call last):
  File "/home/ubuntu/VLMEvalKit/run.py", line 373, in main
    model = infer_data_job(
  File "/home/ubuntu/VLMEvalKit/vlmeval/inference.py", line 216, in infer_data_job
    assert x in data_all
AssertionError
[E ProcessGroupNCCL.cpp:475] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600365 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600426 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600594 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600426 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600426 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600365 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600365 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600758 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600018 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600018 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600041 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600041 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600802 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600802 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600802 milliseconds before timing out.
Spatial457
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600758 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600758 milliseconds before timing out.
Spatial457
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600594 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3600000) ran for 3600594 milliseconds before timing out.
[2025-07-10 21:46:37,711] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32052 closing signal SIGTERM
[2025-07-10 21:46:37,711] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32054 closing signal SIGTERM
[2025-07-10 21:46:37,711] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 32056 closing signal SIGTERM
[2025-07-10 21:46:39,778] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 32050) of binary: /opt/conda/envs/vlmevalkit/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/vlmevalkit/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/vlmevalkit/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
run.py FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2025-07-10_21:46:37
  host      : ip-172-31-21-134.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 32051)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 32051
[2]:
  time      : 2025-07-10_21:46:37
  host      : ip-172-31-21-134.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 32053)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 32053
[3]:
  time      : 2025-07-10_21:46:37
  host      : ip-172-31-21-134.ec2.internal
  rank      : 5 (local_rank: 5)
  exitcode  : -6 (pid: 32055)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 32055
[4]:
  time      : 2025-07-10_21:46:37
  host      : ip-172-31-21-134.ec2.internal
  rank      : 7 (local_rank: 7)
  exitcode  : -6 (pid: 32057)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 32057
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-10_21:46:37
  host      : ip-172-31-21-134.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 32050)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 32050
======================================================

nahidalam avatar Jul 11 '25 02:07 nahidalam

> Hi, you can run the following command: `bash scripts/run.sh --data Spatial457 --model llava_v1.5_7b` […]

Hi @kennymckormick, why is it that after I run this command, only the model inference is parallelized, while the evaluation stage is not parallelized or accelerated?

(screenshot attached)

zghhui avatar Jul 29 '25 01:07 zghhui

> Why is it that after I run the command, only the model is parallelized, but the evaluation is not parallelized and accelerated?

Hi @zghhui, did you solve this problem? I also encountered it.

jeffreylin122 avatar Aug 28 '25 10:08 jeffreylin122