[BUG] Distributed optimizer doesn't work when the data parallel size is an odd number.
Describe the bug
When the data parallel size is odd and the distributed optimizer is enabled, training stops with the following error.
[Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
return group.send([tensor], group_dst_rank, tag)
return pg.recv([tensor], group_src_rank, tag)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
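As the error message suggests, rerunning with NCCL debug logging enabled gives more detail; the corresponding exports are already present (commented out) in the job script below:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING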
To Reproduce
The job script used is shown below. (Environment: each node has 8 H100 GPUs, and 3 nodes are used, so there are 3 x 8 = 24 GPUs in total.)
#!/bin/bash
#SBATCH --job-name=llama-2-13b
#SBATCH --partition=a3
#SBATCH --exclusive
#SBATCH --nodes 3
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --output=outputs/llama-2-13b/%x-%j.out
#SBATCH --error=outputs/llama-2-13b/%x-%j.out
set -e
# module load
module load cuda/12.1
module load cudnn/8.9.7
module load hpcx/2.17.1
# open file limit
ulimit -n 65536 1048576
# python virtualenv
source .env/bin/activate
# Important TCPX environment variables
UDS_PATH="/run/tcpx-${SLURM_JOB_ID}"
# Only use TCPX for multi-node jobs.
[[ "${SLURM_JOB_NUM_NODES}" -gt 1 ]] && export USE_TCPX=yes || export USE_TCPX=no
if [[ ${USE_TCPX} = "yes" ]]; then
# Set up NCCL Environment variables
export NCCL_NET=GPUDirectTCPX_v7
# These network interfaces use Ubuntu's consistent naming scheme. See
# https://manpages.ubuntu.com/manpages/focal/man7/systemd.net-naming-scheme.7.html
export NCCL_SOCKET_IFNAME=enp0s12
export NCCL_GPUDIRECTTCPX_CTRL_DEV=enp0s12
export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=enp6s0,enp12s0,enp134s0,enp140s0
export NCCL_CROSS_NIC=0
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_SOCKET_NTHREADS=1
export NCCL_DYNAMIC_CHUNK_SIZE=524288
export NCCL_P2P_NET_CHUNKSIZE=524288
export NCCL_P2P_PCI_CHUNKSIZE=524288
export NCCL_P2P_NVL_CHUNKSIZE=1048576
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NCCL_NET_GDR_LEVEL=PIX
export NCCL_P2P_PXN_LEVEL=0
export NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX=${UDS_PATH}
export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000
export NCCL_GPUDIRECTTCPX_TX_BINDINGS="enp6s0:8-21,112-125;enp12s0:8-21,112-125;enp134s0:60-73,164-177;enp140s0:60-73,164-177"
export NCCL_GPUDIRECTTCPX_RX_BINDINGS="enp6s0:22-35,126-139;enp12s0:22-35,126-139;enp134s0:74-87,178-191;enp140s0:74-87,178-191"
export LD_LIBRARY_PATH=/var/lib/tcpx/lib64:${LD_LIBRARY_PATH}
else
unset NCCL_NET
fi
# The following two can be useful for debugging
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING
# distributed settings
export MASTER_ADDR=$(scontrol show hostname $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=$((10000 + ($SLURM_JOBID % 50000)))
echo "MASTER_ADDR=${MASTER_ADDR}"
# hostfile
export NUM_GPU_PER_NODE=8
NODE_TYPE="H100"
NUM_NODES=$SLURM_JOB_NUM_NODES
NUM_GPUS=$((${NUM_NODES} * ${NUM_GPU_PER_NODE}))
# model config
# llama-2-13b: https://huggingface.co/meta-llama/Llama-2-13b-hf/blob/main/config.json
HIDDEN_SIZE=5120
FFN_HIDDEN_SIZE=13824 # intermediate size (HuggingFace)
NUM_LAYERS=40
NUM_HEADS=40
SEQ_LENGTH=4096
# distributed settings
TENSOR_PARALLEL_SIZE=2 # fixed
PIPELINE_PARALLEL_SIZE=4 # num layers 40: Llama-2 13B
CONTEXT_PARALLEL_SIZE=1
DATA_PARALLEL_SIZE=$((${NUM_GPUS} / (${TENSOR_PARALLEL_SIZE} * ${PIPELINE_PARALLEL_SIZE})))
# training config
MICRO_BATCH_SIZE=2
GLOBAL_BATCH_SIZE=1024
TRAIN_STEPS=500679
LR_DECAY_ITERS=452995
LR=3e-4
MIN_LR=3E-5
LR_WARMUP_STEPS=2000
WEIGHT_DECAY=0.1
GRAD_CLIP=1
# model config
TOKENIZER_MODEL=/home/ext_kazuki_fujii_rio_gsic_titech/llm-jp-tokenizer/models/ver3.0/llm-jp-tokenizer-100k.ver3.0b1.model
CHECKPOINT_SAVE_DIR=/home/ext_kazuki_fujii_rio_gsic_titech/checkpoints/Llama-2-13b/tp${TENSOR_PARALLEL_SIZE}-pp${PIPELINE_PARALLEL_SIZE}-ct${CONTEXT_PARALLEL_SIZE}-bench
mkdir -p ${CHECKPOINT_SAVE_DIR}
# data config
DATASET_DIR=/home/ext_kazuki_fujii_rio_gsic_titech/datasets/training_resharded_tokenize_ver3.0
TRAIN_DATA_PATH=""
TRAIN_DATA_PATH=""
# code stack
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 14486363187 ${DATASET_DIR}/train/code/stack_0000.jsonl_text_document"
# en wiki
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 4744259830 ${DATASET_DIR}/train/en/wiki_0000.jsonl_text_document"
# job name
JOB_NAME="llama-2-13b-base-okazaki-lab-cc-${NODE_TYPE}-${NUM_NODES}node-${NUM_GPUS}gpu-${SEQ_LENGTH}s-DP=${DATA_PARALLEL_SIZE}-TP=${TENSOR_PARALLEL_SIZE}-PP=${PIPELINE_PARALLEL_SIZE}-BS=${GLOBAL_BATCH_SIZE}-LR=${LR}-MINLR=${MIN_LR}-WARMUP=${LR_WARMUP_STEPS}-WD=${WEIGHT_DECAY}-GC=${GRAD_CLIP}-z-loss-overlap-param-gather-grad-reduce"
# --norm-epsilon 1e-5 : config.json (RMS norm)
CHECKPOINT_ARGS="--load ${CHECKPOINT_SAVE_DIR}"
# run
mpirun -np $NUM_GPUS \
--npernode $NUM_GPU_PER_NODE \
-x MASTER_ADDR=$MASTER_ADDR \
-x MASTER_PORT=$MASTER_PORT \
-x CUDA_DEVICE_MAX_CONNECTIONS=1 \
-bind-to none -map-by slot \
-x PATH \
python pretrain_gpt.py \
--tensor-model-parallel-size ${TENSOR_PARALLEL_SIZE} \
--pipeline-model-parallel-size ${PIPELINE_PARALLEL_SIZE} \
--context-parallel-size ${CONTEXT_PARALLEL_SIZE} \
--sequence-parallel \
--use-distributed-optimizer \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--ffn-hidden-size ${FFN_HIDDEN_SIZE} \
--num-attention-heads ${NUM_HEADS} \
--seq-length ${SEQ_LENGTH} \
--max-position-embeddings ${SEQ_LENGTH} \
--micro-batch-size ${MICRO_BATCH_SIZE} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--train-iters ${TRAIN_STEPS} \
--tokenizer-type Llama2Tokenizer \
--tokenizer-model ${TOKENIZER_MODEL} \
${CHECKPOINT_ARGS} \
--save ${CHECKPOINT_SAVE_DIR} \
--data-path ${TRAIN_DATA_PATH} \
--split 998,1,1 \
--distributed-backend nccl \
--init-method-std 0.02 \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--lr-decay-iters ${LR_DECAY_ITERS} \
--weight-decay ${WEIGHT_DECAY} \
--clip-grad ${GRAD_CLIP} \
--lr-warmup-iters ${LR_WARMUP_STEPS} \
--optimizer adam \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--log-interval 1 \
--save-interval 10 \
--eval-interval 100 \
--eval-iters 10 \
--bf16 \
--untie-embeddings-and-output-weights \
--position-embedding-type rope \
--disable-bias-linear \
--use-mcore-models \
--normalization RMSNorm \
--norm-epsilon 1e-5 \
--no-masked-softmax-fusion \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--swiglu \
--use-flash-attn \
--recompute-activations \
--recompute-granularity "selective" \
--attention-softmax-in-fp32 \
--transformer-impl "transformer_engine" \
--fp8-format 'hybrid' \
--use-mpi \
--use-z-loss \
--use-embedding-scaling \
--log-throughput \
--wandb-name ${JOB_NAME} \
--wandb-project "Llama-2-13B" \
--wandb-entity "nii-geniac"
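For reference, the data-parallel size implied by the settings above is odd, which is the condition that triggers the bug; this only restates the script's own arithmetic:
# 3 nodes x 8 H100 GPUs = 24 GPUs, TP=2, PP=4
echo $(( 24 / (2 * 4) ))   # prints 3 -> DATA_PARALLEL_SIZE=3 (odd)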
Expected behavior
Training should be able to proceed even when the data parallel size is odd while the distributed optimizer is used.
Stack trace/logs
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-04-23 05:51:42
done with setup ...
training ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (697.72, 704.61)
train/valid/test-data-iterators-setup ..........: (781007.39, 781817.93)
[before the start of training step] datetime: 2024-04-23 05:51:42
All failing ranks hit the same error at the first pipeline-parallel point-to-point communication of the first training step; the same traceback is repeated, interleaved, across ranks. Representative receive-side and send-side traces are shown below (some ranks report schedules.py line 1244 instead of 1269 for the same frame).

Traceback (most recent call last):
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/pretrain_gpt.py", line 229, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 317, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 1194, in train
    train_step(forward_step_func,
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 589, in train_step
    losses_reduced = forward_backward_func(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1269, in forward_backward_pipelining_without_interleaving
    input_tensor = recv_forward(recv_tensor_shapes, config)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1062, in recv_forward
    input_tensors.append(p2p_communication.recv_forward(tensor_shape, config))
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 361, in recv_forward
    input_tensor, _, _ = _communicate(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 329, in _communicate
    reqs = p2p_func(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 162, in _batched_p2p_ops
    reqs = torch.distributed.batch_isend_irecv(ops)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1865, in batch_isend_irecv
    p2p_op.op(p2p_op.tensor, p2p_op.peer, p2p_op.group, p2p_op.tag)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1631, in irecv
    return pg.recv([tensor], group_src_rank, tag)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
[Proxy Service 3] Failed to execute operation Connect from rank 3, retcode 3

Traceback (most recent call last):
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/pretrain_gpt.py", line 229, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 317, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 1194, in train
    train_step(forward_step_func,
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 589, in train_step
    losses_reduced = forward_backward_func(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1258, in forward_backward_pipelining_without_interleaving
    send_forward(output_tensor, send_tensor_shapes, config)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1082, in send_forward
    p2p_communication.send_forward(output_tensor, config)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 406, in send_forward
    _communicate(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 329, in _communicate
    reqs = p2p_func(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 162, in _batched_p2p_ops
    reqs = torch.distributed.batch_isend_irecv(ops)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1865, in batch_isend_irecv
    p2p_op.op(p2p_op.tensor, p2p_op.peer, p2p_op.group, p2p_op.tag)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1591, in isend
    return group.send([tensor], group_dst_rank, tag)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
[Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3

The NCCL proxy reports the same connect failure from ranks 0-3:
[Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
[Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
[Proxy Service 2] Failed to execute operation Connect from rank 2, retcode 3
[Proxy Service 3] Failed to execute operation Connect from rank 3, retcode 3
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[8911,1],11]
Exit code: 1
--------------------------------------------------------------------------
Environment (please complete the following information):
- Megatron-LM commit ID: 2196398f5252ead6f036b06d45f7acb89b1308da
- PyTorch version: 2.2.0
- CUDA version: 12.1
- NCCL version: 2.19.3
Did you use a PyTorch NGC container? If so, what version?
@deepakn94
Sorry for the late reply. No, I don't use a PyTorch NGC container.
@deepakn94
We did not encounter this problem when using the university's supercomputer (TSUBAME-4, H100) instead of the A3 instances on Google Cloud Platform. I'm not sure what the issue is, but it might be due to our use of GPUDirectTCPX_v7 on GCP. Thank you for your response.
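In case it helps anyone hitting the same thing, a rerun that forces the non-TCPX branch of the job script above should show whether GPUDirectTCPX_v7 is involved; this is only a sketch that replaces the script's own node-count toggle:
# In the job script above, hard-code the toggle instead of deriving it from the node count:
# [[ "${SLURM_JOB_NUM_NODES}" -gt 1 ]] && export USE_TCPX=yes || export USE_TCPX=no
export USE_TCPX=no   # forces the else branch: unset NCCL_NET, no GPUDirectTCPX_v7 setup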
You're welcome, my friend. If you need anything, just message here.