
[BUG] Distributed optimizer doesn't work when the data parallel size is an odd number.

okoge-kaz opened this issue on Apr 23, 2024 · 2 comments

Describe the bug

When the data parallel size is odd and the distributed optimizer is enabled, training stops with the following error.

[Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
    return group.send([tensor], group_dst_rank, tag)
    return pg.recv([tensor], group_src_rank, tag)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:

To Reproduce

The job script being used is as follows. (Environment: each node has 8 H100 GPUs, and 3 nodes are used, so there are 3 x 8 = 24 GPUs in total.)

#!/bin/bash
#SBATCH --job-name=llama-2-13b
#SBATCH --partition=a3
#SBATCH --exclusive
#SBATCH --nodes 3
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --output=outputs/llama-2-13b/%x-%j.out
#SBATCH --error=outputs/llama-2-13b/%x-%j.out

set -e

# module load
module load cuda/12.1
module load cudnn/8.9.7
module load hpcx/2.17.1

# open file limit
ulimit -n 65536 1048576

# python virtualenv
source .env/bin/activate

# Important TCPX environment variables
UDS_PATH="/run/tcpx-${SLURM_JOB_ID}"

# Only use TCPX for multi-node jobs.
[[ "${SLURM_JOB_NUM_NODES}" -gt 1 ]] && export USE_TCPX=yes || export USE_TCPX=no

# Only use TCPX for multi-node jobs.
if [[ ${USE_TCPX} = "yes" ]]; then
  # Set up NCCL Environment variables
  export NCCL_NET=GPUDirectTCPX_v7
  # These network interfaces use Ubuntu's consistent naming scheme. See
  # https://manpages.ubuntu.com/manpages/focal/man7/systemd.net-naming-scheme.7.html
  export NCCL_SOCKET_IFNAME=enp0s12
  export NCCL_GPUDIRECTTCPX_CTRL_DEV=enp0s12
  export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=enp6s0,enp12s0,enp134s0,enp140s0
  export NCCL_CROSS_NIC=0
  export NCCL_ALGO=Ring
  export NCCL_PROTO=Simple
  export NCCL_NSOCKS_PERTHREAD=4
  export NCCL_SOCKET_NTHREADS=1
  export NCCL_DYNAMIC_CHUNK_SIZE=524288
  export NCCL_P2P_NET_CHUNKSIZE=524288
  export NCCL_P2P_PCI_CHUNKSIZE=524288
  export NCCL_P2P_NVL_CHUNKSIZE=1048576
  export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  export NCCL_NET_GDR_LEVEL=PIX
  export NCCL_P2P_PXN_LEVEL=0
  export NCCL_GPUDIRECTTCPX_UNIX_CLIENT_PREFIX=${UDS_PATH}
  export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000
  export NCCL_GPUDIRECTTCPX_TX_BINDINGS="enp6s0:8-21,112-125;enp12s0:8-21,112-125;enp134s0:60-73,164-177;enp140s0:60-73,164-177"
  export NCCL_GPUDIRECTTCPX_RX_BINDINGS="enp6s0:22-35,126-139;enp12s0:22-35,126-139;enp134s0:74-87,178-191;enp140s0:74-87,178-191"

  export LD_LIBRARY_PATH=/var/lib/tcpx/lib64:${LD_LIBRARY_PATH}
else
  unset NCCL_NET
fi

# The following two can be useful for debugging
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING

# distributed settings
export MASTER_ADDR=$(scontrol show hostname $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=$((10000 + ($SLURM_JOBID % 50000)))

echo "MASTER_ADDR=${MASTER_ADDR}"

# hostfile
export NUM_GPU_PER_NODE=8
NODE_TYPE="H100"


NUM_NODES=$SLURM_JOB_NUM_NODES
NUM_GPUS=$((${NUM_NODES} * ${NUM_GPU_PER_NODE}))


# model config
# llama-2-13b: https://huggingface.co/meta-llama/Llama-2-13b-hf/blob/main/config.json
HIDDEN_SIZE=5120
FFN_HIDDEN_SIZE=13824 # intermediate size (HuggingFace)
NUM_LAYERS=40
NUM_HEADS=40
SEQ_LENGTH=4096

# distributed settings
TENSOR_PARALLEL_SIZE=2  # fixed
PIPELINE_PARALLEL_SIZE=4 # num layers 40: Llama-2 13B
CONTEXT_PARALLEL_SIZE=1
DATA_PARALLEL_SIZE=$((${NUM_GPUS} / (${TENSOR_PARALLEL_SIZE} * ${PIPELINE_PARALLEL_SIZE})))
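# Note: with NUM_GPUS=24 (3 nodes x 8 GPUs), TENSOR_PARALLEL_SIZE=2 and
# PIPELINE_PARALLEL_SIZE=4, this evaluates to 24 / (2 * 4) = 3, i.e. an odd
# data parallel size, which is the configuration that triggers the error above.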

# training config
MICRO_BATCH_SIZE=2
GLOBAL_BATCH_SIZE=1024
TRAIN_STEPS=500679
LR_DECAY_ITERS=452995

LR=3e-4
MIN_LR=3E-5
LR_WARMUP_STEPS=2000
WEIGHT_DECAY=0.1
GRAD_CLIP=1

# model config
TOKENIZER_MODEL=/home/ext_kazuki_fujii_rio_gsic_titech/llm-jp-tokenizer/models/ver3.0/llm-jp-tokenizer-100k.ver3.0b1.model
CHECKPOINT_SAVE_DIR=/home/ext_kazuki_fujii_rio_gsic_titech/checkpoints/Llama-2-13b/tp${TENSOR_PARALLEL_SIZE}-pp${PIPELINE_PARALLEL_SIZE}-ct${CONTEXT_PARALLEL_SIZE}-bench

mkdir -p ${CHECKPOINT_SAVE_DIR}

# data config
DATASET_DIR=/home/ext_kazuki_fujii_rio_gsic_titech/datasets/training_resharded_tokenize_ver3.0

TRAIN_DATA_PATH=""

TRAIN_DATA_PATH=""

# code stack
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 14486363187 ${DATASET_DIR}/train/code/stack_0000.jsonl_text_document"

# en wiki
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 4744259830 ${DATASET_DIR}/train/en/wiki_0000.jsonl_text_document"

# job name
JOB_NAME="llama-2-13b-base-okazaki-lab-cc-${NODE_TYPE}-${NUM_NODES}node-${NUM_GPUS}gpu-${SEQ_LENGTH}s-DP=${DATA_PARALLEL_SIZE}-TP=${TENSOR_PARALLEL_SIZE}-PP=${PIPELINE_PARALLEL_SIZE}-BS=${GLOBAL_BATCH_SIZE}-LR=${LR}-MINLR=${MIN_LR}-WARMUP=${LR_WARMUP_STEPS}-WD=${WEIGHT_DECAY}-GC=${GRAD_CLIP}-z-loss-overlap-param-gather-grad-reduce"

# --norm-epsilon 1e-5 : config.json (RMS norm)

CHECKPOINT_ARGS="--load ${CHECKPOINT_SAVE_DIR}"

# run
mpirun -np $NUM_GPUS \
  --npernode $NUM_GPU_PER_NODE \
  -x MASTER_ADDR=$MASTER_ADDR \
  -x MASTER_PORT=$MASTER_PORT \
  -x CUDA_DEVICE_MAX_CONNECTIONS=1 \
  -bind-to none -map-by slot \
  -x PATH \
  python pretrain_gpt.py \
  --tensor-model-parallel-size ${TENSOR_PARALLEL_SIZE} \
  --pipeline-model-parallel-size ${PIPELINE_PARALLEL_SIZE} \
  --context-parallel-size ${CONTEXT_PARALLEL_SIZE} \
  --sequence-parallel \
  --use-distributed-optimizer \
  --num-layers ${NUM_LAYERS} \
  --hidden-size ${HIDDEN_SIZE} \
  --ffn-hidden-size ${FFN_HIDDEN_SIZE} \
  --num-attention-heads ${NUM_HEADS} \
  --seq-length ${SEQ_LENGTH} \
  --max-position-embeddings ${SEQ_LENGTH} \
  --micro-batch-size ${MICRO_BATCH_SIZE} \
  --global-batch-size ${GLOBAL_BATCH_SIZE} \
  --train-iters ${TRAIN_STEPS} \
  --tokenizer-type Llama2Tokenizer \
  --tokenizer-model ${TOKENIZER_MODEL} \
  ${CHECKPOINT_ARGS} \
  --save ${CHECKPOINT_SAVE_DIR} \
  --data-path ${TRAIN_DATA_PATH} \
  --split 998,1,1 \
  --distributed-backend nccl \
  --init-method-std 0.02 \
  --lr ${LR} \
  --min-lr ${MIN_LR} \
  --lr-decay-style cosine \
  --lr-decay-iters ${LR_DECAY_ITERS} \
  --weight-decay ${WEIGHT_DECAY} \
  --clip-grad ${GRAD_CLIP} \
  --lr-warmup-iters ${LR_WARMUP_STEPS} \
  --optimizer adam \
  --adam-beta1 0.9 \
  --adam-beta2 0.95 \
  --log-interval 1 \
  --save-interval 10 \
  --eval-interval 100 \
  --eval-iters 10 \
  --bf16 \
  --untie-embeddings-and-output-weights \
  --position-embedding-type rope \
  --disable-bias-linear \
  --use-mcore-models \
  --normalization RMSNorm \
  --norm-epsilon 1e-5 \
  --no-masked-softmax-fusion \
  --attention-dropout 0.0 \
  --hidden-dropout 0.0 \
  --swiglu \
  --use-flash-attn \
  --recompute-activations \
  --recompute-granularity "selective" \
  --attention-softmax-in-fp32 \
  --transformer-impl "transformer_engine" \
  --fp8-format 'hybrid' \
  --use-mpi \
  --use-z-loss \
  --use-embedding-scaling \
  --log-throughput \
  --wandb-name ${JOB_NAME} \
  --wandb-project "Llama-2-13B" \
  --wandb-entity "nii-geniac"

Expected behavior

Training should be able to proceed even when the data parallel size is odd and the distributed optimizer is enabled.
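
As a stop-gap sanity check (a sketch only, not part of the original job script or of Megatron-LM), the launcher could warn before submission when the derived data parallel size is odd:

# Hypothetical guard, reusing the DATA_PARALLEL_SIZE computed in the job script above.
if (( DATA_PARALLEL_SIZE % 2 == 1 )); then
  echo "WARNING: DATA_PARALLEL_SIZE=${DATA_PARALLEL_SIZE} is odd;" \
       "--use-distributed-optimizer failed with NCCL errors in this setup." >&2
fi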

Stack trace/logs

> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-04-23 05:51:42 
done with setup ...
training ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (697.72, 704.61)
    train/valid/test-data-iterators-setup ..........: (781007.39, 781817.93)
[before the start of training step] datetime: 2024-04-23 05:51:42 
Traceback (most recent call last):
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/pretrain_gpt.py", line 229, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 317, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 1194, in train
    train_step(forward_step_func,
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 589, in train_step
    losses_reduced = forward_backward_func(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1269, in forward_backward_pipelining_without_interleaving
    input_tensor = recv_forward(recv_tensor_shapes, config)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1062, in recv_forward
    input_tensors.append(p2p_communication.recv_forward(tensor_shape, config))
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 361, in recv_forward
    input_tensor, _, _ = _communicate(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 329, in _communicate
    reqs = p2p_func(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 162, in _batched_p2p_ops
    reqs = torch.distributed.batch_isend_irecv(ops)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1865, in batch_isend_irecv
    p2p_op.op(p2p_op.tensor, p2p_op.peer, p2p_op.group, p2p_op.tag)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1631, in irecv
    return pg.recv([tensor], group_src_rank, tag)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
[Proxy Service 3] Failed to execute operation Connect from rank 3, retcode 3

Traceback (most recent call last):
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/pretrain_gpt.py", line 229, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 317, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 1194, in train
    train_step(forward_step_func,
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/training/training.py", line 589, in train_step
    losses_reduced = forward_backward_func(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1258, in forward_backward_pipelining_without_interleaving
    send_forward(output_tensor, send_tensor_shapes, config)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1082, in send_forward
    p2p_communication.send_forward(output_tensor, config)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 406, in send_forward
    _communicate(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 329, in _communicate
    reqs = p2p_func(
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 162, in _batched_p2p_ops
    reqs = torch.distributed.batch_isend_irecv(ops)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1865, in batch_isend_irecv
    p2p_op.op(p2p_op.tensor, p2p_op.peer, p2p_op.group, p2p_op.tag)
  File "/home/ext_kazuki_fujii_rio_gsic_titech/src/Megatron-LM/.env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1591, in isend
    return group.send([tensor], group_dst_rank, tag)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
[Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3

The same tracebacks (some ranks fail at schedules.py line 1244, also in recv_forward) are emitted, interleaved, by every failing rank, together with the matching proxy errors:

[Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
[Proxy Service 2] Failed to execute operation Connect from rank 2, retcode 3

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
wandb: - 0.042 MB of 0.042 MB uploaded
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[8911,1],11]
  Exit code:    1
--------------------------------------------------------------------------

Environment (please complete the following information):

  • Megatron-LM commit ID: 2196398f5252ead6f036b06d45f7acb89b1308da
  • PyTorch version: 2.2.0
  • CUDA version: 12.1
  • NCCL version: 2.19.3

(okoge-kaz, Apr 23, 2024)

Did you use a PyTorch NGC container? If so, what version?

(deepakn94, Apr 26, 2024)

@deepakn94

Sorry for the late reply; I am not using a PyTorch NGC container.

(okoge-kaz, Apr 28, 2024)

@deepakn94

We did not encounter the problem when using the university's supercomputer (H100: TSUBAME-4) instead of the A3 instances on Google Cloud Platform. I'm not sure what the exact cause is, but it might be related to our use of GPUDirectTCPX_v7 on GCP. Thank you for your response.
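
If it helps to narrow this down, one isolation step (just a sketch; we have not verified it on the A3 instances) would be to rerun the same job with the TCPX plugin disabled and NCCL debug logging enabled, reusing the switches already present in the job script above, and compare the NCCL initialization logs:

# Sketch: force the non-TCPX branch of the job script (which leaves NCCL_NET
# unset, falling back to NCCL's default transport) and enable the debug
# variables that are commented out in the script.
export USE_TCPX=no
unset NCCL_NET
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING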

(okoge-kaz, May 6, 2024)

You're welcome, my friend; if you need anything, just reach out here.

On Mon, May 6, 2024, 15:00, Kazuki Fujii @.***> wrote:

Closed #792 https://github.com/NVIDIA/Megatron-LM/issues/792 as completed.


(felipeliliti, May 6, 2024)