
Question with forward_backward_pipelining_without_interleaving in Megatron-LM Pipeline

Open · Hongjie1Chu opened this issue 1 year ago · 1 comment

I encountered a problem when using the Megatron pipeline. The function I am using is forward_backward_pipelining_without_interleaving. In this pipeline function, each pipeline stage calls forward_step for the forward pass:

output_tensor = forward_step(forward_step_func, data_iterator, model, input_tensor, losses_reduced)

The input for the forward pass should be the output from the previous stage. However, in the megatron/schedule.py file, the forward_step function is defined as follows:

unwrapped_model.set_input_tensor(input_tensor)
output_tensor, loss_func = forward_step_func(data_iterator, model)

This implies that every stage still pulls a batch from the dataset and processes it, which seems to contradict the idea of pipelining, where only the first stage should consume raw input data. Could you please explain the rationale behind this design?
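For concreteness, here is a minimal sketch of how I understand set_input_tensor is supposed to interact with the per-stage forward pass (the class and names below are illustrative, not the actual Megatron-LM implementation):

import torch.nn as nn

class StageModel(nn.Module):
    # One pipeline stage: embeds tokens on the first stage, otherwise consumes
    # the activations injected by the schedule via set_input_tensor().
    def __init__(self, is_first_stage, hidden_size=1024, vocab_size=50257):
        super().__init__()
        self.is_first_stage = is_first_stage
        self.embedding = nn.Embedding(vocab_size, hidden_size) if is_first_stage else None
        self.block = nn.Linear(hidden_size, hidden_size)  # stand-in for the transformer layers
        self.input_tensor = None

    def set_input_tensor(self, input_tensor):
        # Called by the schedule with the tensor received from the previous stage.
        self.input_tensor = input_tensor

    def forward(self, tokens):
        if self.is_first_stage:
            hidden = self.embedding(tokens)   # first stage reads the batch
        else:
            hidden = self.input_tensor        # later stages ignore the token ids
        return self.block(hidden)

If something like this is what happens inside the model, then calling forward_step_func(data_iterator, model) on every stage would only be consuming the batch for things like the attention mask and the labels on the last stage, rather than re-embedding the data on every stage; that is exactly what I would like to have confirmed.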

Code in pretrain_gpt.py:

[screenshot of the relevant pretrain_gpt.py code]

Here are my results: [screenshot of the training output]

My configuration:

GPUS_PER_NODE=4

# Change for multinode config

MASTER_ADDR=172.20.20.220
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

DATA_PATH=data/my-gpt2_text_document
CHECKPOINT_PATH=model/model_optim_rng.pt
MODEL_PATH=model/output/pp
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gpt.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 4 \
    --num-layers 12 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --micro-batch-size 16 \
    --global-batch-size 64 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --train-iters 1
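
If I read the batch arguments correctly, the data-parallel size here is WORLD_SIZE / (tensor-parallel × pipeline-parallel) = 4 / (1 × 4) = 1, so each iteration should split the global batch into 64 / 16 = 4 micro-batches that flow through the 4 pipeline stages. A quick sanity check of that arithmetic (my own helper, not Megatron-LM code):

# Sanity check of the micro-batch arithmetic for the configuration above.
# This is my own helper, not part of Megatron-LM.
def num_microbatches(global_batch_size, micro_batch_size, data_parallel_size):
    assert global_batch_size % (micro_batch_size * data_parallel_size) == 0
    return global_batch_size // (micro_batch_size * data_parallel_size)

world_size = 4                                   # GPUS_PER_NODE * NNODES
data_parallel = world_size // (1 * 4)            # tensor-parallel=1, pipeline-parallel=4 -> 1
print(num_microbatches(64, 16, data_parallel))   # -> 4 micro-batches per iteration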


Hongjie1Chu · May 17 '24 06:05

Marking as stale. No activity in 60 days.

github-actions[bot] · Jul 16 '24 18:07