Compute model param count once
This PR fixes #203 by using the `args` global variable holder to save and access the model parameter count during the flops calculation. This makes sense because the number of model parameters stays constant throughout training, so there is no need to call the param-counting function at every iteration.
Additionally fixes: https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/123
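For illustration, here is a minimal sketch of the caching pattern (the attribute and function names are stand-ins, not necessarily the exact identifiers used in the PR; `model` is assumed to be a torch `nn.Module`, and the real code gets its args via `megatron.get_args()`):

```python
from types import SimpleNamespace

# Hypothetical stand-in for Megatron's global args holder.
args = SimpleNamespace(parameters_in_billions=None)

def get_parameters_in_billions(model):
    # Walk every parameter tensor; this is the call we no longer
    # want to repeat on every iteration.
    return sum(p.numel() for p in model.parameters()) / 1e9

def log_throughput(model, iter_time_s, tokens_per_iter):
    # Compute the count on the first call only, then reuse the cache.
    if args.parameters_in_billions is None:
        args.parameters_in_billions = get_parameters_in_billions(model)
    # Rough 6*N*D estimate of flops per iteration (fwd + bwd).
    flops = 6 * args.parameters_in_billions * 1e9 * tokens_per_iter
    return flops / iter_time_s / 1e12  # TFLOPs
```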
@TevenLeScao, while we are at it, why do we print:
- estimated model parameters:
- estimated model parameters without embeddings:
for the whole model?
What's the practical point of the first one? I'm trying to think of what can be done with this information; I'm surely missing something...
Is this info of any practical use for the pipe stages, when we print the same per process?
And I think the 2nd one is misleading, as it's not w/o embeddings; it's without the repeated count of tied params.
May I suggest that it says instead:
estimated unique model parameters.
Thanks.
And if we agree on the latter, we can adjust the constant name to be `args.uniq_parameters_in_billions`.
Hey! The number without embeddings is actually just that: the number of non-embedding parameters, not the number of unique parameters. This is the relevant number to estimate the loss against, which is why we're using it in the tensorboards for arch-and-scaling.
I wanted to also print the exact model parameter count (including embeddings and without double-counts) but haven't managed to get unique counting to work.
OK, then something is very wrong in the reporting. E.g., for the 104B model it prints:
```
estimated model parameters without embeddings: 103.368064
estimated model parameters: 125.22432
```
The formula we use comes from https://github.com/bigscience-workshop/bigscience/blob/master/experiments/gpt2-utils.md#calculate-model-size
For:
```
NLAYERS=64
NHIDDEN=11600
SEQ_LEN=2048
VOCAB_SIZE=50257

# w/ embeddings
perl -le 'print 64*(12*11600**2+13*11600) + 50257*11600+2048*11600 + 2*11600'
103958492400

# w/o embeddings
perl -le 'print 64*(12*11600**2+13*11600) + 2*11600'
103351754400
```
The difference is only ~0.6B params, so embedding params are ignored when doing the math by hand, and the simplified formula `12*layers*hidden**2` is used instead.
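For reference, here is the same math in Python with the terms annotated (the per-term gloss is my reading of the formula, not something spelled out in gpt2-utils.md):

```python
# Breakdown of the formula from gpt2-utils.md, with each term glossed.
l, h, s, V = 64, 11600, 2048, 50257  # layers, hidden, seq len, vocab

per_layer = 12 * h**2 + 13 * h  # 4h^2 attention (Q,K,V,out) + 8h^2 MLP
                                # (h->4h->h), plus ~13h biases/layernorms
embeddings = V * h + s * h      # word embeddings + position embeddings

print(l * per_layer + embeddings + 2 * h)  # w/ embeddings:  103958492400
print(l * per_layer + 2 * h)               # w/o embeddings: 103351754400
```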
So the reported `estimated model parameters without embeddings: 103.368064` matches the hand math quite closely.
But there is no 125B anywhere in sight, so `estimated model parameters: 125.22432` appears to be very wrong; the last number has to be ~104B, not 125B, to match the math. We have ~21B extraneous params.
@TevenLeScao, I looked deeper and we have wrong counting for PP; see below.
First, let's do the approximate math by hand for this config:
```bash
NLAYERS=8
NHIDDEN=512
SEQ_LEN=1024
VOCAB_SIZE=50257

EMB_PARAMS=$((VOCAB_SIZE * NHIDDEN + SEQ_LEN * NHIDDEN))
BLOCKS_PARAMS=$((NLAYERS * (12 * NHIDDEN**2 + 13 * NHIDDEN)))
echo yes-emb param count $EMB_PARAMS
echo non-emb param count $BLOCKS_PARAMS
```
gives:
```
yes-emb param count 26255872
non-emb param count 25219072
sum: 51474944
```
So we know we are dealing with a 52M model, with about half the params taken by word embeddings.
Now run the start of the training and look at the logs (everything but the relevant parts filtered out). The uppercase logs are from deepspeed's pipe engine; the last line of each block is ours.
```
# 1 gpu: tp=1 pp=1
[2021-11-28 18:33:47,870] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=15 [0, 15) STAGE_PARAMS=51500032 (51.500M) TOTAL_PARAMS=51500032 (51.500M) UNIQUE_PARAMS=51500032 (51.500M)
estimated model parameters: 0.051500032
```
```
# 2 gpus: tp=2 pp=1
[2021-11-28 18:33:00,537] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=15 [0, 15) STAGE_PARAMS=26057728 (26.058M) TOTAL_PARAMS=52115456 (52.115M) UNIQUE_PARAMS=52115456 (52.115M)
[2021-11-28 18:33:00,537] [INFO] [engine.py:151:__init__] RANK=1 STAGE=0 LAYERS=15 [0, 15) STAGE_PARAMS=26057728 (26.058M) TOTAL_PARAMS=52115456 (52.115M) UNIQUE_PARAMS=52115456 (52.115M)
estimated model parameters: 0.052115456
```
```
# 2 gpus: tp=1 pp=2
[2021-11-28 18:31:10,687] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=7 [0, 7) STAGE_PARAMS=38889472 (38.889M) TOTAL_PARAMS=77779968 (77.780M) UNIQUE_PARAMS=51500032 (51.500M)
[2021-11-28 18:31:10,687] [INFO] [engine.py:151:__init__] RANK=1 STAGE=1 LAYERS=8 [7, 15) STAGE_PARAMS=38890496 (38.890M) TOTAL_PARAMS=77779968 (77.780M) UNIQUE_PARAMS=51500032 (51.500M)
estimated model parameters: 0.077778944
```
So for 1 gpu, and for 2 gpus with tp=2, all is good, but once pp>1 is involved it's wrong: it reports 0.077778944 when it should be 0.051500032. Note that the ~26M overcount is roughly one extra copy of the tied embedding weights, which live on both the first and last pipeline stages and get summed twice by the naive count.
Here is the code that we need to borrow to do it correctly:
https://github.com/microsoft/DeepSpeed/blob/7a132a9f4b37959f951b7c04a05207aba6054965/deepspeed/runtime/pipe/engine.py#L134-L157
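Roughly, that code does the following (a simplified paraphrase, not the exact DeepSpeed source, and it assumes a `tied_comms` mapping like the one DeepSpeed's `PipelineModule` builds, name -> {'ranks': [...], 'module': ...}; the link above is authoritative): each rank sums its local params, subtracts its copy of any tied params it doesn't own, and then all-reduces both numbers across the model-parallel group.

```python
import torch
import torch.distributed as dist

def count_params(module, tied_comms, global_rank, mp_group, device):
    """Sketch of total/unique param counting for one pipeline-stage rank."""
    num_params = sum(p.numel() for p in module.parameters())
    unique_params = num_params
    # Tied params (e.g. the shared embedding on first/last stage) are
    # "owned" by one rank per tie group; every other rank subtracts its copy.
    for _, tie in tied_comms.items():
        if global_rank != min(tie['ranks']):
            unique_params -= sum(p.numel() for p in tie['module'].parameters())
    # Sum both counts over the model-parallel group, so every rank ends up
    # with the global total and the deduplicated total.
    counts = torch.tensor([num_params, unique_params],
                          dtype=torch.long, device=device)
    dist.all_reduce(counts, group=mp_group)
    total_params, unique_params = counts.tolist()
    return total_params, unique_params
```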
@stas00 @TevenLeScao I assume the PP > 1 problem is related to the warning in get_parameters_in_billions?
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c9afebc620a06d3f3109b8ead4a9ab3a6b41cb8b/megatron/utils.py#L279
And I also suspect this is the last untied knot in #40, which was followed up by #99. Just trying to put the pieces together to add clarity!
Yes, and I pointed to the code that produces the correct data!
The current 20-50% over-reporting of model size (125B vs ~104B above is ~20% over; 77.8M vs 51.5M is ~51%) paints a very different picture from what is really happening.
Do you want to tackle that, @jaketae?
Yes, I'd love to take it. Will take a look at the DS reference code today. Thanks!
And to debug while you're working on it, you may do the same as I did here, that is, tweak:
```bash
N_GPUS=2
TP_SIZE=2
PP_SIZE=1
```
to 3 different setups and check that the deepspeed (uppercase) log matches ours.
@stas00 I'm trying to test the code, but haven't figured out the best way to go about it. Do you modify entries in run.sh and execute the script? I'm wondering how to best set up an environment where I can directly compare the output from DeepSpeed and Meg-DS side by side. Thank you!
Both outputs already show up in the same log file. Please see https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/204#issuecomment-981242703, where I extracted just that information from the rest of the logged output.
You just change the setup in the script, yes.
I don't know what run.sh is; I use the following script:
```bash
CHECKPOINT_PATH=checkpoints/gpt2
VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt
#DATA_PATH=data/meg-gpt2_text_document
DATA_PATH=data/meg-gpt2_oscar-combined_text_document
TENSORBOARD_PATH=output_dir/tensorboard
N_GPUS=2
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=16
TP_SIZE=2
PP_SIZE=1
NLAYERS=2
NHIDDEN=64
NHEADS=2
SEQ_LEN=1024
VOCAB_SIZE=50257
SAVE_INTERVAL=50
GPT_ARGS=" \
--num-layers $NLAYERS \
--hidden-size $NHIDDEN \
--num-attention-heads $NHEADS \
--seq-length $SEQ_LEN \
--max-position-embeddings $SEQ_LEN \
--micro-batch-size $MICRO_BATCH_SIZE \
--rampup-batch-size 2 2 1_000 \
--global-batch-size $GLOBAL_BATCH_SIZE \
--train-samples 100 \
--optimizer adam \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-8 \
--lr 1e-4 \
--lr-warmup-samples 5 \
--clip-grad 1.0 \
--weight-decay 1e-1 \
--fp16 \
--partition-activations \
--seed 42 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
"
OUTPUT_ARGS=" \
--exit-interval 200 \
--log-interval 10 \
--save-interval $SAVE_INTERVAL \
--eval-interval 100 \
--eval-iters 10 \
--checkpoint-activations \
"
DATA_ARGS=" \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--tensorboard-dir $TENSORBOARD_PATH \
--tensorboard-queue-size 5 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
"
ZERO_STAGE=1
config_json="./ds_config.json"
# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
cat <<EOT > $config_json
{
"train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
"train_batch_size": $GLOBAL_BATCH_SIZE,
"gradient_clipping": 1.0,
"zero_optimization": {
"stage": $ZERO_STAGE
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 12
},
"steps_per_print": 2000,
"wall_clock_breakdown": false
}
EOT
DEEPSPEED_ARGS=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--deepspeed-activation-checkpointing \
"
ALL_ARGS="$GPT_ARGS $OUTPUT_ARGS $DATA_ARGS $DEEPSPEED_ARGS"
# if you can't stand pt-1.9 launcher noise
export LOGLEVEL=WARNING
PYTHONPATH=/hf/Megatron-DeepSpeed-master
MASTER_ADDR=localhost
MASTER_PORT=6777
export LAUNCHER="deepspeed --num_gpus $N_GPUS --master_port $MASTER_PORT"
# export LAUNCHER="python -u -m torch.distributed.run \
# --nproc_per_node $N_GPUS \
# --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
# --rdzv_backend c10d \
# --max_restarts 0 \
# "
# export LAUNCHER="python -u -m torch.distributed.launch \
# --nproc_per_node $N_GPUS \
# --master_addr $MASTER_ADDR \
# --master_port $MASTER_PORT \
# "
export CMD=" \
env PYTHONPATH=$PYTHONPATH USE_TF=0 \
$LAUNCHER pretrain_gpt.py \
--tensor-model-parallel-size $TP_SIZE \
--pipeline-model-parallel-size $PP_SIZE \
--distributed-backend nccl \
$ALL_ARGS \
"
echo $CMD
#rm -rf $CHECKPOINT_PATH
$CMD
```
If you want to use it, you will need to adjust the paths, as they are hardcoded to my setup.
But the main point I'm trying to convey is that I just tweaked this section of the above script 3 times, as described in the comment linked above, and re-ran the script:
```bash
N_GPUS=2
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=16
TP_SIZE=2
PP_SIZE=1
```