Hello, I ran into a problem

Open etoilestar opened this issue 2 years ago • 8 comments

Hello, when I run the script to train a GPT model, I hit an assertion error: "Not sure how to proceed, we were given deepspeed configs in the deepspeed arguments and deepspeed." The script I used is the one at https://github.com/bigscience-workshop/Megatron-DeepSpeed#deepspeed-pp-and-zero-dp. Can you tell me why this happens?

etoilestar avatar May 22 '23 09:05 etoilestar

Can you please share the assertion message and stack trace?

tjruwase avatar May 22 '23 16:05 tjruwase

Please try https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/run_bf16.sh or the equivalent run_fp16.sh

tjruwase avatar May 22 '23 16:05 tjruwase

OK, I will give it a try. Separately, I cannot find the BF16Optimizer mentioned at https://huggingface.co/blog/zh/bloom-megatron-deepspeed#bf16optimizer. Could you give me some pointers?

etoilestar avatar May 22 '23 17:05 etoilestar

https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/bf16_optimizer.py

tjruwase avatar May 22 '23 17:05 tjruwase
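For readers looking for how that class gets used: a minimal sketch, assuming BF16Optimizer is not constructed by hand but is chosen by DeepSpeed when bf16 is enabled in the DeepSpeed config (whether it is actually selected may also depend on other settings such as the ZeRO stage). The batch-size value below is a placeholder, not taken from this thread.

```python
import json

# Illustrative DeepSpeed config fragment that turns on bf16 training.
# "bf16": {"enabled": true} is the relevant switch; the batch size is a
# placeholder value chosen only to make the fragment complete.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
}

# Serialize it the way it would appear in a ds_config.json file.
ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```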

I ran into the same problem while following "start_fast.md". How can I solve it? Thank you!

hymie122 avatar Jun 06 '23 06:06 hymie122

Commenting out args=args (line 429) in megatron/training.py solves this problem:

model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model[0],
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    config=config,
    #args=args,
)

AoZhang avatar Jun 13 '23 09:06 AoZhang

deepspeed.initialize can't be given both config and args.deepspeed_config. You should remove one of them.

murphypei avatar Jun 27 '23 04:06 murphypei
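The conflict described above can be sketched without DeepSpeed installed. The function below is a hypothetical stand-in (resolve_deepspeed_config is not a DeepSpeed API, and the file name and batch size are made up) that mimics the rule: a config may come from args.deepspeed_config or from config=, but not from both.

```python
from types import SimpleNamespace


def resolve_deepspeed_config(args=None, config=None):
    """Hypothetical stand-in for the check; not DeepSpeed's actual code."""
    args_config = getattr(args, "deepspeed_config", None)
    if config is not None and args_config is not None:
        # Two sources of truth: refuse to guess which one wins.
        raise AssertionError(
            "config was given both via args.deepspeed_config and via config="
        )
    return config if config is not None else args_config


# Illustrative inputs (assumed values).
args = SimpleNamespace(deepspeed_config="ds_config.json")
config = {"train_batch_size": 8}

# Passing both sources triggers the same kind of failure as in this issue.
try:
    resolve_deepspeed_config(args=args, config=config)
    conflict = False
except AssertionError:
    conflict = True

# Either source on its own is accepted.
from_config = resolve_deepspeed_config(config=config)
from_args = resolve_deepspeed_config(args=args)
```

This is why dropping args=args from the call works: only config= remains as a source. Removing config= and keeping args.deepspeed_config should be the equivalent fix in the other direction.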

Commenting out args=args (line 429) in megatron/training.py solves this problem:

model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model[0],
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    config=config,
    #args=args,
)

jesus!!!!!!

divisionblur avatar May 16 '24 00:05 divisionblur