gramesh-amd

Results: 11 comments of gramesh-amd

@ZhiyuLi-goog Thanks again for your help with the other issues. Do you see any problems with the config, or do you know why the loss is so much higher?

With `attention: "dot_product"`:

```
completed step: 4000, seconds: 91.772, TFLOP/s/device: 24.021, Tokens/s/device: 22.316, total_weights: 65504, loss: 7.644, perplexity: 2088.295
To see full metrics 'tensorboard --logdir=/ckpts/paxml/gpt3-conversion/gpt3-conversion/tensorboard/'
completed step: 4001, seconds: 39.677, ...
```

I tested these out. First running

```
python3 MaxText/train.py MaxText/configs/base.yml run_name="${RUNNAME}" model_name=gpt3-175b
```

and then also adding the other relevant flags you posted one by one, and all of them...

Thanks. Here are the logs: [maxtext_gpt3_logs.txt](https://github.com/user-attachments/files/17023063/maxtext_gpt3_logs.txt)

Thanks for checking. Yeah, it's strange that it's starting with a bad loss. I also tried testing the tokenizer, and it seems fine.
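
For reference, the tokenizer test was just a round-trip sanity check along these lines (a minimal sketch, assuming a SentencePiece vocab; the vocab path below is a placeholder, not the exact file we used):

```python
# Minimal round-trip sanity check for the tokenizer.
# NOTE: the vocab path is a placeholder, not the actual file used in the run.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="path/to/vocab.model")  # placeholder

text = "The quick brown fox jumps over the lazy dog."
ids = sp.encode(text, out_type=int)
print("num tokens:", len(ids))
print("ids:", ids)
print("decoded:", sp.decode(ids))  # should reproduce the input text
```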

Tried weight_dtype as float32 as well; same problem. I'm wondering if we can send you our converted ckpt for you to load and verify whether it's a ckpt problem?
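
One cheap check we can run locally before shipping the checkpoint is to dump per-parameter statistics from the converted checkpoint and look for anything obviously off (NaNs, zeroed tensors, wrong scale). A minimal sketch, assuming the converted ckpt is a plain Orbax PyTree checkpoint and that the copy being inspected fits in host memory; the path is a placeholder:

```python
# Print basic statistics for every parameter in the converted checkpoint so a
# corrupted or mis-mapped tensor stands out before running a full train step.
# NOTE: CKPT_DIR is a placeholder path.
import numpy as np
import jax
import orbax.checkpoint as ocp

CKPT_DIR = "/path/to/converted_ckpt/0/items"  # placeholder

restored = ocp.PyTreeCheckpointer().restore(CKPT_DIR)

leaves_with_paths, _ = jax.tree_util.tree_flatten_with_path(restored)
for path, leaf in leaves_with_paths:
    arr = np.asarray(leaf)
    name = jax.tree_util.keystr(path)
    print(f"{name}: shape={arr.shape} dtype={arr.dtype} "
          f"mean={arr.mean():+.4e} std={arr.std():.4e} "
          f"has_nan={bool(np.isnan(arr).any())}")
```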

I'm not sure if it will be useful. We also loaded the pax ckpt directly in paxml, and the ckpt starts at the right loss. So at this point, we...

Great, we will share the converted ckpt and the conversion logs. Do you have a gcloud bucket that I could push it to, or do you recommend some other way?

OK, let me do that. We tried both versions, and with both we are getting the same problem.

We have created the bucket and will share access with you soon (I got your Google email from one of your [commits](https://github.com/mlcommons/training_results_v4.0/commit/62f111b7690f163b269f32e4f93dcaaa13717c9c)).
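
For the access part, the plan is just to add your account as a reader on the bucket, roughly like this (a minimal sketch using the google-cloud-storage client; the bucket name and email below are placeholders):

```python
# Grant a single user read access to the bucket holding the converted ckpt.
# NOTE: BUCKET_NAME and MEMBER are placeholders.
from google.cloud import storage

BUCKET_NAME = "our-gpt3-conversion-bucket"  # placeholder
MEMBER = "user:reviewer@example.com"        # placeholder email

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {MEMBER},
})
bucket.set_iam_policy(policy)
print(f"Granted roles/storage.objectViewer on gs://{BUCKET_NAME} to {MEMBER}")
```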