
[BERT/Pytorch] Same Results for run_glue.py

Open · Itok2000u opened this issue 3 years ago • 1 comment

Related to Bert/Pytorch

Describe the bug
With the latest version of run_glue.py (MRPC), I am trying to load a checkpoint produced by an earlier version of NVIDIA BERT (which used FusedAdam as the optimizer). The problem is that no matter which checkpoint I use (e.g. the checkpoints from step 6000, step 36000, or step 525000), and no matter which batch size, learning rate, or seed I use, I always get the same F1 score and the same exact match [exact_match: 0.6838235294117647, F1: 0.8122270742358079], and even the same pytorch_model.bin.

However, when I use the pretrained BERT model (Bert-Base-Uncased-Pretrained) downloaded from NVIDIA's website, I get a different result (F1 is 89). This happens on every fine-tuning task, including SWAG, SQuAD, and GLUE. I suspect the cause is the shape of the checkpoints: the word-embedding weight in the website's pretrained model has shape [30528, 768], whereas mine is [30522, 768].
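To confirm where the mismatch sits, one can print the vocab-sized weights stored in each checkpoint. A minimal sketch (the checkpoint path is a placeholder, and the "model" key is an assumption about how the pretraining script nests its state dict):

import torch

ckpt = torch.load("ckpt_36000.pt", map_location="cpu")  # placeholder path
# Assumption: the pretraining checkpoint nests its weights under "model";
# fall back to the raw dict if that key is absent.
state_dict = ckpt.get("model", ckpt)
for name, tensor in state_dict.items():
    if "word_embeddings" in name:
        print(name, tuple(tensor.shape))  # expect (30522, 768) or (30528, 768)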

To get my checkpoint to load successfully, I added a line to run_glue.py, shown below, to force vocab_size back to 30522. (In the original code, since 30522 is not a multiple of 8, vocab_size is padded up to 30528.)

# Original code: pad vocab_size up to the next multiple of 8 (30522 -> 30528).
if config.vocab_size % 8 != 0:
    config.vocab_size += 8 - (config.vocab_size % 8)
# My added line: force vocab_size back to 30522 so the old checkpoint loads.
config.vocab_size = 30522
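An alternative that keeps the padded, tensor-core-friendly shapes would be to zero-pad the checkpoint's vocab-sized weights from 30522 up to 30528 before loading, instead of shrinking the config. A minimal sketch, assuming the extra rows can safely be zeros (their token ids never occur in the data):

import torch.nn.functional as F

def pad_vocab_weights(state_dict, old_size=30522, padded_size=30528):
    # Heuristic: any tensor whose leading dimension equals the old vocab size
    # (the word embeddings, plus the pretraining head's decoder weight/bias
    # if present) gets zero rows appended to reach the padded size.
    for name, tensor in list(state_dict.items()):
        if tensor.dim() >= 1 and tensor.shape[0] == old_size:
            pad = [0, 0] * (tensor.dim() - 1) + [0, padded_size - old_size]
            state_dict[name] = F.pad(tensor, pad)
    return state_dict

Applied to the loaded checkpoint (e.g. pad_vocab_weights(ckpt["model"])) before model.load_state_dict, this would let the unmodified run_glue.py padding logic stand.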

LogFile:

/home/lcyx/.conda/envs/bert/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 3
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 2
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
06/09/2022 14:28:00 - INFO - main - device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: True
06/09/2022 14:28:00 - WARNING - main - Output directory (/data/scratch/bert_pretrain/glue2) already exists and is not empty.
06/09/2022 14:28:00 - INFO - main - device: cuda:3 n_gpu: 1, distributed training: True, 16-bits training: True
06/09/2022 14:28:00 - INFO - main - device: cuda:1 n_gpu: 1, distributed training: True, 16-bits training: True
06/09/2022 14:28:00 - INFO - main - device: cuda:2 n_gpu: 1, distributed training: True, 16-bits training: True
DLL 2022-06-09 14:28:04.664127 - PARAMETER Config : ["Namespace(amp=False, bert_model='bert-base-uncased', config_file='/home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/bert_config.json', data_dir='/home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/MRPC', do_eval=True, do_lower_case=True, do_predict=False, do_train=True, eval_batch_size=16, fp16=True, gradient_accumulation_steps=1, init_checkpoint='/data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt', learning_rate=2.4e-05, local_rank=0, loss_scale=0, max_seq_length=128, max_steps=-1.0, no_cuda=False, num_train_epochs=3.0, output_dir='/data/scratch/bert_pretrain/glue2', seed=2, server_ip='', server_port='', skip_checkpoint=False, task_name='mrpc', train_batch_size=16, vocab_file='/home/lcyx/code/bert-base-uncased-vocab.txt', warmup_proportion=0.1)"]
DLL 2022-06-09 14:28:04.664463 - PARAMETER SEED : 2
06/09/2022 14:28:04 - INFO - main - Loaded pre-processed features from /home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/MRPC/bert-base-uncased_128_True
06/09/2022 14:28:04 - INFO - main - Loaded pre-processed features from /home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/MRPC/bert-base-uncased_128_True
06/09/2022 14:28:04 - INFO - main - Loaded pre-processed features from /home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/MRPC/bert-base-uncased_128_True
06/09/2022 14:28:04 - INFO - main - Loaded pre-processed features from /home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/MRPC/bert-base-uncased_128_True
06/09/2022 14:28:05 - INFO - main - USING CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:05 - INFO - main - USING CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:05 - INFO - main - USING CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:06 - INFO - main - USING CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:06 - INFO - main - USED CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:06 - INFO - main - USED CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:06 - INFO - main - USED CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
DLL 2022-06-09 14:28:06.523055 - PARAMETER num_parameters : 109483778
06/09/2022 14:28:06 - INFO - main - using fp16
06/09/2022 14:28:06 - INFO - main - ***** Running training *****
06/09/2022 14:28:06 - INFO - main - Num examples = 3668
06/09/2022 14:28:06 - INFO - main - Batch size = 16
06/09/2022 14:28:06 - INFO - main - Num steps = 171
06/09/2022 14:28:06 - INFO - main - using fp16
06/09/2022 14:28:06 - INFO - main - using fp16
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled               : True
opt_level             : O2
cast_model_type       : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32   : True
master_weights        : True
loss_scale            : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled               : True
opt_level             : O2
cast_model_type       : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32   : False
master_weights        : True
loss_scale            : dynamic
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
06/09/2022 14:28:06 - INFO - main - ***** Running training *****
06/09/2022 14:28:06 - INFO - main - Num examples = 3668
06/09/2022 14:28:06 - INFO - main - Batch size = 16
06/09/2022 14:28:06 - INFO - main - Num steps = 171
06/09/2022 14:28:06 - INFO - main - ***** Running training *****
06/09/2022 14:28:06 - INFO - main - Num examples = 3668
06/09/2022 14:28:06 - INFO - main - Batch size = 16
06/09/2022 14:28:06 - INFO - main - Num steps = 171
06/09/2022 14:28:06 - INFO - main - USED CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
06/09/2022 14:28:06 - INFO - main - using fp16
06/09/2022 14:28:06 - INFO - main - ***** Running training *****
06/09/2022 14:28:06 - INFO - main - Num examples = 3668
06/09/2022 14:28:06 - INFO - main - Batch size = 16
06/09/2022 14:28:06 - INFO - main - Num steps = 171
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Iteration: 100%|██████████| 58/58 [00:03<00:00, 18.14it/s]
Iteration: 100%|██████████| 58/58 [00:03<00:00, 18.94it/s]
Iteration: 100%|██████████| 58/58 [00:03<00:00, 18.35it/s]
Iteration: 100%|██████████| 58/58 [00:03<00:00, 18.33it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.78it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.77it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.79it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.76it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.69it/s]
Epoch: 100%|██████████| 3/3 [00:07<00:00, 2.47s/it]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.69it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.71it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.69it/s]
Epoch: 100%|██████████| 3/3 [00:07<00:00, 2.51s/it]
Epoch: 100%|██████████| 3/3 [00:07<00:00, 2.50s/it]
Epoch: 100%|██████████| 3/3 [00:07<00:00, 2.50s/it]
06/09/2022 14:28:15 - INFO - main - ***** Running evaluation *****
06/09/2022 14:28:15 - INFO - main - Num examples = 408
06/09/2022 14:28:15 - INFO - main - Batch size = 16
Evaluating: 26it [00:00, 139.78it/s]
06/09/2022 14:28:15 - INFO - main - ***** Results *****
06/09/2022 14:28:15 - INFO - main - acc = 0.6838235294117647
06/09/2022 14:28:15 - INFO - main - acc_and_f1 = 0.7480253018237863
06/09/2022 14:28:15 - INFO - main - eval:loss = 0.6253539598905123
06/09/2022 14:28:15 - INFO - main - eval:num_samples_per_gpu = 408
06/09/2022 14:28:15 - INFO - main - eval:num_steps = 26
06/09/2022 14:28:15 - INFO - main - f1 = 0.8122270742358079
06/09/2022 14:28:15 - INFO - main - global_step = 174
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):100% = 7.602047920227051
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):50% = 6.606847763061523
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):90% = 6.9673919677734375
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):95% = 6.9813761711120605
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):99% = 7.1487040519714355
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):avg = 6.689874428969163
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):std = 0.27139754700072705
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):sum = 173.93673515319824
06/09/2022 14:28:15 - INFO - main - infer:throughput(samples/s):avg = 2391.674189087197
06/09/2022 14:28:15 - INFO - main - train:latency = 7.502729503903538
06/09/2022 14:28:15 - INFO - main - train:loss = 0.6459138223837162
06/09/2022 14:28:15 - INFO - main - train:num_samples_per_gpu = 2751
06/09/2022 14:28:15 - INFO - main - train:num_steps = 58
06/09/2022 14:28:15 - INFO - main - train:throughput = 1466.666230506486
DLL 2022-06-09 14:28:15.217692 - exact_match : 0.6838235294117647
DLL 2022-06-09 14:28:15.217769 - F1 : 0.8122270742358079
DLL 2022-06-09 14:28:15.217797 - e2e_train_time : 7.502729503903538
DLL 2022-06-09 14:28:15.217821 - training_sequences_per_second : 1466.666230506486
DLL 2022-06-09 14:28:15.217843 - e2e_inference_time : 0.17393673515319824
DLL 2022-06-09 14:28:15.217863 - inference_sequences_per_second : 2391.674189087197

Environment

  • PyTorch: 1.10.1
  • GPUs in the system: 8x NVIDIA A100
  • CUDA: 11.3

Itok2000u · Jun 10 '22 04:06

This looks like an incorrect checkpoint load; a few keys are probably missing or mismatched.

Could you inspect the weight names in the model and the checkpoint?
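For example, a quick diff of the two state dicts would surface missing keys and shape mismatches. A minimal sketch (the import path follows this repo's run_glue.py; the checkpoint's "model" key and the file paths are placeholders/assumptions):

import torch
from modeling import BertConfig, BertForSequenceClassification  # adjust if your version differs

# Build the classification model roughly the way run_glue.py does
# (paths and num_labels here are placeholders) and load the checkpoint on CPU.
config = BertConfig.from_json_file("bert_config.json")
model = BertForSequenceClassification(config, num_labels=2)
ckpt = torch.load("ckpt_36000.pt", map_location="cpu")
ckpt_sd = ckpt.get("model", ckpt)
model_sd = model.state_dict()

# Keys present on one side only, and shape mismatches on shared keys.
print("in checkpoint only:", sorted(ckpt_sd.keys() - model_sd.keys()))
print("in model only:", sorted(model_sd.keys() - ckpt_sd.keys()))
for name in ckpt_sd.keys() & model_sd.keys():
    if ckpt_sd[name].shape != model_sd[name].shape:
        print("shape mismatch:", name, tuple(ckpt_sd[name].shape), "vs", tuple(model_sd[name].shape))

# strict=False also reports exactly what load_state_dict would silently skip.
missing, unexpected = model.load_state_dict(ckpt_sd, strict=False)
print("missing:", missing, "unexpected:", unexpected)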

sharathts · Jun 27 '22 23:06