[BERT/Pytorch] Same Results for run_glue.py
Related to Bert/Pytorch
Describe the bug In the latest version of run_glue.py (MRPC), I am trying to load a checkpoint produced by an older version of NVIDIA BERT (which used FusedAdam as the optimizer). The problem is that no matter which checkpoint I use (e.g. the ckpt from step 6000, step 36000, or step 525000), and no matter which batch size, learning rate, or seed I use, I get the same F1 score and the same exact match, namely [exact_match : 0.6838235294117647, F1 : 0.8122270742358079], and even an identical pytorch_model.bin.
However, when I use the pretrained BERT model (Bert-Base-Uncased-Pretrained) downloaded from NVIDIA's website, I get a different result (F1 is 89). This happens on every fine-tuning task, including SWAG, SQuAD and GLUE. A possible cause, I think, is the shape of the checkpoints: the word-embedding weight of the pretrained model from the website is [30528, 768], whereas mine is [30522, 768].
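For reference, this is roughly how the two shapes can be compared (a minimal sketch, not code from the repo; it assumes the pretraining checkpoint nests its weights under a "model" key and uses the bert.embeddings.word_embeddings.weight name, so adjust the path and key names if yours differ):

import torch

# Placeholder path -- point it at your own checkpoint.
ckpt = torch.load("ckpt_36000.pt", map_location="cpu")
# The pretraining checkpoints I have nest the weights under "model";
# fall back to the raw object if that key is absent.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

for name, tensor in state_dict.items():
    if "word_embeddings" in name:
        # Prints e.g. bert.embeddings.word_embeddings.weight (30522, 768)
        print(name, tuple(tensor.shape))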
In order to get my ckpt to load successfully, I added the following line to run_glue.py to force vocab_size back to 30522. (In the original code, since 30522 is not a multiple of 8, vocab_size is padded up to 30528.)
if config.vocab_size % 8 != 0:
    config.vocab_size += 8 - (config.vocab_size % 8)
config.vocab_size = 30522  # <-- the line I added
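An alternative that might avoid editing run_glue.py at all (a rough sketch only, not code from the repo; the "model" nesting and the key name are the same assumptions as above) would be to keep the padded vocab_size of 30528 and zero-pad the checkpoint's embedding rows before loading:

import torch
import torch.nn.functional as F

ckpt = torch.load("ckpt_36000.pt", map_location="cpu")  # placeholder path
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

pad_to = 30528  # 30522 rounded up to a multiple of 8
key = "bert.embeddings.word_embeddings.weight"  # assumed key name
if key in state_dict and state_dict[key].shape[0] < pad_to:
    w = state_dict[key]
    # Append all-zero rows for the padding token ids that never occur in data.
    state_dict[key] = F.pad(w, (0, 0, 0, pad_to - w.shape[0]))
# ...then load this padded state_dict into the model instead of the raw checkpoint.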
LogFile
/home/lcyx/.conda/envs/bert/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 3
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 2
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
06/09/2022 14:28:00 - INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
06/09/2022 14:28:00 - INFO - main - device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: True
06/09/2022 14:28:00 - WARNING - main - Output directory (/data/scratch/bert_pretrain/glue2) already exists and is not empty.
06/09/2022 14:28:00 - INFO - main - device: cuda:3 n_gpu: 1, distributed training: True, 16-bits training: True
06/09/2022 14:28:00 - INFO - main - device: cuda:1 n_gpu: 1, distributed training: True, 16-bits training: True
06/09/2022 14:28:00 - INFO - main - device: cuda:2 n_gpu: 1, distributed training: True, 16-bits training: True
DLL 2022-06-09 14:28:04.664127 - PARAMETER Config : ["Namespace(amp=False, bert_model='bert-base-uncased', config_file='/home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/bert_config.json', data_dir='/home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/MRPC', do_eval=True, do_lower_case=True, do_predict=False, do_train=True, eval_batch_size=16, fp16=True, gradient_accumulation_steps=1, init_checkpoint='/data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt', learning_rate=2.4e-05, local_rank=0, loss_scale=0, max_seq_length=128, max_steps=-1.0, no_cuda=False, num_train_epochs=3.0, output_dir='/data/scratch/bert_pretrain/glue2', seed=2, server_ip='', server_port='', skip_checkpoint=False, task_name='mrpc', train_batch_size=16, vocab_file='/home/lcyx/code/bert-base-uncased-vocab.txt', warmup_proportion=0.1)"]
DLL 2022-06-09 14:28:04.664463 - PARAMETER SEED : 2
06/09/2022 14:28:04 - INFO - main - Loaded pre-processed features from /home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/MRPC/bert-base-uncased_128_True
06/09/2022 14:28:04 - INFO - main - Loaded pre-processed features from /home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/MRPC/bert-base-uncased_128_True
06/09/2022 14:28:04 - INFO - main - Loaded pre-processed features from /home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/MRPC/bert-base-uncased_128_True
06/09/2022 14:28:04 - INFO - main - Loaded pre-processed features from /home/lcyx/Bert2/DeepLearningExamples/PyTorch/LanguageModeling/BERT/MRPC/bert-base-uncased_128_True
06/09/2022 14:28:05 - INFO - main - USING CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:05 - INFO - main - USING CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:05 - INFO - main - USING CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:06 - INFO - main - USING CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:06 - INFO - main - USED CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:06 - INFO - main - USED CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
06/09/2022 14:28:06 - INFO - main - USED CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
DLL 2022-06-09 14:28:06.523055 - PARAMETER num_parameters : 109483778
06/09/2022 14:28:06 - INFO - main - using fp16
06/09/2022 14:28:06 - INFO - main - ***** Running training *****
06/09/2022 14:28:06 - INFO - main - Num examples = 3668
06/09/2022 14:28:06 - INFO - main - Batch size = 16
06/09/2022 14:28:06 - INFO - main - Num steps = 171
06/09/2022 14:28:06 - INFO - main - using fp16
06/09/2022 14:28:06 - INFO - main - using fp16
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : False
master_weights : True
loss_scale : dynamic
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
06/09/2022 14:28:06 - INFO - main - ***** Running training *****
06/09/2022 14:28:06 - INFO - main - Num examples = 3668
06/09/2022 14:28:06 - INFO - main - Batch size = 16
06/09/2022 14:28:06 - INFO - main - Num steps = 171
06/09/2022 14:28:06 - INFO - main - ***** Running training *****
06/09/2022 14:28:06 - INFO - main - Num examples = 3668
06/09/2022 14:28:06 - INFO - main - Batch size = 16
06/09/2022 14:28:06 - INFO - main - Num steps = 171
06/09/2022 14:28:06 - INFO - main - USED CHECKPOINT from /data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
06/09/2022 14:28:06 - INFO - main - using fp16
06/09/2022 14:28:06 - INFO - main - ***** Running training *****
06/09/2022 14:28:06 - INFO - main - Num examples = 3668
06/09/2022 14:28:06 - INFO - main - Batch size = 16
06/09/2022 14:28:06 - INFO - main - Num steps = 171
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Iteration: 100%|██████████| 58/58 [00:03<00:00, 18.14it/s]
Iteration: 100%|██████████| 58/58 [00:03<00:00, 18.94it/s]
Iteration: 100%|██████████| 58/58 [00:03<00:00, 18.35it/s]
Iteration: 100%|██████████| 58/58 [00:03<00:00, 18.33it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.78it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.77it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.79it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.76it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.69it/s]
Epoch: 100%|██████████| 3/3 [00:07<00:00, 2.47s/it]4it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.69it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.71it/s]
Iteration: 100%|██████████| 58/58 [00:02<00:00, 26.69it/s]
Epoch: 100%|██████████| 3/3 [00:07<00:00, 2.51s/it]
Epoch: 100%|██████████| 3/3 [00:07<00:00, 2.50s/it]
Epoch: 100%|██████████| 3/3 [00:07<00:00, 2.50s/it]
06/09/2022 14:28:15 - INFO - main - ***** Running evaluation *****
06/09/2022 14:28:15 - INFO - main - Num examples = 408
06/09/2022 14:28:15 - INFO - main - Batch size = 16
Evaluating: 26it [00:00, 139.78it/s]
06/09/2022 14:28:15 - INFO - main - ***** Results *****
06/09/2022 14:28:15 - INFO - main - acc = 0.6838235294117647
06/09/2022 14:28:15 - INFO - main - acc_and_f1 = 0.7480253018237863
06/09/2022 14:28:15 - INFO - main - eval:loss = 0.6253539598905123
06/09/2022 14:28:15 - INFO - main - eval:num_samples_per_gpu = 408
06/09/2022 14:28:15 - INFO - main - eval:num_steps = 26
06/09/2022 14:28:15 - INFO - main - f1 = 0.8122270742358079
06/09/2022 14:28:15 - INFO - main - global_step = 174
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):100% = 7.602047920227051
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):50% = 6.606847763061523
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):90% = 6.9673919677734375
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):95% = 6.9813761711120605
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):99% = 7.1487040519714355
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):avg = 6.689874428969163
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):std = 0.27139754700072705
06/09/2022 14:28:15 - INFO - main - infer:latency(ms):sum = 173.93673515319824
06/09/2022 14:28:15 - INFO - main - infer:throughput(samples/s):avg = 2391.674189087197
06/09/2022 14:28:15 - INFO - main - train:latency = 7.502729503903538
06/09/2022 14:28:15 - INFO - main - train:loss = 0.6459138223837162
06/09/2022 14:28:15 - INFO - main - train:num_samples_per_gpu = 2751
06/09/2022 14:28:15 - INFO - main - train:num_steps = 58
06/09/2022 14:28:15 - INFO - main - train:throughput = 1466.666230506486
DLL 2022-06-09 14:28:15.217692 - exact_match : 0.6838235294117647
DLL 2022-06-09 14:28:15.217769 - F1 : 0.8122270742358079
DLL 2022-06-09 14:28:15.217797 - e2e_train_time : 7.502729503903538
DLL 2022-06-09 14:28:15.217821 - training_sequences_per_second : 1466.666230506486
DLL 2022-06-09 14:28:15.217843 - e2e_inference_time : 0.17393673515319824
DLL 2022-06-09 14:28:15.217863 - inference_sequences_per_second : 2391.674189087197
Environment (please provide at least):
- PyTorch: 1.10.1
- GPUs in the system: 8x NVIDIA A100
- CUDA: 11.3
This looks like an incorrect checkpoint load; a few keys are probably missing or mismatched.
Could you inspect the weight names in the model and in the checkpoint?
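Something along these lines should surface the missing or mismatched keys (a rough sketch only; adjust the "model" nesting to however your checkpoint is saved, and model below stands for the model instance that run_glue.py builds before loading the checkpoint):

import torch

ckpt = torch.load("/data/scratch/bert_pretrain/exp5/results/checkpoints/ckpt_36000.pt", map_location="cpu")
ckpt_sd = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
model_sd = model.state_dict()  # `model` = the instance created in run_glue.py

# Keys the model expects but the checkpoint does not provide, and vice versa.
print("missing in checkpoint  :", [k for k in model_sd if k not in ckpt_sd])
print("unexpected in checkpoint:", [k for k in ckpt_sd if k not in model_sd])
# Keys present in both but with different tensor shapes (e.g. 30522 vs 30528 rows).
print("shape mismatches       :",
      [(k, tuple(ckpt_sd[k].shape), tuple(model_sd[k].shape))
       for k in ckpt_sd if k in model_sd and ckpt_sd[k].shape != model_sd[k].shape])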