InternVL

Warning: tensor shapes don't match when fine-tuning the 2B tiny model

Open · khangnguyenhuu opened this issue 1 year ago · 1 comment

Hello, I tried to fine-tune this model on my custom dataset and got the warning below; the training loss is also 0.0 at some steps. I have two questions:

  • What does this warning actually mean, and how do I fix it?
  • Where does the 0.0 loss come from: my data or my training command?
{'loss': 0.0, 'learning_rate': 3.9622788631247045e-05, 'epoch': 0.13}

 13%|█▎        | 43/334 [27:16<3:02:22, 37.60s/it]514
warning: The size of tensor a (2017) must match the size of tensor b (3328) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([3328, 2048])
[2024-05-20 15:31:56,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1675.71 | bwd_microstep: 3747.74 | bwd_inner_microstep: 3747.72 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
1166
warning: The size of tensor a (2017) must match the size of tensor b (2304) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([2304, 2048])
[2024-05-20 15:32:01,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1361.25 | bwd_microstep: 3001.34 | bwd_inner_microstep: 3001.33 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.25
361
warning: The size of tensor a (2017) must match the size of tensor b (3328) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([3328, 2048])
[2024-05-20 15:32:06,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1676.71 | bwd_microstep: 3749.13 | bwd_inner_microstep: 3749.12 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
409
warning: The size of tensor a (2017) must match the size of tensor b (2304) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([2304, 2048])
[2024-05-20 15:32:11,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1360.70 | bwd_microstep: 2998.22 | bwd_inner_microstep: 2998.21 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
753
warning: The size of tensor a (2017) must match the size of tensor b (2304) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([2304, 2048])
[2024-05-20 15:32:15,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1363.06 | bwd_microstep: 2998.31 | bwd_inner_microstep: 2998.30 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
1823
warning: The size of tensor a (2017) must match the size of tensor b (3328) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([3328, 2048])
[2024-05-20 15:32:20,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1672.43 | bwd_microstep: 3747.11 | bwd_inner_microstep: 3747.09 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
1973
[2024-05-20 15:32:24,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1190.24 | bwd_microstep: 2601.76 | bwd_inner_microstep: 2601.75 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
889
warning: The size of tensor a (2017) must match the size of tensor b (2304) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([2304, 2048])
[2024-05-20 15:32:29,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 2.23 | optimizer_step: 3.52
[2024-05-20 15:32:29,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1358.78 | bwd_microstep: 3004.31 | bwd_inner_microstep: 3001.98 | bwd_allreduce_microstep: 2.29 | step_microstep: 49.58
[2024-05-20 15:32:29,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 11658.82 | bwd: 25847.93 | bwd_inner: 25845.50 | bwd_allreduce: 2.33 | step: 50.78

 13%|█▎        | 44/334 [27:53<3:01:55, 37.64s/it]
                                                  
{'loss': 0.0, 'learning_rate': 3.958425895241054e-05, 'epoch': 0.13}

 13%|█▎        | 44/334 [27:53<3:01:55, 37.64s/it]1794
warning: The size of tensor a (2017) must match the size of tensor b (3328) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([3328, 2048])
[2024-05-20 15:32:34,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1679.51 | bwd_microstep: 3745.75 | bwd_inner_microstep: 3745.74 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
2039
warning: The size of tensor a (2017) must match the size of tensor b (3328) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([3328, 2048])
[2024-05-20 15:32:39,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1676.08 | bwd_microstep: 3752.07 | bwd_inner_microstep: 3752.06 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.09
297
warning: The size of tensor a (2017) must match the size of tensor b (2304) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([2304, 2048])
[2024-05-20 15:32:44,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1363.66 | bwd_microstep: 2997.98 | bwd_inner_microstep: 2997.96 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
870
warning: The size of tensor a (2017) must match the size of tensor b (3328) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([3328, 2048])
[2024-05-20 15:32:49,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1679.09 | bwd_microstep: 3751.75 | bwd_inner_microstep: 3751.74 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
2660
warning: The size of tensor a (2017) must match the size of tensor b (3328) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([3328, 2048])
[2024-05-20 15:32:55,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1671.85 | bwd_microstep: 3739.99 | bwd_inner_microstep: 3739.98 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.10
2603
warning: The size of tensor a (2017) must match the size of tensor b (3328) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([3328, 2048])
[2024-05-20 15:33:00,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1680.99 | bwd_microstep: 3749.10 | bwd_inner_microstep: 3749.09 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.18
2514
warning: The size of tensor a (2017) must match the size of tensor b (2304) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([2304, 2048])
[2024-05-20 15:33:05,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1359.17 | bwd_microstep: 2998.35 | bwd_inner_microstep: 2998.34 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
1417
warning: The size of tensor a (2017) must match the size of tensor b (2304) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([2304, 2048])
[2024-05-20 15:33:09,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 2.33 | optimizer_step: 3.52
[2024-05-20 15:33:09,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1356.39 | bwd_microstep: 3005.79 | bwd_inner_microstep: 3003.42 | bwd_allreduce_microstep: 2.33 | step_microstep: 49.53
[2024-05-20 15:33:09,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 12466.69 | bwd: 27740.80 | bwd_inner: 27738.33 | bwd_allreduce: 2.37 | step: 50.60

 13%|█▎        | 45/334 [28:34<3:05:19, 38.48s/it]
                                                  
{'loss': 0.0, 'learning_rate': 3.954387660207733e-05, 'epoch': 0.13}

 13%|█▎        | 45/334 [28:34<3:05:19, 38.48s/it]408
warning: The size of tensor a (2017) must match the size of tensor b (2304) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([2304, 2048])
[2024-05-20 15:33:13,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1360.84 | bwd_microstep: 3002.02 | bwd_inner_microstep: 3002.01 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
1736
warning: The size of tensor a (2017) must match the size of tensor b (2304) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([2017, 2048]), vit_embeds.shape=torch.Size([2304, 2048])
[2024-05-20 15:33:18,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1362.05 | bwd_microstep: 2999.32 | bwd_inner_microstep: 2999.31 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.19
637
[2024-05-20 15:33:22,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1217.30 | bwd_microstep: 2690.86 | bwd_inner_microstep: 2690.85 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
1536
[2024-05-20 15:33:26,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1310.88 | bwd_microstep: 3023.22 | bwd_inner_microstep: 3023.20 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
1534
[2024-05-20 15:33:30,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1215.52 | bwd_microstep: 2680.89 | bwd_inner_microstep: 2680.87 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.11
1180
[2024-05-20 15:33:34,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1304.74 | bwd_microstep: 3028.13 | bwd_inner_microstep: 3028.12 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
1768
[2024-05-20 15:33:38,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1211.15 | bwd_microstep: 2712.15 | bwd_inner_microstep: 2712.14 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
1760
[2024-05-20 15:33:42,721] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152, reducing to 1048576
[2024-05-20 15:33:42,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1207.87 | bwd_microstep: 2690.65 | bwd_inner_microstep: 2688.30 | bwd_allreduce_microstep: 2.31 | step_microstep: 29.85
[2024-05-20 15:33:42,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 10190.26 | bwd: 22827.25 | bwd_inner: 22824.80 | bwd_allreduce: 2.35 | step: 30.99

 14%|█▍        | 46/334 [29:07<2:57:05, 36.89s/it]
                                                  
{'loss': 0.7845, 'learning_rate': 3.954387660207733e-05, 'epoch': 0.14}

 14%|█▍        | 46/334 [29:07<2:57:05, 36.89s/it]417
[2024-05-20 15:33:47,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1292.38 | bwd_microstep: 3003.39 | bwd_inner_microstep: 3003.37 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.11
1482
[2024-05-20 15:33:50,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1188.69 | bwd_microstep: 2717.73 | bwd_inner_microstep: 2717.72 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
1649
[2024-05-20 15:33:54,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1194.25 | bwd_microstep: 2728.94 | bwd_inner_microstep: 2728.93 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
1581
[2024-05-20 15:33:58,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1205.65 | bwd_microstep: 2757.53 | bwd_inner_microstep: 2757.52 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
1438
[2024-05-20 15:34:02,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1206.80 | bwd_microstep: 2758.65 | bwd_inner_microstep: 2758.64 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.10
1756
[2024-05-20 15:34:07,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1290.06 | bwd_microstep: 2989.33 | bwd_inner_microstep: 2989.32 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
1528
[2024-05-20 15:34:11,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1283.86 | bwd_microstep: 2972.24 | bwd_inner_microstep: 2972.23 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
1539
[2024-05-20 15:34:15,744] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576, reducing to 524288
[2024-05-20 15:34:15,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1284.04 | bwd_microstep: 2966.13 | bwd_inner_microstep: 2963.76 | bwd_allreduce_microstep: 2.33 | step_microstep: 29.50
[2024-05-20 15:34:15,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 9945.62 | bwd: 22893.96 | bwd_inner: 22891.49 | bwd_allreduce: 2.37 | step: 30.54

 14%|█▍        | 47/334 [29:40<2:50:55, 35.73s/it]
                                                  
{'loss': 0.8672, 'learning_rate': 3.954387660207733e-05, 'epoch': 0.14}

 14%|█▍        | 47/334 [29:40<2:50:55, 35.73s/it]481
[2024-05-20 15:34:19,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1187.01 | bwd_microstep: 2698.06 | bwd_inner_microstep: 2698.05 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.11
1718
[2024-05-20 15:34:23,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1198.16 | bwd_microstep: 2725.49 | bwd_inner_microstep: 2725.48 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
1584
[2024-05-20 15:34:27,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1281.38 | bwd_microstep: 2959.38 | bwd_inner_microstep: 2959.36 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
1608
[2024-05-20 15:34:31,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1195.85 | bwd_microstep: 2735.35 | bwd_inner_microstep: 2735.33 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
419
[2024-05-20 15:34:35,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1195.77 | bwd_microstep: 2737.19 | bwd_inner_microstep: 2737.18 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
602
[2024-05-20 15:34:39,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1191.89 | bwd_microstep: 2720.99 | bwd_inner_microstep: 2720.97 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
1425
[2024-05-20 15:34:43,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1271.59 | bwd_microstep: 2941.15 | bwd_inner_microstep: 2941.13 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.16
1471
[2024-05-20 15:34:47,784] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288, reducing to 262144
[2024-05-20 15:34:47,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1156.95 | bwd_microstep: 2657.62 | bwd_inner_microstep: 2655.15 | bwd_allreduce_microstep: 2.43 | step_microstep: 29.56
[2024-05-20 15:34:47,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 9678.49 | bwd: 22175.24 | bwd_inner: 22172.67 | bwd_allreduce: 2.47 | step: 30.69

 14%|█▍        | 48/334 [30:12<2:45:02, 34.63s/it]
                                                  
{'loss': 0.9806, 'learning_rate': 3.954387660207733e-05, 'epoch': 0.14}

My data

{"id": 0, "image": "OCR_New_Template_KFC_Result/2.jpg", "conversations": [{"from": "human", "value": "Picture 1: <image>\nextract data from this receipt, return only json without any additional text or Markdown formatting with the following keys:`time`: timestamp indicating the time of this receipt's issuance,`type`: designation of the vendor. Return blank if you unsure about any information"}, {"from": "gpt", "value": "{'time': '12:02', 'type': 'kfc'}"}]}
{"id": 1, "image": "OCR_New_Template_KFC_Result/2.jpg", "conversations": [{"from": "human", "value": "Picture 1: <image>\nextract data from this receipt, return only json without any additional text or Markdown formatting with the following keys:`products`: A list of objects containing the following keys:,`product_total_money`: total amount payable for the product in this invoice,`product_vat`: VAT of the product purchased in this invoice,`product_amount`: quantity of the product purchased in this invoice,`product_code`: code of the product purchased in this invoice,`product_unit_price`: price per unit of the product purchased in this invoice,`product_original_price`: initial amount payable for the product in this invoice (before any discounts or taxes),`product_total_original_money`: total initial amount payable for the product in this invoice (before any discounts or taxes),`product_discount_money`: discount amount for the product in this invoice,`product_discount_wholesale_money`: wholesale discount amount for the product,`product_discount_retail_money`: retail discount amount for the product,`product_name`: designation of the product purchased in this invoice,`type`: designation of the vendor,`name`: title of the shop where the receipt is generated,`tax_number`: tax identifier printed on this receipt,`date`: date of this receipt's issuance,`total_quantity`: total count of products in this invoice,`staff`: designation or ID of the staff involved in generating this invoice, as printed on this invoice. Return blank if you unsure about any information"}, {"from": "gpt", "value": "{'products': [{'product_code': '', 'product_name': 'Ga Que Kem A', 'product_vat': '', 'product_amount': '', 'product_unit_price': '', 'product_total_money': '86000', 'product_original_price': '', 'product_total_original_money': '', 'product_discount_money': '', 'product_discount_retail_money': '', 'product_discount_wholesale_money': ''}, {'product_code': '', 'product_name': 'Rice Skewer (2 @45,000.00)', 'product_vat': '', 'product_amount': '', 'product_unit_price': '', 'product_total_money': '90000', 'product_original_price': '', 'product_total_original_money': '', 'product_discount_money': '', 'product_discount_retail_money': '', 'product_discount_wholesale_money': ''}, {'product_code': '', 'product_name': 'Popcorn R', 'product_vat': '', 'product_amount': '', 'product_unit_price': '', 'product_total_money': '38000', 'product_original_price': '', 'product_total_original_money': '', 'product_discount_money': '', 'product_discount_retail_money': '', 'product_discount_wholesale_money': ''}, {'product_code': '', 'product_name': 'Add 3 Nuggets', 'product_vat': '', 'product_amount': '', 'product_unit_price': '', 'product_total_money': '22000', 'product_original_price': '', 'product_total_original_money': '', 'product_discount_money': '', 'product_discount_retail_money': '', 'product_discount_wholesale_money': ''}], 'type': 'kfc', 'name': 'DUONG NGUYEN THAI HOC', 'tax_number': '', 'date': '07/10/2023', 'total_quantity': '', 'staff': 'Ngo Thu HUYNH Y NHI'}"}]}
{"id": 2, "image": "OCR_New_Template_KFC_Result/2.jpg", "conversations": [{"from": "human", "value": "Picture 1: <image>\nextract data from this receipt, return only json without any additional text or Markdown formatting with the following keys:`date`: date export this receipt,`total_quantity`: total quantity of the product in this invoice,`tax_number`: tax number printed on this receipt. Return blank if you unsure about any information"}, {"from": "gpt", "value": "{'date': '07/10/2023', 'total_quantity': '', 'tax_number': ''}"}]}
{"id": 3, "image": "OCR_New_Template_KFC_Result/4.jpg", "conversations": [{"from": "human", "value": "Picture 1: <image>\nextract data from this receipt, return only json without any additional text or Markdown formatting with the following keys:\n`name`: title of the shop where the receipt is generated\n`staff`: designation or ID of the staff involved in generating this invoice, as printed on this invoice\n`type`: designation of the vendor\n`total_quantity`: total count of products in this invoice\nReturn blank if you unsure about any information"}, {"from": "gpt", "value": "{'name': 'KFC AEON TAN PHU', 'staff': 'TRAN PHAM NGOC NHI', 'type': 'kfc', 'total_quantity': ''}"}]}
{"id": 4, "image": "OCR_New_Template_KFC_Result/4.jpg", "conversations": [{"from": "human", "value": "Picture 1: <image>\nextract data from this receipt, return only json without any additional text or Markdown formatting with the following keys:`total_quantity`: total quantity of products in this invoice,`staff`: name or ID of the staff exporting this invoice, printed in this invoice,`time`: time of exporting this receipt, in timestamp format,`products`: A list of objects containing the following keys:,`product_total_money`: total amount to be paid for the product in this invoice,`product_name`: name of the product purchased in this invoice,`product_original_price`: original amount to be paid for the product in this invoice (before any discounts or taxes),`product_amount`: amount of the product purchased in this invoice,`product_unit_price`: price per unit of the product purchased in this invoice,`product_total_original_money`: total original amount to be paid for the product in this invoice (before any discounts or taxes),`date`: date of exporting this receipt,`total_money`: total amount in this invoice. Return blank if you unsure about any information"}, {"from": "gpt", "value": "{'total_quantity': '', 'staff': 'TRAN PHAM NGOC NHI', 'time': '10:41', 'products': [{'product_name': 'Lunch-R.Flava', 'product_amount': '', 'product_unit_price': '', 'product_total_money': '39000', 'product_original_price': '', 'product_total_original_money': ''}, {'product_name': 'Marker Lunch-R.Flava', 'product_amount': '', 'product_unit_price': '', 'product_total_money': '5000', 'product_original_price': '', 'product_total_original_money': ''}, {'product_name': '3 Pumcheese', 'product_amount': '', 'product_unit_price': '', 'product_total_money': '38000', 'product_original_price': '', 'product_total_original_money': ''}], 'date': '09/10/2023', 'total_money': '82000'}"}]}
{"id": 5, "image": "OCR_New_Template_KFC_Result/4.jpg", "conversations": [{"from": "human", "value": "Picture 1: <image>\nextract data from this receipt, return only json without any additional text or Markdown formatting with the following keys:`type`: designation of the vendor,`tax_number`: tax identifier printed on this receipt,`products`: A list of objects containing the following keys:,`product_unit_price`: price per unit of the product purchased in this invoice,`product_original_price`: initial amount payable for the product in this invoice (before any discounts or taxes),`product_amount`: quantity of the product purchased in this invoice,`product_discount_retail_money`: retail discount amount for the product,`product_total_original_money`: total initial amount payable for the product in this invoice (before any discounts or taxes),`product_code`: code of the product purchased in this invoice,`product_discount_price`: price after applying discount for the product,`product_name`: designation of the product purchased in this invoice,`product_total_money`: total amount payable for the product in this invoice,`product_vat`: VAT of the product purchased in this invoice,`product_discount_wholesale_money`: wholesale discount amount for the product,`product_discount_money`: discount amount for the product in this invoice,`total_money`: aggregate sum of money in this invoice,`date`: date of this receipt's issuance,`name`: title of the shop where the receipt is generated,`time`: timestamp indicating the time of this receipt's issuance,`staff`: designation or ID of the staff involved in generating this invoice, as printed on this invoice,`receipt_number`: unique identifier of this receipt. Return blank if you unsure about any information"}, {"from": "gpt", "value": "{'type': 'kfc', 'tax_number': '', 'products': [{'product_code': '', 'product_name': 'Lunch-R.Flava', 'product_vat': '', 'product_amount': '', 'product_unit_price': '', 'product_total_money': '39000', 'product_original_price': '', 'product_total_original_money': '', 'product_discount_money': '', 'product_discount_price': '', 'product_discount_retail_money': '', 'product_discount_wholesale_money': ''}, {'product_code': '', 'product_name': 'Marker Lunch-R.Flava', 'product_vat': '', 'product_amount': '', 'product_unit_price': '', 'product_total_money': '5000', 'product_original_price': '', 'product_total_original_money': '', 'product_discount_money': '', 'product_discount_price': '', 'product_discount_retail_money': '', 'product_discount_wholesale_money': ''}, {'product_code': '', 'product_name': '3 Pumcheese', 'product_vat': '', 'product_amount': '', 'product_unit_price': '', 'product_total_money': '38000', 'product_original_price': '', 'product_total_original_money': '', 'product_discount_money': '', 'product_discount_price': '', 'product_discount_retail_money': '', 'product_discount_wholesale_money': ''}], 'total_money': '82000', 'date': '09/10/2023', 'name': 'KFC AEON TAN PHU', 'time': '10:41', 'staff': 'TRAN PHAM NGOC NHI', 'receipt_number': ''}"}]}
.
.
.

My training command

OUTPUT_DIR="output"
deepspeed internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "OpenGVLab/Mini-InternVL-Chat-2B-V1-5" \
  --conv_style "internlm2-chat" \
  --output_dir "${OUTPUT_DIR}" \
  --meta_path "playground/train_data_internvl/meta_train.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 12 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.1 \
  --pad2square False \
  --freeze_llm True \
  --freeze_mlp True \
  --freeze_backbone True \
  --use_backbone_lora 16 \
  --use_llm_lora 16 \
  --vision_select_layer -1 \
  --use_data_resampling False \
  --dataloader_num_workers 1 \
  --fp16 True \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 3 \
  --learning_rate 1e-5 \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 2048 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard" \
  2>&1 | tee -a "${OUTPUT_DIR}/training_log.txt"

Thanks for looking into my issue.

khangnguyenhuu · May 20 '24

It seems my max_seq_length doesn't fit the tokens created by the vision encoder plus the tokens from the input prompt (vision-encoder tokens + prompt tokens > max_seq_length). I reduced max_dynamic_patch to 6 and increased max_seq_length to 3072, and the loss is now computed normally.
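
For reference, the sizes in the warning match this tiling arithmetic. Below is a minimal back-of-the-envelope sketch, assuming the InternVL-1.5 defaults (448-pixel tiles, ViT patch size 14, down_sample_ratio 0.5, i.e. 256 tokens per tile); the helper is illustrative, not code from the repo:

# Assumed defaults: 448x448 tiles, ViT patch size 14, pixel-shuffle ratio 0.5.
def image_tokens(num_tiles, image_size=448, patch_size=14, down_sample_ratio=0.5):
    """Embeddings the vision encoder emits for num_tiles tiles."""
    per_tile = int((image_size // patch_size) ** 2 * down_sample_ratio ** 2)  # 256
    return num_tiles * per_tile

print(image_tokens(13))  # 3328: max_dynamic_patch=12 tiles plus 1 thumbnail tile
print(image_tokens(9))   # 2304: a 9-tile image, the other size in the warning

# With max_seq_length=2048 the image-placeholder tokens get truncated, leaving
# only ~2017 slots (2048 minus the surrounding text tokens), so 3328 vision
# embeddings cannot be scattered into 2017 positions; the 0.0 loss is likely
# because the answer tokens themselves were truncated away as well.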

[2024-05-22 10:13:18,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 501.41 | bwd_microstep: 1140.38 | bwd_inner_microstep: 1140.36 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.07
580
[2024-05-22 10:13:19,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 500.66 | bwd_microstep: 1138.88 | bwd_inner_microstep: 1138.87 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
3606
[2024-05-22 10:13:21,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.03 | bwd_microstep: 1133.16 | bwd_inner_microstep: 1133.14 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.06
5582
[2024-05-22 10:13:23,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 504.22 | bwd_microstep: 1140.17 | bwd_inner_microstep: 1140.16 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
375
[2024-05-22 10:13:24,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 490.07 | bwd_microstep: 1100.55 | bwd_inner_microstep: 1100.54 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
8520
[2024-05-22 10:13:26,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 491.61 | bwd_microstep: 1104.29 | bwd_inner_microstep: 1104.28 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
5102
[2024-05-22 10:13:28,039] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
[2024-05-22 10:13:28,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 500.44 | bwd_microstep: 1130.93 | bwd_inner_microstep: 1128.07 | bwd_allreduce_microstep: 2.81 | step_microstep: 30.67
[2024-05-22 10:13:28,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3989.44 | bwd: 9025.53 | bwd_inner: 9022.58 | bwd_allreduce: 2.86 | step: 31.53
{'loss': 0.9967, 'learning_rate': 1.7364011282313732e-05, 'epoch': 0.79}
 26%|██▋       | 1240/4686 [8:26:19<12:54:36, 13.49s/it]12245
[2024-05-22 10:13:29,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 495.85 | bwd_microstep: 1138.39 | bwd_inner_microstep: 1138.37 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
3591
[2024-05-22 10:13:31,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 500.84 | bwd_microstep: 1138.85 | bwd_inner_microstep: 1138.84 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
2336
[2024-05-22 10:13:33,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 498.36 | bwd_microstep: 1136.27 | bwd_inner_microstep: 1136.26 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
6322
[2024-05-22 10:13:34,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 504.01 | bwd_microstep: 1135.85 | bwd_inner_microstep: 1135.84 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
4730
[2024-05-22 10:13:36,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 482.56 | bwd_microstep: 1080.04 | bwd_inner_microstep: 1080.03 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
4810
[2024-05-22 10:13:37,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 485.65 | bwd_microstep: 1084.45 | bwd_inner_microstep: 1084.44 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
6046
[2024-05-22 10:13:39,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.56 | bwd_microstep: 1075.33 | bwd_inner_microstep: 1075.32 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
12008
[2024-05-22 10:13:41,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.02 | optimizer_gradients: 3.30 | optimizer_step: 5.63
[2024-05-22 10:13:41,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 485.32 | bwd_microstep: 1075.22 | bwd_inner_microstep: 1072.26 | bwd_allreduce_microstep: 2.91 | step_microstep: 54.33
[2024-05-22 10:13:41,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3934.02 | bwd: 8864.41 | bwd_inner: 8861.35 | bwd_allreduce: 2.96 | step: 55.35
{'loss': 0.1394, 'learning_rate': 1.7359333111243458e-05, 'epoch': 0.79}
 26%|██▋       | 1241/4686 [8:26:32<12:46:02, 13.34s/it]11392
[2024-05-22 10:13:42,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.43 | bwd_microstep: 1083.70 | bwd_inner_microstep: 1083.68 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
3426
[2024-05-22 10:13:44,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 484.28 | bwd_microstep: 1086.42 | bwd_inner_microstep: 1086.41 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
2626
[2024-05-22 10:13:45,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 480.38 | bwd_microstep: 1064.77 | bwd_inner_microstep: 1064.76 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
11991
[2024-05-22 10:13:47,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 484.58 | bwd_microstep: 1082.86 | bwd_inner_microstep: 1082.85 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
328
[2024-05-22 10:13:48,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 482.75 | bwd_microstep: 1076.80 | bwd_inner_microstep: 1076.79 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
11936
[2024-05-22 10:13:50,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 504.67 | bwd_microstep: 1124.15 | bwd_inner_microstep: 1124.14 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.10
4870
[2024-05-22 10:13:52,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.43 | bwd_microstep: 1140.04 | bwd_inner_microstep: 1140.03 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
4432
[2024-05-22 10:13:53,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 5.63
[2024-05-22 10:13:53,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.50 | bwd_microstep: 1138.18 | bwd_inner_microstep: 1135.14 | bwd_allreduce_microstep: 2.99 | step_microstep: 54.69
[2024-05-22 10:13:53,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3920.90 | bwd: 8796.93 | bwd_inner: 8793.79 | bwd_allreduce: 3.03 | step: 55.69
{'loss': 0.1607, 'learning_rate': 1.735465142399873e-05, 'epoch': 0.79}
 27%|██▋       | 1242/4686 [8:26:45<12:38:35, 13.22s/it]12221
[2024-05-22 10:13:55,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 501.04 | bwd_microstep: 1126.24 | bwd_inner_microstep: 1126.23 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
5627
[2024-05-22 10:13:57,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 505.20 | bwd_microstep: 1121.28 | bwd_inner_microstep: 1121.26 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
4843
[2024-05-22 10:13:58,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 499.69 | bwd_microstep: 1124.76 | bwd_inner_microstep: 1124.74 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
6141
[2024-05-22 10:14:00,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 501.66 | bwd_microstep: 1120.81 | bwd_inner_microstep: 1120.80 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
6254
[2024-05-22 10:14:02,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 503.34 | bwd_microstep: 1119.73 | bwd_inner_microstep: 1119.72 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
4993
[2024-05-22 10:14:03,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 504.51 | bwd_microstep: 1136.40 | bwd_inner_microstep: 1136.38 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
6432
[2024-05-22 10:14:05,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.29 | bwd_microstep: 1140.95 | bwd_inner_microstep: 1140.94 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
6425
[2024-05-22 10:14:07,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 5.64
[2024-05-22 10:14:07,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.16 | bwd_microstep: 1133.68 | bwd_inner_microstep: 1130.76 | bwd_allreduce_microstep: 2.87 | step_microstep: 53.93
[2024-05-22 10:14:07,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4019.79 | bwd: 9023.84 | bwd_inner: 9020.83 | bwd_allreduce: 2.91 | step: 54.99
{'loss': 0.0108, 'learning_rate': 1.734996622281639e-05, 'epoch': 0.8}
 27%|██▋       | 1243/4686 [8:26:58<12:38:50, 13.22s/it]5928
[2024-05-22 10:14:08,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 500.11 | bwd_microstep: 1121.37 | bwd_inner_microstep: 1121.36 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
1927
[2024-05-22 10:14:10,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 501.12 | bwd_microstep: 1119.33 | bwd_inner_microstep: 1119.32 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
12456
[2024-05-22 10:14:12,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 489.86 | bwd_microstep: 1104.08 | bwd_inner_microstep: 1104.07 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.08
12458
[2024-05-22 10:14:13,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.45 | bwd_microstep: 1102.53 | bwd_inner_microstep: 1102.51 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
3916
[2024-05-22 10:14:15,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 490.27 | bwd_microstep: 1101.87 | bwd_inner_microstep: 1101.85 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
5683
[2024-05-22 10:14:16,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.44 | bwd_microstep: 1098.80 | bwd_inner_microstep: 1098.79 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.10
6015
[2024-05-22 10:14:18,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 489.17 | bwd_microstep: 1077.52 | bwd_inner_microstep: 1077.51 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.09
3663
[2024-05-22 10:14:20,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 5.63
[2024-05-22 10:14:20,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.50 | bwd_microstep: 1082.86 | bwd_inner_microstep: 1079.92 | bwd_allreduce_microstep: 2.90 | step_microstep: 53.97
[2024-05-22 10:14:20,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3935.80 | bwd: 8808.38 | bwd_inner: 8805.33 | bwd_allreduce: 2.94 | step: 54.87
{'loss': 0.9496, 'learning_rate': 1.734527750993495e-05, 'epoch': 0.8}
 27%|██▋       | 1244/4686 [8:27:11<12:33:44, 13.14s/it]4227
[2024-05-22 10:14:21,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 485.66 | bwd_microstep: 1095.62 | bwd_inner_microstep: 1095.61 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
4100
[2024-05-22 10:14:23,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 499.37 | bwd_microstep: 1141.14 | bwd_inner_microstep: 1141.12 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
3675
[2024-05-22 10:14:25,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 499.07 | bwd_microstep: 1132.15 | bwd_inner_microstep: 1132.13 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
5441
[2024-05-22 10:14:26,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 501.52 | bwd_microstep: 1142.54 | bwd_inner_microstep: 1142.53 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
3636
[2024-05-22 10:14:28,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.79 | bwd_microstep: 1127.50 | bwd_inner_microstep: 1127.48 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
5402
[2024-05-22 10:14:29,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.47 | bwd_microstep: 1093.74 | bwd_inner_microstep: 1093.72 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
3105
[2024-05-22 10:14:31,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.21 | bwd_microstep: 1101.28 | bwd_inner_microstep: 1101.27 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
5599
[2024-05-22 10:14:33,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.28 | optimizer_step: 5.63
[2024-05-22 10:14:33,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.39 | bwd_microstep: 1105.71 | bwd_inner_microstep: 1102.77 | bwd_allreduce_microstep: 2.90 | step_microstep: 54.21
[2024-05-22 10:14:33,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3953.35 | bwd: 8939.68 | bwd_inner: 8936.64 | bwd_allreduce: 2.94 | step: 55.20
{'loss': 0.3419, 'learning_rate': 1.7340585287594605e-05, 'epoch': 0.8}
 27%|██▋       | 1245/4686 [8:27:24<12:32:50, 13.13s/it]4745
[2024-05-22 10:14:34,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 484.26 | bwd_microstep: 1100.60 | bwd_inner_microstep: 1100.59 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
4420
[2024-05-22 10:14:36,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.64 | bwd_microstep: 1135.23 | bwd_inner_microstep: 1135.21 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
4920
[2024-05-22 10:14:38,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 498.51 | bwd_microstep: 1132.17 | bwd_inner_microstep: 1132.15 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.11
3772
[2024-05-22 10:14:39,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 500.09 | bwd_microstep: 1126.12 | bwd_inner_microstep: 1126.11 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
531
[2024-05-22 10:14:41,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 499.65 | bwd_microstep: 1136.88 | bwd_inner_microstep: 1136.86 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
4743
[2024-05-22 10:14:43,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 503.40 | bwd_microstep: 1136.14 | bwd_inner_microstep: 1136.13 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.11
3891
[2024-05-22 10:14:44,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 500.35 | bwd_microstep: 1141.60 | bwd_inner_microstep: 1141.58 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
4614
[2024-05-22 10:14:46,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 5.61
[2024-05-22 10:14:46,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.20 | bwd_microstep: 1138.37 | bwd_inner_microstep: 1135.44 | bwd_allreduce_microstep: 2.90 | step_microstep: 54.06
[2024-05-22 10:14:46,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3990.98 | bwd: 9047.11 | bwd_inner: 9044.07 | bwd_allreduce: 2.94 | step: 55.06
{'loss': 0.5709, 'learning_rate': 1.7335889558037223e-05, 'epoch': 0.8}
 27%|██▋       | 1246/4686 [8:27:37<12:34:33, 13.16s/it]4543
[2024-05-22 10:14:48,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 499.51 | bwd_microstep: 1140.15 | bwd_inner_microstep: 1140.14 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.16
12343
[2024-05-22 10:14:49,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.74 | bwd_microstep: 1137.47 | bwd_inner_microstep: 1137.46 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.10
11169
[2024-05-22 10:14:51,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 498.90 | bwd_microstep: 1135.13 | bwd_inner_microstep: 1135.11 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.09
11851
[2024-05-22 10:14:53,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.34 | bwd_microstep: 1138.09 | bwd_inner_microstep: 1138.07 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
4824
[2024-05-22 10:14:54,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.04 | bwd_microstep: 1138.23 | bwd_inner_microstep: 1138.22 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.10
11879
[2024-05-22 10:14:56,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 499.29 | bwd_microstep: 1133.76 | bwd_inner_microstep: 1133.75 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
2106
[2024-05-22 10:14:58,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.44 | bwd_microstep: 1145.62 | bwd_inner_microstep: 1145.61 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
4907
[2024-05-22 10:14:59,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.27 | optimizer_step: 5.62
[2024-05-22 10:14:59,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 497.34 | bwd_microstep: 1139.85 | bwd_inner_microstep: 1136.90 | bwd_allreduce_microstep: 2.90 | step_microstep: 54.19
[2024-05-22 10:14:59,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4004.48 | bwd: 9108.30 | bwd_inner: 9105.25 | bwd_allreduce: 2.95 | step: 55.14
{'loss': 1.0622, 'learning_rate': 1.7331190323506352e-05, 'epoch': 0.8}
 27%|██▋       | 1247/4686 [8:27:51<12:36:56, 13.21s/it]4099
[2024-05-22 10:15:01,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 497.86 | bwd_microstep: 1127.72 | bwd_inner_microstep: 1127.71 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.10
5846
[2024-05-22 10:15:03,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 500.94 | bwd_microstep: 1135.31 | bwd_inner_microstep: 1135.30 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.11
5847
[2024-05-22 10:15:04,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.23 | bwd_microstep: 1139.22 | bwd_inner_microstep: 1139.20 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
4615
[2024-05-22 10:15:06,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 489.46 | bwd_microstep: 1090.66 | bwd_inner_microstep: 1090.65 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
4444
[2024-05-22 10:15:07,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.10 | bwd_microstep: 1095.18 | bwd_inner_microstep: 1095.16 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.10
3998
[2024-05-22 10:15:09,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 491.24 | bwd_microstep: 1098.76 | bwd_inner_microstep: 1098.74 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
6995
[2024-05-22 10:15:11,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 489.56 | bwd_microstep: 1088.61 | bwd_inner_microstep: 1088.59 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
5747
[2024-05-22 10:15:12,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.29 | optimizer_step: 5.63
[2024-05-22 10:15:12,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.58 | bwd_microstep: 1087.64 | bwd_inner_microstep: 1084.70 | bwd_allreduce_microstep: 2.90 | step_microstep: 53.92
[2024-05-22 10:15:12,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3941.85 | bwd: 8863.10 | bwd_inner: 8860.06 | bwd_allreduce: 2.93 | step: 54.87
{'loss': 1.0883, 'learning_rate': 1.7326487586247212e-05, 'epoch': 0.8}
 27%|██▋       | 1248/4686 [8:28:04<12:33:18, 13.15s/it]6256
[2024-05-22 10:15:14,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 485.75 | bwd_microstep: 1095.03 | bwd_inner_microstep: 1095.01 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
5650
[2024-05-22 10:15:16,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 503.16 | bwd_microstep: 1142.83 | bwd_inner_microstep: 1142.81 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
10011
[2024-05-22 10:15:17,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 501.14 | bwd_microstep: 1138.12 | bwd_inner_microstep: 1138.11 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.09
9948
[2024-05-22 10:15:19,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 498.01 | bwd_microstep: 1136.95 | bwd_inner_microstep: 1136.94 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
9925
[2024-05-22 10:15:21,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 506.23 | bwd_microstep: 1113.83 | bwd_inner_microstep: 1113.82 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
10019
[2024-05-22 10:15:22,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 500.48 | bwd_microstep: 1116.01 | bwd_inner_microstep: 1116.00 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
10031
[2024-05-22 10:15:24,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 499.47 | bwd_microstep: 1139.35 | bwd_inner_microstep: 1139.33 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
5001
[2024-05-22 10:15:26,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 5.62
[2024-05-22 10:15:26,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 502.56 | bwd_microstep: 1137.87 | bwd_inner_microstep: 1134.97 | bwd_allreduce_microstep: 2.86 | step_microstep: 54.18
[2024-05-22 10:15:26,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3996.67 | bwd: 9019.99 | bwd_inner: 9016.98 | bwd_allreduce: 2.90 | step: 55.20
{'loss': 0.4941, 'learning_rate': 1.7321781348506697e-05, 'epoch': 0.8}
 27%|██▋       | 1249/4686 [8:28:17<12:34:22, 13.17s/it]9240
[2024-05-22 10:15:27,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 475.52 | bwd_microstep: 1083.65 | bwd_inner_microstep: 1083.63 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.11
7481
[2024-05-22 10:15:29,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 482.05 | bwd_microstep: 1075.92 | bwd_inner_microstep: 1075.90 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.07
9105
[2024-05-22 10:15:30,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.83 | bwd_microstep: 1075.57 | bwd_inner_microstep: 1075.55 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.09
2789
[2024-05-22 10:15:31,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.78 | bwd_microstep: 498.09 | bwd_inner_microstep: 498.08 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.05
913
[2024-05-22 10:15:32,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.47 | bwd_microstep: 393.31 | bwd_inner_microstep: 393.30 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.05
2090
[2024-05-22 10:15:32,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.59 | bwd_microstep: 386.20 | bwd_inner_microstep: 386.19 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.07
2903
[2024-05-22 10:15:33,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.38 | bwd_microstep: 391.45 | bwd_inner_microstep: 391.44 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.09
910
[2024-05-22 10:15:33,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.31 | optimizer_step: 5.63
[2024-05-22 10:15:33,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.69 | bwd_microstep: 391.97 | bwd_inner_microstep: 389.28 | bwd_allreduce_microstep: 2.66 | step_microstep: 54.02
[2024-05-22 10:15:33,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2361.17 | bwd: 5296.18 | bwd_inner: 5293.38 | bwd_allreduce: 2.69 | step: 54.67
{'loss': 1.445, 'learning_rate': 1.731707161253338e-05, 'epoch': 0.8}
 27%|██▋       | 1250/4686 [8:28:25<11:03:00, 11.58s/it]8862
[2024-05-22 10:15:38,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1381.03 | bwd_microstep: 2974.36 | bwd_inner_microstep: 2974.35 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.12
10083
[2024-05-22 10:15:43,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1615.02 | bwd_microstep: 3582.98 | bwd_inner_microstep: 3582.96 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
8026
[2024-05-22 10:15:49,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1737.33 | bwd_microstep: 3963.87 | bwd_inner_microstep: 3963.86 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.22
2450
[2024-05-22 10:15:54,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1577.50 | bwd_microstep: 3520.80 | bwd_inner_microstep: 3520.78 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.24
1279
[2024-05-22 10:15:59,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1660.93 | bwd_microstep: 3797.46 | bwd_inner_microstep: 3797.45 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.22
7606
[2024-05-22 10:16:05,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1647.96 | bwd_microstep: 3782.63 | bwd_inner_microstep: 3782.62 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.22
1485
[2024-05-22 10:16:09,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1439.31 | bwd_microstep: 3205.39 | bwd_inner_microstep: 3205.37 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
6734
[2024-05-22 10:16:14,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.26 | optimizer_step: 5.63
[2024-05-22 10:16:14,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1437.89 | bwd_microstep: 3210.28 | bwd_inner_microstep: 3207.17 | bwd_allreduce_microstep: 3.07 | step_microstep: 54.15
[2024-05-22 10:16:14,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 12496.85 | bwd: 28037.77 | bwd_inner: 28034.56 | bwd_allreduce: 3.11 | step: 55.65
{'loss': 0.2177, 'learning_rate': 1.7312358380577493e-05, 'epoch': 0.8}
 27%|██▋       | 1251/4686 [8:29:05<19:23:51, 20.33s/it]1459
[2024-05-22 10:16:19,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1602.96 | bwd_microstep: 3648.24 | bwd_inner_microstep: 3648.23 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
10836
[2024-05-22 10:16:25,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1574.49 | bwd_microstep: 3599.92 | bwd_inner_microstep: 3599.90 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.24
6890
[2024-05-22 10:16:30,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1536.99 | bwd_microstep: 3501.37 | bwd_inner_microstep: 3501.36 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
1822
[2024-05-22 10:16:34,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1414.80 | bwd_microstep: 3159.10 | bwd_inner_microstep: 3159.09 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
10855
[2024-05-22 10:16:39,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1405.54 | bwd_microstep: 3157.15 | bwd_inner_microstep: 3157.13 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.14
7073
[2024-05-22 10:16:44,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1527.10 | bwd_microstep: 3497.51 | bwd_inner_microstep: 3497.49 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.13
7493
[2024-05-22 10:16:48,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1391.43 | bwd_microstep: 3096.93 | bwd_inner_microstep: 3096.92 | bwd_allreduce_microstep: 0.00 | step_microstep: 0.15
2832
[2024-05-22 10:16:53,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 3.34 | optimizer_step: 5.62
[2024-05-22 10:16:53,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1386.35 | bwd_microstep: 3099.21 | bwd_inner_microstep: 3096.18 | bwd_allreduce_microstep: 2.98 | step_microstep: 54.13
[2024-05-22 10:16:53,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 11839.54 | bwd: 26759.43 | bwd_inner: 26756.31 | bwd_allreduce: 3.02 | step: 55.36
{'loss': 0.1264, 'learning_rate': 1.7307641654890942e-05, 'epoch': 0.8}
 27%|██▋       | 1252/4686 [8:29:44<24:40:50, 25.87s/it]

khangnguyenhuu · May 22 '24

Hello, I got the same error with loss=0. Did you resolve it?

Mohamed-Dhouib · Jun 25 '24

Hello, I got the same error with loss=0. Did you resolve it?

Yes, I have; this is the solution that worked for me:

  • max_seq_length didn't fit the tokens created by the vision encoder plus the tokens from the input prompt (vision-encoder tokens + prompt tokens > max_seq_length). I reduced max_dynamic_patch to 6 (to decrease the vision-encoder tokens) and increased max_seq_length to 3072, and the loss is normal; see the sketch below.
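
A quick pre-flight check along the same lines (the per-tile constant assumes the InternVL-1.5 defaults above; prompt_budget is just an illustrative allowance for text tokens):

TOKENS_PER_TILE = 256  # (448 // 14) ** 2 * 0.5 ** 2 under the assumed defaults

def min_seq_length(max_dynamic_patch, use_thumbnail=True, prompt_budget=512):
    """Smallest max_seq_length fitting worst-case image tokens plus a prompt."""
    tiles = max_dynamic_patch + (1 if use_thumbnail else 0)
    return tiles * TOKENS_PER_TILE + prompt_budget

print(min_seq_length(12))  # 3840: the original 2048 was far too small
print(min_seq_length(6))   # 2304: 3072 leaves comfortable headroom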

khangnguyenhuu · Jun 26 '24