When training reaches step 1000, the training program automatically stops and exits
I ran into a strange problem: the training program automatically stopped and exited when training reached step 1000, and the terminal did not print any errors. The output is shown below:
```
...
Step 0: grad_norm=22.8066, loss=2.4613, param_norm=1379.6764
Step 100: grad_norm=60.2230, loss=10.5329, param_norm=1379.6764
Step 200: grad_norm=51.7156, loss=7.9182, param_norm=1379.6764
Step 300: grad_norm=50.5966, loss=6.2026, param_norm=1379.6764
Step 400: grad_norm=51.8473, loss=3.8295, param_norm=1379.6764
Step 500: grad_norm=52.1892, loss=2.6872, param_norm=1379.6766
Step 600: grad_norm=41.1263, loss=1.5617, param_norm=1379.6770
Step 700: grad_norm=34.3095, loss=1.1549, param_norm=1379.6771
Step 800: grad_norm=29.3910, loss=0.9973, param_norm=1379.6777
Step 900: grad_norm=20.1167, loss=0.4780, param_norm=1379.6781
Step 1000: grad_norm=16.2561, loss=0.3569, param_norm=1379.6785
 10%|████████▋ | 1000/10000 [07:39<1:10:19, 2.13it/s]
16:45:02.057 [I] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning. (37894:checkpoint_manager.py:1987)
16:45:02.058 [I] [process=0] Saving checkpoint at step 1000 (37894:checkpoint_manager.py:1408)
16:45:02.058 [I] [process=0] Started async saving checkpoint to /home/wm/code_of_Ken/openpi_all/openpi/checkpoints/pi0_aloha_fold_t_shirt/my_pi0_aloha_fold_t_shirt/1000. (37894:async_checkpointer.py:439)
16:45:02.058 [I] Using ThreadSafeKeyValueSignalingClient (37894:signaling_client.py:332)
16:45:02.065 [I] Creating tmp directory /home/wm/code_of_Ken/openpi_all/openpi/checkpoints/pi0_aloha_fold_t_shirt/my_pi0_aloha_fold_t_shirt/1000.orbax-checkpoint-tmp-0 (37894:atomicity.py:144)
16:45:02.066 [I] Wrote Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1761641102066009732, 'commit_timestamp_nsecs': None, 'custom_metadata': {}}, json={"item_handlers": null, "metrics": {}, "performance_metrics": {}, "init_timestamp_nsecs": 1761641102066009732, "commit_timestamp_nsecs": null, "custom_metadata": {}} to /home/wm/code_of_Ken/openpi_all/openpi/checkpoints/pi0_aloha_fold_t_shirt/my_pi0_aloha_fold_t_shirt/1000.orbax-checkpoint-tmp-0/_CHECKPOINT_METADATA (37894:checkpoint.py:186)
16:45:02.067 [I] Creating tmp directory /home/wm/code_of_Ken/openpi_all/openpi/checkpoints/pi0_aloha_fold_t_shirt/my_pi0_aloha_fold_t_shirt/1000.orbax-checkpoint-tmp-0/assets.orbax-checkpoint-tmp-1 (37894:atomicity.py:144)
16:45:02.067 [I] Creating tmp directory /home/wm/code_of_Ken/openpi_all/openpi/checkpoints/pi0_aloha_fold_t_shirt/my_pi0_aloha_fold_t_shirt/1000.orbax-checkpoint-tmp-0/params.orbax-checkpoint-tmp-2 (37894:atomicity.py:144)
16:45:02.068 [I] Creating tmp directory /home/wm/code_of_Ken/openpi_all/openpi/checkpoints/pi0_aloha_fold_t_shirt/my_pi0_aloha_fold_t_shirt/1000.orbax-checkpoint-tmp-0/train_state.orbax-checkpoint-tmp-3 (37894:atomicity.py:144)
16:45:02.089 [I] Transferring arrays to host memory with options: use_replica_parallel=True, enable_pinned_host_transfer=False (37894:replica_slices.py:341)
16:45:25.988 [I] Transferring arrays to host memory with options: use_replica_parallel=True, enable_pinned_host_transfer=False (37894:replica_slices.py:341)
16:45:26.000 [I] Array name: 'params.PaliGemma.llm.layers.mlp.gating_einsum.value', global shape: (18, 2, 2048, 16384), write shape: (18, 2, 2048, 16384), chosen chunk shape: (18, 2, 2048, 8192) (37894:tensorstore_utils.py:408)
16:45:26.445 [I] [process=0][thread=array_type_handler] Wrote 70 array_metadata.ArrayMetadata to /home/wm/code_of_Ken/openpi_all/openpi/checkpoints/pi0_aloha_fold_t_shirt/my_pi0_aloha_fold_t_shirt/1000.orbax-checkpoint-tmp-0/params.orbax-checkpoint-tmp-2/array_metadatas/process_0 (37894:array_metadata_store.py:198)
```
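For context on why a checkpoint save can exhaust host memory: the log shows Orbax transferring device arrays to host memory before writing them to disk, so the host briefly needs enough free RAM to stage a copy of the train state. Below is a minimal sketch of the arithmetic for the one array shape that appears in the log; the float32 assumption and the psutil check are illustrative, not taken from openpi.

```python
# Rough host-memory estimate for one array from the log, plus a check of
# what the machine actually has available. The dtype (float32) is an
# assumption; if the weights are stored in bfloat16 the figure halves.
import math
import psutil

shape = (18, 2, 2048, 16384)   # 'params.PaliGemma.llm.layers.mlp.gating_einsum.value' from the log
bytes_per_elem = 4             # assuming float32
array_gib = math.prod(shape) * bytes_per_elem / 1024**3
print(f"single array: {array_gib:.1f} GiB")   # ~4.5 GiB for this one tensor alone

avail_gib = psutil.virtual_memory().available / 1024**3
print(f"host RAM currently available: {avail_gib:.1f} GiB")
```

The full train state (all parameters plus optimizer state) is much larger than this single tensor, and if the operating system kills the process for running out of memory, nothing is printed to the terminal, which would match the silent exit described above.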
Have you encountered the same problem? Looking forward to your reply!
I have found the cause: it was insufficient system memory (RAM). The problem has been resolved.
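For anyone hitting the same silent exit, a quick way to confirm that host RAM (rather than GPU memory) is the bottleneck is to log free host memory right before the checkpoint save and to check the kernel log after the crash. A minimal, hypothetical sketch follows; names like should_save_checkpoint and save_checkpoint are placeholders, not openpi code.

```python
# Hypothetical monitoring hook: log free host RAM around the checkpoint save.
import logging
import psutil

def log_host_memory(step: int) -> None:
    vm = psutil.virtual_memory()
    logging.info(
        "step %d: host RAM used %.1f GiB / %.1f GiB (%.0f%%)",
        step, (vm.total - vm.available) / 1024**3, vm.total / 1024**3, vm.percent,
    )

# Inside the training loop, just before the save that was crashing:
# if should_save_checkpoint(step):
#     log_host_memory(step)   # a sharp spike here points at the async save
#     save_checkpoint(step)
```

After a crash, `dmesg | grep -i "out of memory"` (or `journalctl -k`) will show whether the kernel's OOM killer terminated the process, which would explain why no Python traceback was printed.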
At what point does the loss converge to yield the best model performance?
@sunmoon2018 I am also stuck on this. I have allocated about 90 GB of GPU RAM, but it still fails exactly when saving checkpoints. Could you share how much compute resolved this for you? This is the full pi0 model with batch size 64 on a 6000 Pro. Thanks in advance!