
Improve memory matrix for ORTModule

Open pengwa opened this issue 1 year ago • 0 comments

Memory matrix for ORTModule

Also collect the nvidia-smi memory reading and the parameter, gradient, and buffer sizes. The collection is exposed as a function, so it can also be called externally for debugging.
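As a rough illustration of the parameter/gradient/buffer part, the sketch below sums tensor storage for a module. It is not the ORTModule implementation: `module_memory_sizes` and `_size_mib` are hypothetical names, and the code only assumes a PyTorch-style module exposing `parameters()` and `buffers()`, with tensors that provide `numel()`, `element_size()`, and a `.grad` attribute.

```python
from typing import Dict, Iterable

MIB = 1024 * 1024


def _size_mib(tensors: Iterable) -> int:
    """Sum the storage of the given tensors, rounded down to whole MiB."""
    total_bytes = sum(t.numel() * t.element_size() for t in tensors)
    return total_bytes // MIB


def module_memory_sizes(module) -> Dict[str, int]:
    """Hypothetical helper: report parameter, gradient, and buffer sizes in MiB.

    `module` is assumed to behave like a torch.nn.Module.
    """
    params = list(module.parameters())
    # Gradients exist only after a backward pass, so skip parameters without one.
    grads = [p.grad for p in params if p.grad is not None]
    return {
        "param size": _size_mib(params),
        "grad size": _size_mib(grads),
        "buffer size": _size_mib(module.buffers()),
    }
```

Called before the first backward pass, such a helper would report a gradient size of 0, which matches the `grad size: 0` entries in the pre-backward phases of the log.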

2024-02-23 09:24:54,828 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: pre_forward | nvm smi: 10550 | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 816 | param size: 5314 | grad size: 0 | buffer size: 8
2024-02-23 09:24:55,002 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: post_forward | nvm smi: 10550 | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 816 | param size: 5314 | grad size: 0 | buffer size: 8
2024-02-23 09:24:55,124 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: pre_backward | nvm smi: 10550 | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 816 | param size: 5314 | grad size: 0 | buffer size: 8
2024-02-23 09:24:55,327 orttraining.rank-0 [INFO] - rank-0 step 1 memory (MiB) | phase: post_backward | nvm smi: 10550 | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 851 | param size: 5314 | grad size: 12 | buffer size: 8
  0%|▏ | 2/3200 [01:28<32:39:35, 36.77s/it]
2024-02-23 09:24:55,728 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: pre_forward | nvm smi: 10550 | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 851 | param size: 5314 | grad size: 0 | buffer size: 8
2024-02-23 09:24:55,908 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: post_forward | nvm smi: 10550 | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 851 | param size: 5314 | grad size: 0 | buffer size: 8
2024-02-23 09:24:56,031 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: pre_backward | nvm smi: 10550 | allocated: 8926 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 851 | param size: 5314 | grad size: 0 | buffer size: 8
2024-02-23 09:24:56,231 orttraining.rank-0 [INFO] - rank-0 step 2 memory (MiB) | phase: post_backward | nvm smi: 10550 | allocated: 6098 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 218 | max inactive: 851 | param size: 5314 | grad size: 12 | buffer size: 8
  0%|▏ | 3/3200 [01:29<18:06:11, 20.39s/it]
2024-02-23 09:24:56,414 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) | phase: pre_forward | nvm smi: 10550 | allocated: 5331 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 219 | max inactive: 851 | param size: 5314 | grad size: 0 | buffer size: 8
2024-02-23 09:24:56,585 orttraining.rank-0 [INFO] - rank-0 step 3 memory (MiB) | phase: post_forward | nvm smi: 10550 | allocated: 8162 | max allocated: 9039 | cached: 9382 | max cached: 9382 | inactive: 400 | max inactive: 851 | param size: 5314 | grad size: 0 | buffer size: 8
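
For debugging, log lines in the format above can be turned into records and compared across phases. The following is a minimal sketch that assumes the exact `key: value` layout shown in this sample; `parse_memory_line` is a hypothetical helper, not part of onnxruntime.

```python
import re
from typing import Dict, Optional

# Matches the "rank-N step M memory (MiB)" prefix of each memory log entry.
_HEADER = re.compile(r"rank-(?P<rank>\d+) step (?P<step>\d+) memory \(MiB\)")
# Matches one "| key: value" field, e.g. "| max allocated: 9039".
_FIELD = re.compile(r"\|\s*([^:|]+?):\s*([^|\s]+)")


def parse_memory_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one memory log line into a dict, or return None if it is not one."""
    header = _HEADER.search(line)
    if header is None:
        return None
    record: Dict[str, object] = {
        "rank": int(header["rank"]),
        "step": int(header["step"]),
    }
    # Only scan the text after the header, so a progress bar that was
    # printed on the same line before the timestamp is ignored.
    for key, value in _FIELD.findall(line[header.end():]):
        record[key] = int(value) if value.isdigit() else value
    return record
```

With records like these, the growth in `allocated` between `pre_forward` and `post_forward` in step 1 of the sample works out to 8162 - 5331 = 2831 MiB.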

pengwa, Feb 23 '24 09:02