axlearn
Print step time for each step
This is helpful when step time varies from step to step: the per-step times in the logs let you spot such cases quickly.
That makes sense. Should I hide it behind an option? For me it was important for troubleshooting a variable step time issue. Or would you rather leave this out of the code base entirely? I'm fine with that too.
I've added it as a config parameter and set it to false by default.
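For readers following along, here is a minimal sketch of what an opt-in per-step timer could look like. It is not the code from this PR and not the axlearn API: `StepTimeLogger`, `log_step_time`, `clock`, and `emit` are hypothetical names chosen for illustration. It measures wall-clock time between consecutive step boundaries, so with asynchronous dispatch the numbers only approximate device time unless the caller blocks on the step's outputs first.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StepTimeLogger:
    """Hypothetical helper for illustration; not part of axlearn."""

    # Assumed flag name. Off by default, so nothing is logged unless enabled.
    log_step_time: bool = False
    # Injectable clock and sink so the helper can be exercised without sleeping.
    clock: Callable[[], float] = time.perf_counter
    emit: Callable[[str], None] = print
    _last: Optional[float] = field(default=None, init=False)

    def on_step_end(self, step: int) -> Optional[float]:
        """Returns seconds since the previous call, or None on the first call."""
        now = self.clock()
        elapsed = None if self._last is None else now - self._last
        self._last = now
        if self.log_step_time and elapsed is not None:
            self.emit(f"step={step} step_time={elapsed:.3f}s")
        return elapsed
```

In a training loop this would be called once at the end of every step, for example `timer.on_step_end(step)`, with `log_step_time=True` set only while troubleshooting.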
I have been using this for 2 use cases:
- Ensure step time is stable across steps. In the past on GPU, networking has sometimes become unstable, so that collectives take longer on some steps than on others. E.g. step time varies between 5 and 7 seconds.
- Quickly evaluate whether there is a performance gain without having to wait for 100 steps.
Please let me know if there is any interest in merging it @markblee