Print step time for each step

Open samos123 opened this issue 1 year ago • 3 comments

This is helpful when step time is variable: looking at the logs lets you quickly identify such cases.

samos123 · Mar 09 '24 00:03

That makes sense. Should I hide it behind an option? For me it was important for troubleshooting a variable step time issue. Or would you rather leave this out of the code base entirely? Note that I'm fine with that too.

samos123 · Mar 09 '24 04:03

I've added it as a config parameter and made it false by default.
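For readers who want to see the general shape of such an opt-in flag, here is a minimal sketch in plain Python. It is not the actual change: `LoopConfig`, `log_every_step_time`, and `run_steps` are hypothetical names chosen for illustration, and the real trainer config may look quite different.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LoopConfig:
    # Hypothetical flag name. Off by default, so existing logs are unchanged.
    log_every_step_time: bool = False


def run_steps(
    cfg: LoopConfig,
    step_fn: Callable[[int], None],
    num_steps: int,
    log: Callable[[str], None] = print,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Runs `step_fn` for `num_steps` steps and returns per-step wall times."""
    step_times = []
    for step in range(num_steps):
        start = clock()
        # With JAX, dispatch is asynchronous, so `step_fn` would need to block
        # on its outputs (e.g. via `block_until_ready`) for the timing to
        # reflect the actual step duration.
        step_fn(step)
        elapsed = clock() - start
        step_times.append(elapsed)
        if cfg.log_every_step_time:
            log(f"step {step}: {elapsed:.3f}s")
    return step_times
```

Calling `run_steps(LoopConfig(log_every_step_time=True), my_step, 10)` would print one line per step, while the default config stays silent.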

samos123 · Mar 09 '24 06:03

I have been using this for two use cases:

  • Ensure step time is stable across steps. In the past on GPU, the networking has become unstable, with collectives sometimes taking longer from one step to the next, e.g. step time varying between 5 and 7 seconds.
  • Quickly evaluate whether there is a performance gain without having to wait for 100 steps.
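
Both use cases come down to looking at a handful of per-step times. As a rough sketch of how that check could be automated, the hypothetical helper below (`summarize_step_times` and its `warmup_steps` parameter are illustrative assumptions, not part of the code base) summarizes a short list of step times after skipping initial steps that typically include compilation.

```python
from statistics import mean
from typing import Dict, List


def summarize_step_times(step_times: List[float], warmup_steps: int = 1) -> Dict[str, float]:
    """Summarizes per-step wall times (in seconds) after dropping warmup steps."""
    steady = step_times[warmup_steps:]
    if not steady:
        raise ValueError("Need at least one step after warmup.")
    lo, hi = min(steady), max(steady)
    avg = mean(steady)
    return {
        "min": lo,
        "max": hi,
        "mean": avg,
        # Spread relative to the mean; a large value suggests unstable step time.
        "relative_spread": (hi - lo) / avg,
    }
```

For example, step times of 5, 7, and 6 seconds after one warmup step give a mean of 6 seconds and a relative spread of about 0.33, which would flag the kind of 5-to-7-second variation described above. Comparing the `mean` of two short runs gives a quick, if noisy, read on whether a change improved performance.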

Please let me know if there is any interest in merging it, @markblee.

samos123 · May 13 '24 16:05