
[BERT] Trying out fixes in init ckpt loading and input pipeline for large scale runs

Open sgpyc opened this issue 4 years ago • 4 comments

sgpyc · Sep 09 '21 09:09

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

github-actions[bot] · Sep 09 '21 09:09

@sgpyc The changes appear to be orthogonal to the differences observed between running the reference at true scale and running it with GA (gradient accumulation). Given that there is still an 8% difference between those code paths, it does not seem that the original issue is resolved. Only minimal changes that address existing open functional issues should be made this late in the v1.1 schedule.

nvcforster · Sep 22 '21 16:09

Consider this PR a WIP for finding the cause(s) of the convergence difference between at-scale and GA runs. The two potential culprits are 1) whether optimizer slots are loaded along with the init checkpoint (around L170; see the sketch below), and 2) a slight change in the input loader.
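For point 1, a minimal sketch of the kind of toggle in question (not the PR's actual diff; the `build_assignment_map` helper, the `load_optimizer_slots` flag, and the slot-name patterns are assumptions for illustration):

```python
import tensorflow as tf

def build_assignment_map(init_ckpt, load_optimizer_slots=False):
    """Map checkpoint variables to model variables, optionally skipping
    optimizer slots (e.g. Adam/LAMB moment estimates)."""
    assignment_map = {}
    for name, _ in tf.train.list_variables(init_ckpt):
        # Assumed slot-name patterns; the real checkpoint layout may differ.
        is_slot = any(tag in name for tag in ("adam_m", "adam_v", "lamb"))
        if is_slot and not load_optimizer_slots:
            continue  # slots start from zero instead of being restored
        assignment_map[name] = name
    return assignment_map

# TF1-style initialization, as in BERT-style pretraining scripts:
# tf.compat.v1.train.init_from_checkpoint(init_ckpt,
#                                         build_assignment_map(init_ckpt))
```

Whether the slots are restored or start from zero changes the first optimizer steps after the init checkpoint, which is why this is a plausible source of the gap.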

After further tests, these two changes bring the convergence of the at-scale (768 partitions) and GA (128 partitions & GA=6) runs closer, but still not identical, at BS=6912. GA=6 shows the biggest convergence gap among GA={3, 6, 9, 18, 27}. Higher GA values, e.g. 128 partitions & GA=27 for BS=6912, still show the same convergence behavior as the RCPs and the at-scale submissions.
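For concreteness on why these two configurations are comparable: assuming the global batch is split evenly, BS=6912 over 768 partitions is a per-partition micro-batch of 9, and 128 partitions with GA=6 likewise processes 768 micro-batches of 9 per optimizer step. A minimal, hypothetical sketch of a GA update in TF2 style (`model`, `loss_fn`, `optimizer`, and `micro_batches` are stand-in names, not the reference's API):

```python
import tensorflow as tf

def accumulated_step(model, loss_fn, optimizer, micro_batches):
    """Apply one optimizer step from gradients summed over GA micro-batches."""
    accum = [tf.zeros_like(v) for v in model.trainable_variables]
    for features, labels in micro_batches:
        with tf.GradientTape() as tape:
            # Scale by 1/GA so the summed gradient matches the gradient
            # of one large batch of size GA * micro_batch_size.
            loss = loss_fn(labels, model(features, training=True))
            loss /= len(micro_batches)
        grads = tape.gradient(loss, model.trainable_variables)
        accum = [a + g for a, g in zip(accum, grads)]
    optimizer.apply_gradients(zip(accum, model.trainable_variables))
```

In exact arithmetic this matches a single large-batch step, so any remaining gap likely comes from floating-point summation order, RNG/dropout streams, or input-pipeline state rather than the update rule itself.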

I suggest not merging this, and for now using 128 partitions with GA to test RCPs for large-scale, large-batch-size runs.

sgpyc · Sep 23 '21 11:09

Should we either merge or drop this PR?

johntran-nv · Sep 28 '22 16:09