Apoorv Gupta
*Issue #, if available:*
*Description of changes:* Updated the SMDDP MNIST training example with the new APIs and additional information.
*Testing done:* Yes, tested on SageMaker.
Gradient accumulation allows training with larger batch sizes without scaling out. Added a new learner type:

```
learner.klass: 'axlearn.common.learner.AccumulatedLearner'
```

At a high level, the optimization does the following:
1. Input batch...
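The accumulation scheme can be sketched in plain JAX (a minimal illustration, not the `AccumulatedLearner` implementation; `loss_fn` and the microbatch split are assumptions for the demo):

```python
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Simple squared-error loss, for illustration only.
    x, y = batch
    pred = x @ params
    return jnp.mean((pred - y) ** 2)

def accumulated_grads(params, big_batch, num_microbatches):
    """Split a large batch into microbatches and average their gradients,
    so only one microbatch's activations are live at a time."""
    x, y = big_batch
    xs = jnp.reshape(x, (num_microbatches, -1) + x.shape[1:])
    ys = jnp.reshape(y, (num_microbatches, -1) + y.shape[1:])

    def step(acc, micro):
        g = jax.grad(loss_fn)(params, micro)
        return jax.tree_util.tree_map(jnp.add, acc, g), None

    zero = jax.tree_util.tree_map(jnp.zeros_like, params)
    total, _ = jax.lax.scan(step, zero, (xs, ys))
    return jax.tree_util.tree_map(lambda g: g / num_microbatches, total)

params = jnp.ones((4, 2))
x, y = jnp.ones((8, 4)), jnp.zeros((8, 2))
grads = accumulated_grads(params, (x, y), num_microbatches=4)
```

With equal-sized microbatches and a mean loss, the averaged microbatch gradients equal the full-batch gradient, so the result matches training at the larger batch size.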
This PR enables the use of Neuron devices in AXLearn for model training.
- Chooses the correct mesh for TRN devices for Fuji 7B via the mesh selector flag `--mesh_selector=neuron-trn1.32xlarge-64`.
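The mesh selector maps a hardware string to a mesh shape. A minimal sketch of that idea (the registry values and the `cpu-demo-8` entry are assumptions so it runs on CPU; the real AXLearn mesh rules differ):

```python
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"  # emulate 8 devices on CPU
import jax
import numpy as np
from jax.sharding import Mesh

# Illustrative registry: mesh selector string -> (data, model) mesh shape.
MESH_RULES = {
    "neuron-trn1.32xlarge-64": (16, 4),  # hypothetical layout for 64 Trainium cores
    "cpu-demo-8": (4, 2),                # demo entry so this sketch runs anywhere
}

def build_mesh(selector):
    data, model = MESH_RULES[selector]
    grid = np.asarray(jax.devices()).reshape(data, model)
    return Mesh(grid, axis_names=("data", "model"))

mesh = build_mesh("cpu-demo-8")
```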
Increases memory efficiency during large-scale training: input batches and labels are sharded along the 'data' axis. Added a new input data sharding option, `DataPartitionType.DATA`.
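Sharding the batch dimension along a 'data' mesh axis can be sketched as follows (an illustrative JAX snippet, not AXLearn's `DataPartitionType.DATA` code path; the CPU device emulation is for demonstration):

```python
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"  # emulate 8 devices
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-axis mesh named 'data'; the batch dimension is split across it,
# so each device holds only batch_size / num_devices examples.
mesh = Mesh(np.asarray(jax.devices()), axis_names=("data",))
sharding = NamedSharding(mesh, P("data"))

batch = jnp.ones((32, 128))   # (batch, features)
labels = jnp.zeros((32,))
batch = jax.device_put(batch, sharding)
labels = jax.device_put(labels, sharding)
```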
Saving out-projection improves training throughput while still fitting in the mesh defined by `neuron-(trn2|trn2n).48xlarge-64`.
Allow fallback from a multi-granule mesh to the standard mesh, as the standard mesh provides better performance on TRN2.
- Added corresponding tests for fallback and mesh creation on TRN2.
- Switch...
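A sketch of the fallback idea (illustrative, not the AXLearn implementation; on single-slice hardware such as this CPU demo, the granule-aware path and the fallback produce the same mesh shape):

```python
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"  # emulate 8 devices
import jax
import numpy as np
from jax.sharding import Mesh

def create_mesh(shape, hybrid=True):
    """Try a granule-aware (multi-granule) mesh; fall back to a standard
    row-major mesh when that construction is unavailable or fails."""
    devices = np.asarray(jax.devices())
    grid = None
    if hybrid:
        try:
            from jax.experimental import mesh_utils
            # Groups devices by granule (slice) first; with one granule the
            # DCN mesh shape is (1, 1).
            grid = mesh_utils.create_hybrid_device_mesh(shape, (1, 1), list(devices.flat))
        except Exception:
            grid = None  # fall back below
    if grid is None:
        # Standard fallback: simple row-major reshape of the flat device list.
        grid = devices.reshape(shape)
    return Mesh(grid, axis_names=("data", "model"))

mesh = create_mesh((4, 2))
```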