Siddharth Singh
WIP PR for pipeline parallelism. Has convergence issues.
This PR enables token dropping for full tensor parallelism. Also corrects timers. (Still WIP)
https://github.com/jettify/pytorch-optimizer/blob/910b414565427f0a66e20040475e7e4385e066a5/torch_optimizer/shampoo.py#L130 Shouldn't the second argument be `-0.5/order`? For example, with order 2, the authors raise the preconditioner matrices to the -1/4th power.
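For context, a minimal sketch of the exponent in question, assuming a symmetric positive semi-definite preconditioner. `matrix_power_sym` is a hypothetical illustrative helper, not the pytorch-optimizer implementation:

```python
import torch

def matrix_power_sym(mat: torch.Tensor, p: float) -> torch.Tensor:
    """Raise a symmetric PSD matrix to a real power via eigendecomposition.
    (Illustrative helper; not the library's actual code path.)"""
    eigvals, eigvecs = torch.linalg.eigh(mat)
    eigvals = eigvals.clamp(min=1e-12)  # guard against tiny negative eigenvalues
    return eigvecs @ torch.diag(eigvals ** p) @ eigvecs.T

order = 2  # a matrix-shaped parameter has two preconditioners
precond = torch.randn(4, 4)
precond = precond @ precond.T + 1e-3 * torch.eye(4)  # make it PSD
root = matrix_power_sym(precond, -0.5 / order)  # -1/4 power for order 2
```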
Steps to run:
1. Install AxoNN (dependencies: PyTorch and mpi4py):
   - `git clone [email protected]:axonn-ai/axonn.git`
   - `cd axonn`
   - `git checkout 45647ea`
   - `pip install -e .`
2. Preparing a...
**Describe the bug** I am trying to launch multiple Megatron-DeepSpeed jobs on a Slurm-based cluster. For each job, I want to create a different hostfile called `hostfile_${SLURM_JOBID}`. However, when...
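A minimal sketch of generating one such per-job hostfile from Slurm's node list, assuming `slots=8` GPUs per node (the slot count is an assumption; DeepSpeed's hostfile format is `hostname slots=N`):

```python
import os
import subprocess

# Expand Slurm's compressed node list (e.g. "node[01-04]") into hostnames.
jobid = os.environ["SLURM_JOBID"]
nodelist = os.environ["SLURM_JOB_NODELIST"]
hostnames = subprocess.run(
    ["scontrol", "show", "hostnames", nodelist],
    capture_output=True, text=True, check=True,
).stdout.split()

# Write a DeepSpeed-style hostfile unique to this job.
with open(f"hostfile_{jobid}", "w") as f:
    for host in hostnames:
        f.write(f"{host} slots=8\n")  # slots=8 is an assumed GPU count
```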
https://github.com/microsoft/DeepSpeedExamples/blob/737c6740bec38b77a24a59135b6481a53d566b38/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_log_output/opt-1.3b-globalBatchSize128.log#L4 Why is the PPL here ~4k when we are starting from a pretrained model?
bfloat16 is the go-to datatype for mixed precision training of large neural networks. This PR aims to add bf16 support to AxoNN.
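For reference, a minimal sketch of the generic PyTorch bf16 mixed-precision pattern (standard `torch.autocast` usage, not AxoNN's API):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # weights stay in fp32
opt = torch.optim.AdamW(model.parameters())
x = torch.randn(8, 1024, device="cuda")

# Forward pass runs matmuls in bf16; gradients and optimizer state remain fp32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).square().mean()
loss.backward()
opt.step()
```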
Why?
1. Reduce-scatters: these happen on weight gradients, and researchers increasingly want to do them in fp32 (see the sketch after this list).
2. All-gathers: with torch.autocast, these were happening in fp32 by...
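A minimal sketch of point 1, upcasting a bf16 gradient to fp32 before the reduce-scatter. `reduce_scatter_grad_fp32` is a hypothetical helper and assumes the gradient size divides evenly across ranks; AxoNN's actual implementation may differ:

```python
import torch
import torch.distributed as dist

def reduce_scatter_grad_fp32(grad_bf16: torch.Tensor) -> torch.Tensor:
    """Reduce-scatter a bf16 gradient in fp32 to avoid precision loss
    when summing across ranks. (Illustrative helper, not AxoNN's code.)
    Assumes dist.init_process_group() has already been called."""
    world = dist.get_world_size()
    full = grad_bf16.float()  # upcast before communication
    shard = torch.empty(full.numel() // world,
                        dtype=torch.float32, device=full.device)
    dist.reduce_scatter_tensor(shard, full)  # sums fp32 shards across ranks
    return shard
```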
Might be helpful for end users?