Siddharth Singh
WIP PR for pipeline parallelism. Has convergence issues.
This PR enables token dropping for full tensor parallelism. Also corrects timers. (Still WIP)
https://github.com/jettify/pytorch-optimizer/blob/910b414565427f0a66e20040475e7e4385e066a5/torch_optimizer/shampoo.py#L130 Shouldn't the second argument be `-0.5/order`? For example, with order 2, the authors raise the preconditioner matrices to the -1/4th power.
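For context, a minimal sketch of the exponent in question, assuming a symmetric positive semi-definite preconditioner. `matrix_power_sym` is a hypothetical illustrative helper, not the pytorch-optimizer implementation:

```python
import torch

def matrix_power_sym(mat: torch.Tensor, p: float) -> torch.Tensor:
    """Raise a symmetric PSD matrix to a real power via eigendecomposition.
    (Illustrative helper; not the library's actual code path.)"""
    eigvals, eigvecs = torch.linalg.eigh(mat)
    eigvals = eigvals.clamp(min=1e-12)  # guard against tiny negative eigenvalues
    return eigvecs @ torch.diag(eigvals ** p) @ eigvecs.T

order = 2  # a matrix-shaped parameter has two preconditioners
precond = torch.randn(4, 4)
precond = precond @ precond.T + 1e-3 * torch.eye(4)  # make it PSD
root = matrix_power_sym(precond, -0.5 / order)  # -1/4 power for order 2
```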
Steps to run:
1. Install AxoNN (dependencies: PyTorch and mpi4py):
   - `git clone [email protected]:axonn-ai/axonn.git`
   - `cd axonn`
   - `git checkout 45647ea`
   - `pip install -e .`
2. Preparing a...
**Describe the bug** I am trying to launch multiple Megatron-DeepSpeed jobs on a Slurm-based cluster. For each job, I want to create a different hostfile called `hostfile_${SLURM_JOBID}`. However, when...
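A minimal sketch of generating one such per-job hostfile from Slurm's node list, assuming `slots=8` GPUs per node (the slot count is an assumption; DeepSpeed's hostfile format is `hostname slots=N`):

```python
import os
import subprocess

# Expand Slurm's compressed node list (e.g. "node[01-04]") into hostnames.
jobid = os.environ["SLURM_JOBID"]
nodelist = os.environ["SLURM_JOB_NODELIST"]
hostnames = subprocess.run(
    ["scontrol", "show", "hostnames", nodelist],
    capture_output=True, text=True, check=True,
).stdout.split()

# Write a DeepSpeed-style hostfile unique to this job.
with open(f"hostfile_{jobid}", "w") as f:
    for host in hostnames:
        f.write(f"{host} slots=8\n")  # slots=8 is an assumed GPU count
```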
https://github.com/microsoft/DeepSpeedExamples/blob/737c6740bec38b77a24a59135b6481a53d566b38/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_log_output/opt-1.3b-globalBatchSize128.log#L4 Why is the PPL here ~4k when we are starting from a pretrained model?
bfloat16 is the go-to datatype for mixed precision training of large neural networks. This PR aims to add bf16 support to AxoNN.
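For reference, a minimal sketch of the generic PyTorch bf16 mixed-precision pattern (standard `torch.autocast` usage, not AxoNN's API):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # weights stay in fp32
opt = torch.optim.AdamW(model.parameters())
x = torch.randn(8, 1024, device="cuda")

# Forward pass runs matmuls in bf16; gradients and optimizer state remain fp32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).square().mean()
loss.backward()
opt.step()
```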
Why?
1. Reduce-scatters: these happen on weight gradients, and researchers increasingly want to do them in fp32 (see the sketch after this list).
2. All-gathers: with torch.autocast, these were happening in fp32 by...
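A minimal sketch of point 1, upcasting a bf16 gradient to fp32 before the reduce-scatter. `reduce_scatter_grad_fp32` is a hypothetical helper and assumes the gradient size divides evenly across ranks; AxoNN's actual implementation may differ:

```python
import torch
import torch.distributed as dist

def reduce_scatter_grad_fp32(grad_bf16: torch.Tensor) -> torch.Tensor:
    """Reduce-scatter a bf16 gradient in fp32 to avoid precision loss
    when summing across ranks. (Illustrative helper, not AxoNN's code.)
    Assumes dist.init_process_group() has already been called."""
    world = dist.get_world_size()
    full = grad_bf16.float()  # upcast before communication
    shard = torch.empty(full.numel() // world,
                        dtype=torch.float32, device=full.device)
    dist.reduce_scatter_tensor(shard, full)  # sums fp32 shards across ranks
    return shard
```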
Might be helpful for end users?