
Results 8 comments of Siddharth Singh

Correctness check on 125M.yml with `use_axonn_model_parallelism: true`, `column_model_parallel_size: 1`, `row_model_parallel_size: 1`, `depth_model_parallel_size: 2`, `model_parallel_size: 2` on 2 GPUs. Dataset: enwik8 (the loss curve is smoothed over 100 iterations). ![image](https://github.com/EleutherAI/gpt-neox/assets/16764680/31241adc-38aa-48ed-8b9b-f69407d47b7b)
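For reference, a sketch of how those overrides might sit in 125M.yml — the key names are the ones from this comment, but the surrounding layout is assumed from gpt-neox's YAML config style, not copied from the actual file:

```yaml
# Hypothetical excerpt of configs/125M.yml with the AxoNN settings above.
"use_axonn_model_parallelism": true,
"column_model_parallel_size": 1,
"row_model_parallel_size": 1,
"depth_model_parallel_size": 2,   # 2-way depth parallelism across the 2 GPUs
"model_parallel_size": 2,
```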

@Quentin-Anthony I have updated the install instructions to install axonn from a fixed commit - 3ebc34c

@Quentin-Anthony Pushed some communication optimizations and also updated the instructions to install axonn from a newer commit - 45647ea. To enable these optimizations, you just need to set `optimize_axonn_communication: true`...
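Concretely, the flag would be added alongside the existing AxoNN settings — a sketch only, with the file layout assumed:

```yaml
# Assumed config fragment: turns on the AxoNN communication optimizations
# on top of the parallelism settings already in the config.
"use_axonn_model_parallelism": true,
"optimize_axonn_communication": true,
```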

Hi @loadams, sorry I didn't have the bandwidth to investigate this issue further. I just chugged along with creating hostfiles named "hostfiles" and running one job at a time.
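As a sketch of that workaround (node names, slot counts, and the launch line are placeholders; only the `hostname slots=N` hostfile format is DeepSpeed's):

```shell
# Keep a single hostfile and run one job at a time against it.
cat > hostfile <<'EOF'
node-01 slots=8
node-02 slots=8
EOF

# Placeholder launch; adjust the script and config paths to your setup.
# deepspeed --hostfile hostfile train.py --conf configs/125M.yml
```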

They are mostly identical. The Megatron implementation is tightly coupled to Megatron-LM, so you cannot easily use it elsewhere. DeepSpeed's implementation is modular, so you could parallelize other workloads outside...

@jwendlan can you update this in the same format as the current develop branch?

Thanks for the clarification. Is this behavior present for both the local and TE implementations, or just for TE? MCore inference solely uses the local implementation, hence my question.