Hiki
@JThh Hello, I got the same error when I tried to train a 34B model on 3 nodes (8 × 40 GB GPUs, 500 GB main memory). I have seen the CPU memory usage...
Also, I set tp = 8 on each node, so I suspect each of the 8 processes initializes a full copy of the model. A 34B model in fp32 is roughly 136 GB of parameters, so 8 CPU-side copies would far exceed the 500 GB of main memory.
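A workaround I'm considering is deferring weight materialization so that each process doesn't build a full CPU copy before sharding. This is just a rough sketch, assuming `colossalai.lazy.LazyInitContext` is compatible with the booster here; the model class, config, and parallel sizes are placeholders:

```python
# Rough sketch: build the model lazily so each of the 8 processes per node
# does not materialize the full 34B weights in CPU memory.
# Assumes LazyInitContext works with the plugin in use.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from colossalai.lazy import LazyInitContext
from transformers import LlamaConfig, LlamaForCausalLM  # placeholder 34B-style model

colossalai.launch_from_torch(config={})

with LazyInitContext():
    model = LlamaForCausalLM(LlamaConfig())  # weights are not allocated yet

plugin = HybridParallelPlugin(tp_size=8, pp_size=3)  # placeholder layout for 3 nodes x 8 GPUs
booster = Booster(plugin=plugin)
# boost() should materialize only this rank's shard instead of the full model.
model, *_ = booster.boost(model)
```

Does this match how initialization is supposed to work here, or is there a recommended way to avoid the per-process copies?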
@flybird11111 Hi, I didn't find `enable_gradient_accumulation` or `no_sync()` in HybridParallelPlugin (https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/booster/plugin/hybrid_parallel_plugin.py), so I'm wondering how to add gradient accumulation with HybridParallelPlugin following https://colossalai.org/docs/features/gradient_accumulation_with_booster. Could you provide more details? A sketch of what I have in mind is below.
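For context, here is a rough sketch of the manual accumulation I had in mind: scale the loss by the accumulation factor and step the optimizer only every N micro-batches, accepting the extra gradient sync per micro-batch since `no_sync()` isn't available. The accumulation factor, toy model, and parallel sizes are placeholders, not from my real setup:

```python
# Rough sketch: manual gradient accumulation without no_sync(),
# stepping the optimizer once every GRAD_ACCUM_STEPS micro-batches.
import colossalai
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

GRAD_ACCUM_STEPS = 4  # placeholder accumulation factor

colossalai.launch_from_torch(config={})
plugin = HybridParallelPlugin(tp_size=2, pp_size=1)  # placeholder parallel layout
booster = Booster(plugin=plugin)

model = nn.Linear(128, 10)  # toy model standing in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))
dataloader = plugin.prepare_dataloader(dataset, batch_size=8)

model, optimizer, criterion, dataloader, _ = booster.boost(
    model, optimizer, criterion, dataloader
)

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    # Scale the loss so the accumulated gradient averages over the micro-batches.
    loss = criterion(model(inputs), labels) / GRAD_ACCUM_STEPS
    booster.backward(loss, optimizer)
    # Apply the optimizer only once per GRAD_ACCUM_STEPS micro-batches.
    if (step + 1) % GRAD_ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Is this the intended way, and how should it change when pp_size > 1, where I believe the loop would go through `booster.execute_pipeline` instead?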