Qi Penghui

13 comments by Qi Penghui

Done in https://github.com/sail-sg/zero-bubble-pipeline-parallelism/tree/zb-h1-quick-start?tab=readme-ov-file

> I see zero-bubble-pipeline-parallelism disabled FusedLayerNorm

I don't think we disable anything like that. And of course we support flash-attn.

transformer_engine is not supported because it doesn't expose a way to split the backward pass into weight-gradient and activation-gradient computations. As for overlap_grad_reduce, I think it's already well supported. Please provide more information...
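For readers unfamiliar with the split being discussed, here is a minimal PyTorch sketch of what "splitting the backward pass" means; the `deferred_wgrad` list is a hypothetical stand-in for the repo's WeightGradStore, not its actual API:

```python
import torch

# Deferred weight-gradient closures; a hypothetical stand-in for the
# repo's WeightGradStore.
deferred_wgrad = []

class LinearSplitBackward(torch.autograd.Function):
    """Linear op whose backward is split in two: the activation gradient
    (B) is returned immediately, while the weight gradient (W) is
    deferred so it can run later to fill pipeline bubbles."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_input = grad_out @ weight              # B: needed now by the upstream stage
        xd, gd = x.detach(), grad_out.detach()
        deferred_wgrad.append(lambda: gd.t() @ xd)  # W: queued, runs later
        return grad_input, None                     # weight grad handled manually

x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(16, 8, requires_grad=True)
LinearSplitBackward.apply(x, w).sum().backward()    # only B runs here
w.grad = deferred_wgrad.pop()()                     # W runs whenever we choose
```

A fused library kernel that computes both gradients inside one opaque backward offers no hook for this separation, which is why it can't be scheduled this way.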

> Hello! [@QPHutu](https://github.com/QPHutu),
>
> May I ask which parts of Transformer Engine are causing issues?
>
> This would be very helpful for applying Zero Bubble Pipeline Parallel to...

The same problem occurs with DeepSeek-R1-Distill-Qwen-1.5B:

```
ValueError: Token id 151869 is out of vocabulary
```

> Qwen's tokenizer has special tokens that are not defined in the vocabulary. See [#11980](https://github.com/vllm-project/vllm/pull/11980)

That doesn't help....

> Since a lot of people are commenting about this, here's a simple explanation for why this happens:
>
> Qwen and some other models come with a few hundred...
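To make the quoted explanation concrete, here is a small diagnostic sketch (assumes Hugging Face `transformers`; the model name is the one from the comment above):

```python
from transformers import AutoConfig, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tok = AutoTokenizer.from_pretrained(name)
cfg = AutoConfig.from_pretrained(name)

print("tokenizer vocab:", len(tok))    # ids the tokenizer defines
print("model vocab:", cfg.vocab_size)  # rows in the embedding table
# Ids that the model accepts but the tokenizer does not define will be
# rejected by any validation done purely against the tokenizer, which
# is what surfaces as "Token id ... is out of vocabulary".
```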

I guess most people use FP16 for LLM training, so we only implemented our code in [MixedPrecisionOptimizer](https://github.com/sail-sg/zero-bubble-pipeline-parallelism/blob/main/megatron/core/optimizer/optimizer.py#L507).

> How do we ensure that all weights in weight_grad_buffers have executed the start_grad_sync operation within the WeightGradStore.clear() method?
>
> As shown in the timeline below, for the...
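For context on that question, the ordering guarantee comes from draining the deferred-gradient queue before any synchronization starts. This is a hedged sketch assuming put/flush/pop/clear semantics similar to the repo's WeightGradStore; the class name and the `start_grad_sync` hook here are placeholders:

```python
import queue

class WeightGradStoreSketch:
    """Simplified illustration of a deferred weight-gradient store;
    names and details are illustrative, not the repo's exact code."""
    cache = []             # deferred W computations of the current microbatch
    store = queue.Queue()  # one sealed cache per microbatch

    @classmethod
    def put(cls, compute_wgrad):
        cls.cache.append(compute_wgrad)

    @classmethod
    def flush(cls):
        cls.store.put(cls.cache)  # seal the current microbatch
        cls.cache = []

    @classmethod
    def pop(cls):
        for compute_wgrad in cls.store.get():  # run one microbatch of W
            compute_wgrad()

    @classmethod
    def clear(cls, grad_buffers):
        # Drain every pending W first, so all weight gradients exist
        # before any buffer kicks off its gradient synchronization.
        while not cls.store.empty():
            cls.pop()
        for buf in grad_buffers:
            buf.start_grad_sync()  # placeholder hook: begin the all-reduce

class _Buffer:  # stub grad buffer for the demo
    def start_grad_sync(self):
        print("grad sync started")

WeightGradStoreSketch.put(lambda: print("weight grad computed"))
WeightGradStoreSketch.flush()
WeightGradStoreSketch.clear([_Buffer()])  # wgrad prints before the sync line
```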

> In the current version, they dropped support for overlap_grad_reduce, which can be found in the file megatron.core.zbpp_utils.py. I feel so confused. I tried to fix this in my...
