Qi Penghui
This is done in https://github.com/sail-sg/zero-bubble-pipeline-parallelism/tree/zb-h1-quick-start?tab=readme-ov-file
> I SEE zero-bubble-pipeline-parallelism disabled FusedLayerNorm

I don't think we disable any such thing. And of course we support flash-attn.
transformer_engine is not supported because it doesn't expose a way to split the backward pass into a weight-gradient part and an activation-gradient part (a sketch of what that split means is below). As for overlap_grad_reduce, I think it's already well supported. Please provide more information...
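For context, "splitting the backward pass" means producing the activation gradient (dX) immediately, since the previous pipeline stage is waiting on it, while deferring the weight gradient (dW) to fill pipeline bubbles later. A minimal PyTorch sketch (my own illustration, not the repo's code; `deferred_dw` is a hypothetical stand-in for the repo's WeightGradStore):

```python
import torch

deferred_dw = []  # hypothetical stand-in for the repo's WeightGradStore

class LinearSplitBackward(torch.autograd.Function):
    """Linear layer whose backward emits dX immediately and defers dW."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_input = grad_out @ weight  # dX: the previous stage needs this now
        # dW is queued to run later, e.g. inside a pipeline bubble.
        deferred_dw.append(lambda: grad_out.t() @ x)
        # Return None for the weight grad: it is computed outside autograd.
        return grad_input, None
```

Transformer Engine's fused modules own their backward internally, so there is no hook to intercept the dW half and defer it, which is why they can't be supported directly.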
> Hello! [@QPHutu](https://github.com/QPHutu),
>
> May I ask which parts of Transformer Engine are causing issues?
>
> This would be very helpful for applying Zero Bubble Pipeline Parallel to...
The same problem with DeepSeek-R1-Distill-Qwen-1.5B:

```
ValueError: Token id 151869 is out of vocabulary
```

> Qwen's tokenizer has special tokens that are not defined in the vocabulary. See [#11980](https://github.com/vllm-project/vllm/pull/11980)

doesn't help....
> Since a lot of people are commenting about this, here's a simple explanation for why this happens:
>
> Qwen and some other models come with a few hundred...
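To see the mismatch concretely, here is a small sketch (my own illustration; the model id is the checkpoint from this thread) that compares the model's embedding size with the tokenizer's size:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
cfg = AutoConfig.from_pretrained(model_id)
tok = AutoTokenizer.from_pretrained(model_id)

print("model vocab (embedding rows):", cfg.vocab_size)
print("tokenizer size incl. special tokens:", len(tok))
# Ids in [len(tok), cfg.vocab_size) are reserved embedding rows with no
# tokenizer entry; if the model ever samples one of them, detokenization
# fails with "Token id ... is out of vocabulary".
```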
I guess most people use FP16 for LLM training, so we only implemented our code in [MixedPrecisionOptimizer](https://github.com/sail-sg/zero-bubble-pipeline-parallelism/blob/main/megatron/core/optimizer/optimizer.py#L507).
> How do we ensure that all weights in weight_grad_buffers have executed the start_grad_sync operation within the WeightGradStore.clear() method?
>
> As shown in the timeline below, for the...
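To make the question concrete, here is a minimal sketch (an assumption on my part, using the put/flush/clear naming and the `start_grad_sync` hook mentioned above, not the repo's exact code) of how clear() can guarantee every deferred dW has run before grad sync starts:

```python
import queue

class WeightGradStore:
    """Sketch of a deferred weight-gradient store (illustration only)."""
    cache = []            # dW closures for the current micro-batch
    store = queue.Queue() # sealed micro-batches of dW work

    @classmethod
    def put(cls, dw_func):
        cls.cache.append(dw_func)

    @classmethod
    def flush(cls):
        cls.store.put(cls.cache)  # seal one micro-batch's dW work
        cls.cache = []

    @classmethod
    def clear(cls, weight_grad_buffers):
        # Drain every queued dW closure first, so all weight gradients
        # are fully written into the buffers...
        while not cls.store.empty():
            for dw_func in cls.store.get():
                dw_func()
        # ...and only then kick off the overlappable gradient reduction.
        for buf in weight_grad_buffers:
            buf.start_grad_sync()
```

Because clear() drains the queue synchronously before touching the buffers, every weight in weight_grad_buffers has its gradient materialized by the time start_grad_sync is called.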
> In the current version, they dropped support for overlap_grad_reduce, which can be found in the file megatron.core.zbpp_utils.py. I feel so confused. I am trying to fix this in my...
> > In the current version, they dropped support for overlap_grad_reduce, which can be found in the file megatron.core.zbpp_utils.py. I feel so confused. I am trying to fix this...