Qi Penghui
This is done in https://github.com/sail-sg/zero-bubble-pipeline-parallelism/tree/zb-h1-quick-start?tab=readme-ov-file
> I SEE zero-bubble-pipeline-parallelism disabled FusedLayerNorm

I don't think we disable any such thing. And of course we support flash-attn.
transformer_engine is not supported because it doesn't expose a way to split the backward pass into a weight-gradient part and an activation-gradient part (a sketch of what that split means is below). As for overlap_grad_reduce, I think it's already well supported. Please provide more information...
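For context, "splitting the backward pass" means producing the activation gradient (dX) immediately, since the previous pipeline stage is waiting on it, while deferring the weight gradient (dW) to fill pipeline bubbles later. A minimal PyTorch sketch (my own illustration, not the repo's code; `deferred_dw` is a hypothetical stand-in for the repo's WeightGradStore):

```python
import torch

deferred_dw = []  # hypothetical stand-in for the repo's WeightGradStore

class LinearSplitBackward(torch.autograd.Function):
    """Linear layer whose backward emits dX immediately and defers dW."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_input = grad_out @ weight  # dX: the previous stage needs this now
        # dW is queued to run later, e.g. inside a pipeline bubble.
        deferred_dw.append(lambda: grad_out.t() @ x)
        # Return None for the weight grad: it is computed outside autograd.
        return grad_input, None
```

Transformer Engine's fused modules own their backward internally, so there is no hook to intercept the dW half and defer it, which is why they can't be supported directly.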
> Hello! [@QPHutu](https://github.com/QPHutu),
>
> May I ask which parts of Transformer Engine are causing issues?
>
> This would be very helpful for applying Zero Bubble Pipeline Parallel to...
The same problem with DeepSeek-R1-Distill-Qwen-1.5B:

```
ValueError: Token id 151869 is out of vocabulary
```

> Qwen's tokenizer has special tokens that are not defined in the vocabulary. See [#11980](https://github.com/vllm-project/vllm/pull/11980)

doesn't help....
> Since a lot of people are commenting about this, here's a simple explanation for why this happens:
>
> Qwen and some other models come with a few hundred...
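To see the mismatch concretely, here is a small sketch (my own illustration; the model id is the checkpoint from this thread) that compares the model's embedding size with the tokenizer's size:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
cfg = AutoConfig.from_pretrained(model_id)
tok = AutoTokenizer.from_pretrained(model_id)

print("model vocab (embedding rows):", cfg.vocab_size)
print("tokenizer size incl. special tokens:", len(tok))
# Ids in [len(tok), cfg.vocab_size) are reserved embedding rows with no
# tokenizer entry; if the model ever samples one of them, detokenization
# fails with "Token id ... is out of vocabulary".
```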
I guess most people use FP16 for LLM training, so we only implemented our code in [MixedPrecisionOptimizer](https://github.com/sail-sg/zero-bubble-pipeline-parallelism/blob/main/megatron/core/optimizer/optimizer.py#L507).
> How do we ensure that all weights in weight_grad_buffers have executed the start_grad_sync operation within the WeightGradStore.clear() method?
>
> As shown in the timeline below, for the...
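To make the question concrete, here is a minimal sketch (an assumption on my part, using the put/flush/clear naming and the `start_grad_sync` hook mentioned above, not the repo's exact code) of how clear() can guarantee every deferred dW has run before grad sync starts:

```python
import queue

class WeightGradStore:
    """Sketch of a deferred weight-gradient store (illustration only)."""
    cache = []            # dW closures for the current micro-batch
    store = queue.Queue() # sealed micro-batches of dW work

    @classmethod
    def put(cls, dw_func):
        cls.cache.append(dw_func)

    @classmethod
    def flush(cls):
        cls.store.put(cls.cache)  # seal one micro-batch's dW work
        cls.cache = []

    @classmethod
    def clear(cls, weight_grad_buffers):
        # Drain every queued dW closure first, so all weight gradients
        # are fully written into the buffers...
        while not cls.store.empty():
            for dw_func in cls.store.get():
                dw_func()
        # ...and only then kick off the overlappable gradient reduction.
        for buf in weight_grad_buffers:
            buf.start_grad_sync()
```

Because clear() drains the queue synchronously before touching the buffers, every weight in weight_grad_buffers has its gradient materialized by the time start_grad_sync is called.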
> In the current version, they dropped support for overlap_grad_reduce, which can be found in the file megatron.core.zbpp_utils.py. I feel so confused. I am trying to fix this in my...
> > In the current version, they dropped support for overlap_grad_reduce, which can be found in the file megatron.core.zbpp_utils.py. I feel so confused. I am trying to fix this...