Yzx835
Same question here. I tested both vLLM and SGLang: with `tensor_model_parallel_size=2`, SGLang is much slower than vLLM, but with `tensor_model_parallel_size=1` their speeds are nearly identical.
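(For reference, setting the same TP degree in both engines looks roughly like the sketch below; the model path is a placeholder and the SGLang offline `Engine` API is assumed here rather than the exact benchmark setup.)

```python
from vllm import LLM
import sglang as sgl

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model

# vLLM: shard the model over 2 GPUs with tensor parallelism
vllm_engine = LLM(model=MODEL, tensor_parallel_size=2)

# SGLang: same TP degree via the offline engine
# (run one engine at a time; both will claim GPU memory)
sgl_engine = sgl.Engine(model_path=MODEL, tp_size=2)
```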
> A draft for FP8 training. It currently depends on FSDP2 and torchao to train with per-tensor FP8 quantization. To enable it, install torchao and set `strategy=fsdp2` and `fsdp_config.fp8=True`. Note:...
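(For anyone wiring this up by hand rather than through the trainer config: the underlying composition is roughly torchao's float8 conversion followed by FSDP2 sharding. A minimal sketch, assuming torchao's `convert_to_float8_training` and PyTorch's `fully_shard`; the model shapes and launch command are placeholders, not the actual code behind this draft.)

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 API (public since PyTorch 2.6)
from torchao.float8 import convert_to_float8_training

# assumed launch: torchrun --nproc_per_node=<ngpus> this_script.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).cuda()

# torchao's default float8 training recipe is per-tensor (tensorwise)
# dynamic scaling; this swaps each nn.Linear for a Float8Linear
convert_to_float8_training(model)

# FSDP2: shard the submodules first, then the root module
for layer in model:
    fully_shard(layer)
fully_shard(model)
```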
hello @danielvegamyhre, I also tested a bigger linear shape and model size:

```python
# create model and sample input
m = nn.Sequential(
    nn.Linear(16384, 16384),
    nn.Linear(16384, 16384),
    nn.Linear(16384, 16384),
    nn.Linear(16384, 16384),
    ...
```
> Try filtering out the last linear `nn.Linear(16384, 128)` in your `module_filter_fn`; it has such a small N dim that it will have a substantial slowdown with float8. That is...
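(For reference, a filter along those lines could look like the sketch below, using torchao's `convert_to_float8_training`; the `out_features` threshold is an illustrative assumption, and the divisible-by-16 check reflects the float8 kernel shape requirement.)

```python
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

def module_filter_fn(mod: nn.Module, fqn: str) -> bool:
    if isinstance(mod, nn.Linear):
        # skip linears with a small N dim (e.g. the final 16384 -> 128 layer),
        # where float8 overhead outweighs the gains; 1024 is an
        # illustrative threshold, not a tuned value
        if mod.out_features < 1024:
            return False
        # float8 kernels require dims divisible by 16
        if mod.in_features % 16 != 0 or mod.out_features % 16 != 0:
            return False
    return True

m = nn.Sequential(
    nn.Linear(16384, 16384),
    nn.Linear(16384, 16384),
    nn.Linear(16384, 128),  # stays in high precision thanks to the filter
)
convert_to_float8_training(m, module_filter_fn=module_filter_fn)
```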
@vkuzo Thank you! I tested the new code, which ignores the first few iterations, and I did see a speedup:

- fp8 training, torch compile, training time: 22.640191793441772
- without fp8 training, torch...
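(For context, "ignores the first few iterations" means a warmup phase excluded from the measurement, so torch.compile tracing and kernel autotuning don't pollute the numbers. A minimal sketch with made-up warmup/step counts, not the exact benchmark:)

```python
import time
import torch

def benchmark(model, inp, warmup_iters=10, timed_iters=100):
    # warmup: triggers torch.compile tracing and autotuning, which would
    # otherwise dominate the first iterations
    for _ in range(warmup_iters):
        model(inp).sum().backward()
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(timed_iters):
        model(inp).sum().backward()
    torch.cuda.synchronize()  # drain queued GPU work before stopping the clock
    return time.perf_counter() - start
```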