Yzx835
Same question here. I tested both vLLM and SGLang: with `tensor_model_parallel_size=2`, SGLang is much slower than vLLM, but with `tensor_model_parallel_size=1` their speeds are nearly identical.
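(For reference, setting the same TP degree in both engines looks roughly like the sketch below; the model path is a placeholder and the SGLang offline `Engine` API is assumed here rather than the exact benchmark setup.)

```python
from vllm import LLM
import sglang as sgl

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model

# vLLM: shard the model over 2 GPUs with tensor parallelism
vllm_engine = LLM(model=MODEL, tensor_parallel_size=2)

# SGLang: same TP degree via the offline engine
# (run one engine at a time; both will claim GPU memory)
sgl_engine = sgl.Engine(model_path=MODEL, tp_size=2)
```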
> A draft for FP8 training. It currently depends on FSDP2 and torchao to train with per-tensor FP8 quantization. To enable it, install torchao and set `strategy=fsdp2` and `fsdp_config.fp8=True`. Note:...
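(For anyone wiring this up by hand rather than through the trainer config: the underlying composition is roughly torchao's float8 conversion followed by FSDP2 sharding. A minimal sketch, assuming torchao's `convert_to_float8_training` and PyTorch's `fully_shard`; the model shapes and launch command are placeholders, not the actual code behind this draft.)

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 API (public since PyTorch 2.6)
from torchao.float8 import convert_to_float8_training

# assumed launch: torchrun --nproc_per_node=<ngpus> this_script.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).cuda()

# torchao's default float8 training recipe is per-tensor (tensorwise)
# dynamic scaling; this swaps each nn.Linear for a Float8Linear
convert_to_float8_training(model)

# FSDP2: shard the submodules first, then the root module
for layer in model:
    fully_shard(layer)
fully_shard(model)
```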
hello @danielvegamyhre, I also tested a bigger linear shape and model size:

```python
# create model and sample input
m = nn.Sequential(
    nn.Linear(16384, 16384),
    nn.Linear(16384, 16384),
    nn.Linear(16384, 16384),
    nn.Linear(16384, 16384),
    ...
```
> Try filtering out the last linear `nn.Linear(16384, 128)` in your `module_filter_fn`; it has such a small N dim that it will have a substantial slowdown with float8. That is...
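(For reference, a filter along those lines could look like the sketch below, using torchao's `convert_to_float8_training`; the `out_features` threshold is an illustrative assumption, and the divisible-by-16 check reflects the float8 kernel shape requirement.)

```python
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

def module_filter_fn(mod: nn.Module, fqn: str) -> bool:
    if isinstance(mod, nn.Linear):
        # skip linears with a small N dim (e.g. the final 16384 -> 128 layer),
        # where float8 overhead outweighs the gains; 1024 is an
        # illustrative threshold, not a tuned value
        if mod.out_features < 1024:
            return False
        # float8 kernels require dims divisible by 16
        if mod.in_features % 16 != 0 or mod.out_features % 16 != 0:
            return False
    return True

m = nn.Sequential(
    nn.Linear(16384, 16384),
    nn.Linear(16384, 16384),
    nn.Linear(16384, 128),  # stays in high precision thanks to the filter
)
convert_to_float8_training(m, module_filter_fn=module_filter_fn)
```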
@vkuzo Thank you! I tested the new code, which ignores the first few iterations, and I did see a speedup:

- fp8 training, torch compile, training time: 22.640191793441772
- without fp8 training, torch...
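(For context, "ignores the first few iterations" means a warmup phase excluded from the measurement, so torch.compile tracing and kernel autotuning don't pollute the numbers. A minimal sketch with made-up warmup/step counts, not the exact benchmark:)

```python
import time
import torch

def benchmark(model, inp, warmup_iters=10, timed_iters=100):
    # warmup: triggers torch.compile tracing and autotuning, which would
    # otherwise dominate the first iterations
    for _ in range(warmup_iters):
        model(inp).sum().backward()
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(timed_iters):
        model(inp).sum().backward()
    torch.cuda.synchronize()  # drain queued GPU work before stopping the clock
    return time.perf_counter() - start
```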