is fused layernorm really better?
https://github.com/pytorch/pytorch/commit/8b87f9a5107e8b3c4f87d5297af698bb55838d81#diff-f12c726e3e8cd2b4768f8984fef27059
I think we don't need to use the apex fused layernorm anymore; torch layernorm is better. What do you think?
cc. @stas00 @thomasw21
- python frontend: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/model/fused_layer_norm.py#L62
- kernel: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/fused_kernels/layer_norm_cuda_kernel.cu#L17
I'd say it's best to ask upstream at Meg-LM level, as surely they have benchmarked their code.
Perhaps @jaredcasper could answer your question.
Don't you plan to update this repo if upstream isn't updated?
This is not what I meant. I meant to first ask the original authors why they did it this way. Meg-LM is a highly optimized library, so often they have a good reason for why they are doing something in a certain way.
It'd be good for us to inquire before rushing to change things.
Specific to this line of inquiry, it appears that they don't use apex's fused layernorm, but a modified version of it. So while we know the original apex version is slower than pytorch's, we don't know anything about the performance of their modified version unless you have benchmarked it already.
Just to clarify, I believe the only difference between the fused layer norm in Megatron's code and apex's is in the types, but that could lead to pretty different performance depending on the types used in the benchmark, if one required a separate cast operation.
We haven't benchmarked against upstream torch in a while, so it'd be interesting to know if theirs is faster. Apex/megatron was faster when we first started using it and we just haven't really revisited it since then.
Yes, I think the only difference between the apex LayerNorm and Megatron's fused layernorm is the type casting for bfloat16. I'll benchmark torch.nn.LayerNorm in bfloat16 against the Megatron LayerNorm and compare speeds. Thanks.
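Roughly, the comparison I have in mind looks like this (a minimal sketch, not a definitive benchmark: the shapes are arbitrary, timing is done with CUDA events, and apex's `FusedLayerNorm` is used as a stand-in; swapping in Megatron's `MixedFusedLayerNorm` would assume the fused kernels are built and an apex build with bf16 support):

```python
import torch
from apex.normalization import FusedLayerNorm  # stand-in for Megatron's MixedFusedLayerNorm

def bench_ms(layer, x, iters=200, warmup=20):
    # time forward + backward, since that's what matters during training
    for _ in range(warmup):
        layer(x).sum().backward()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        layer(x).sum().backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average ms per fwd+bwd

hidden = 4096
# bf16 needs an Ampere-or-newer GPU
x = torch.randn(8, 2048, hidden, device="cuda", dtype=torch.bfloat16, requires_grad=True)

torch_ln = torch.nn.LayerNorm(hidden).to(device="cuda", dtype=torch.bfloat16)
fused_ln = FusedLayerNorm(hidden).to(device="cuda", dtype=torch.bfloat16)

print(f"torch.nn.LayerNorm  : {bench_ms(torch_ln, x):.3f} ms")
print(f"apex FusedLayerNorm : {bench_ms(fused_ln, x):.3f} ms")
```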
And let's start including the actual benchmark code in these comments so that:
- others can validate it - it's very easy to make subtle mistakes when writing benchmarks
- we can re-run these in the future
- it's easier to extend to include more comparatives rather than write from scratch
I know in my post I shared the outcome https://github.com/huggingface/transformers/issues/9377 but not the benchmark - will try to do better in the future.
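For example, something along these lines is short enough to paste into a comment and re-run later (a minimal sketch using `torch.utils.benchmark`; the shapes and dtype are arbitrary, the apex import is an assumption, and Megatron's `MixedFusedLayerNorm` could be added as another entry in the dict):

```python
import torch
from torch.utils import benchmark
from apex.normalization import FusedLayerNorm  # assumes apex is installed

hidden = 4096
x = torch.randn(8, 2048, hidden, device="cuda", dtype=torch.float16)

# implementations to compare; extend this dict with more candidates as needed
candidates = {
    "torch.nn.LayerNorm": torch.nn.LayerNorm(hidden).to(device="cuda", dtype=torch.float16),
    "apex FusedLayerNorm": FusedLayerNorm(hidden).to(device="cuda", dtype=torch.float16),
}

results = []
for name, ln in candidates.items():
    results.append(
        benchmark.Timer(
            stmt="ln(x)",  # forward only; a backward pass can be timed the same way
            globals={"ln": ln, "x": x},
            label="layernorm fwd (fp16)",
            sub_label=name,
        ).blocked_autorange(min_run_time=1)
    )

benchmark.Compare(results).print()
```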
Reposting from chat for documentation:
EleutherAI found the same and removed fused layernorm from GPT-NeoX: https://github.com/EleutherAI/gpt-neox/pull/428