is fused layernorm really better?
https://github.com/pytorch/pytorch/commit/8b87f9a5107e8b3c4f87d5297af698bb55838d81#diff-f12c726e3e8cd2b4768f8984fef27059
I think we don't need to use the apex fused layernorm anymore; torch layernorm is better. What do you think?
cc. @stas00 @thomasw21
- python frontend: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/model/fused_layer_norm.py#L62
- kernel: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/fused_kernels/layer_norm_cuda_kernel.cu#L17
I'd say it's best to ask upstream at Meg-LM level, as surely they have benchmarked their code.
Perhaps @jaredcasper could answer your question.
Don't you plan to update this repo if upstream isn't updated?
This is not what I meant. I meant to first ask the original authors why they did it this way. Meg-LM is a highly optimized library, so often they have a good reason for why they are doing something in a certain way.
It'd be good for us to inquire before rushing to change things.
Specific to this line of inquiry, it appears that they don't use apex's fused layernorm, but a modified version of it. So while we know the original apex version is slower than pytorch's, we don't know anything about the performance of their modified version unless you have benchmarked it already.
Just to clarify, I believe the only difference between the fused layer norm in Megatron's code and apex's is in the types, but that could lead to pretty different performance depending on the types used in the benchmark, if one required a separate cast operation.
We haven't benchmarked against upstream torch in a while, so it'd be interesting to know if theirs is faster. Apex/megatron was faster when we first started using it and we just haven't really revisited it since then.
Yes, I think the only difference between the apex LayerNorm and Megatron's fused layernorm is the type casting for bfloat16. I'll benchmark torch.nn.LayerNorm in bfloat16 against the Megatron LayerNorm and compare speeds. Thanks.
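Roughly, the comparison I have in mind looks like this (a minimal sketch, not a definitive benchmark: the shapes are arbitrary, timing is done with CUDA events, and apex's `FusedLayerNorm` is used as a stand-in; swapping in Megatron's `MixedFusedLayerNorm` would assume the fused kernels are built and an apex build with bf16 support):

```python
import torch
from apex.normalization import FusedLayerNorm  # stand-in for Megatron's MixedFusedLayerNorm

def bench_ms(layer, x, iters=200, warmup=20):
    # time forward + backward, since that's what matters during training
    for _ in range(warmup):
        layer(x).sum().backward()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        layer(x).sum().backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average ms per fwd+bwd

hidden = 4096
# bf16 needs an Ampere-or-newer GPU
x = torch.randn(8, 2048, hidden, device="cuda", dtype=torch.bfloat16, requires_grad=True)

torch_ln = torch.nn.LayerNorm(hidden).to(device="cuda", dtype=torch.bfloat16)
fused_ln = FusedLayerNorm(hidden).to(device="cuda", dtype=torch.bfloat16)

print(f"torch.nn.LayerNorm  : {bench_ms(torch_ln, x):.3f} ms")
print(f"apex FusedLayerNorm : {bench_ms(fused_ln, x):.3f} ms")
```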
And let's start including the actual benchmark code in these comments so that:
- others can validate it - it's very easy to make subtle mistakes when writing benchmarks
- we can re-run these in the future
- it's easier to extend to include more comparatives rather than write from scratch
I know in my post I shared the outcome https://github.com/huggingface/transformers/issues/9377 but not the benchmark - will try to do better in the future.
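For example, something along these lines is short enough to paste into a comment and re-run later (a minimal sketch using `torch.utils.benchmark`; the shapes and dtype are arbitrary, the apex import is an assumption, and Megatron's `MixedFusedLayerNorm` could be added as another entry in the dict):

```python
import torch
from torch.utils import benchmark
from apex.normalization import FusedLayerNorm  # assumes apex is installed

hidden = 4096
x = torch.randn(8, 2048, hidden, device="cuda", dtype=torch.float16)

# implementations to compare; extend this dict with more candidates as needed
candidates = {
    "torch.nn.LayerNorm": torch.nn.LayerNorm(hidden).to(device="cuda", dtype=torch.float16),
    "apex FusedLayerNorm": FusedLayerNorm(hidden).to(device="cuda", dtype=torch.float16),
}

results = []
for name, ln in candidates.items():
    results.append(
        benchmark.Timer(
            stmt="ln(x)",  # forward only; a backward pass can be timed the same way
            globals={"ln": ln, "x": x},
            label="layernorm fwd (fp16)",
            sub_label=name,
        ).blocked_autorange(min_run_time=1)
    )

benchmark.Compare(results).print()
```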
Reposting from chat for documentation:
EleutherAI found the same and removed fused layernorm from GPT-NeoX: https://github.com/EleutherAI/gpt-neox/pull/428