Megatron-DeepSpeed

is fused layernorm really better?

Open · hyunwoongko opened this issue 4 years ago · 8 comments

https://github.com/pytorch/pytorch/commit/8b87f9a5107e8b3c4f87d5297af698bb55838d81#diff-f12c726e3e8cd2b4768f8984fef27059

Given the commit above, I think we don't need to use the Apex fused layernorm anymore; torch's native layernorm is now faster. What do you think?

cc. @stas00 @thomasw21
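
For anyone who wants to reproduce this, a minimal timing sketch (not from the original comment; the shapes and iteration counts are arbitrary assumptions):

```python
import time
import torch

torch.manual_seed(0)
ln = torch.nn.LayerNorm(1024).cuda()
x = torch.randn(16384, 1024, device="cuda", requires_grad=True)

for _ in range(10):                      # warmup
    ln(x).sum().backward()
    x.grad = None

torch.cuda.synchronize()
t0 = time.time()
for _ in range(100):
    ln(x).sum().backward()
    x.grad = None
torch.cuda.synchronize()
print(f"torch.nn.LayerNorm fwd+bwd: {(time.time() - t0) / 100 * 1e3:.3f} ms/iter")
```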

hyunwoongko · Oct 11 '21

  • python frontend: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/model/fused_layer_norm.py#L62
  • kernel: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/fused_kernels/layer_norm_cuda_kernel.cu#L17
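
For orientation, the linked frontend follows the usual pattern of a torch.autograd.Function dispatching into a compiled CUDA extension. A condensed, illustrative sketch (the extension module name and exact signatures below follow the Apex layout this code was derived from, so treat them as assumptions rather than the authoritative code; see the linked files for the real thing):

```python
import torch

# compiled extension built from layer_norm_cuda_kernel.cu (must be built first;
# module name is illustrative)
import fused_layer_norm_cuda

class FusedLayerNormAffineFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, bias, normalized_shape, eps):
        # the kernel returns the output plus the per-row mean and inverse
        # variance, which the backward pass reuses
        output, mean, invvar = fused_layer_norm_cuda.forward_affine(
            input.contiguous(), normalized_shape, weight, bias, eps)
        ctx.normalized_shape = normalized_shape
        ctx.eps = eps
        ctx.save_for_backward(input, weight, bias, mean, invvar)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias, mean, invvar = ctx.saved_tensors
        grad_input, grad_weight, grad_bias = fused_layer_norm_cuda.backward_affine(
            grad_output.contiguous(), mean, invvar, input,
            ctx.normalized_shape, weight, bias, ctx.eps)
        return grad_input, grad_weight, grad_bias, None, None
```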

hyunwoongko · Oct 11 '21

I'd say it's best to ask upstream at Meg-LM level, as surely they have benchmarked their code.

Perhaps @jaredcasper could answer your question.

stas00 · Oct 12 '21

Don't you plan to update this repo if the upstream isn't updated?

hyunwoongko avatar Oct 13 '21 19:10 hyunwoongko

Don't you plan to update this repo if the upstream isn't updated?

That is not what I meant. I meant that we should first ask the original authors why they did it this way. Meg-LM is a highly optimized library, so they often have a good reason for doing things a certain way.

It'd be good for us to inquire before rushing to change things.

Specific to this line of inquiry, it appears that they don't use Apex's fused layernorm as-is, but a modified version of it. So while we know the original Apex version is slower than PyTorch's, we don't know anything about the performance of their modified function unless you have already benchmarked it.

stas00 · Oct 13 '21

Just to clarify, I believe the only difference between the fused layer norm in Megatron's code and the one in Apex is in the types it handles, but that could lead to pretty different performance depending on the types used in the benchmark, if one implementation required a separate cast operation.

We haven't benchmarked against upstream torch in a while; it'd be interesting to know if theirs is faster now. Apex/Megatron was faster when we first started using it, and we just haven't really revisited it since.
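
To illustrate the cast-cost point, a micro-benchmark sketch (not from this thread; the shapes and the explicit fp32 round-trip are illustrative assumptions, standing in for whatever casting a given implementation might force):

```python
import time
import torch

x = torch.randn(16384, 1024, device="cuda", dtype=torch.bfloat16)
ln_bf16 = torch.nn.LayerNorm(1024).to("cuda", torch.bfloat16)
ln_fp32 = torch.nn.LayerNorm(1024).to("cuda")

def bench(fn, iters=100):
    for _ in range(10):          # warmup
        fn()
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - t0) / iters * 1e3

# native bf16 path vs. an implementation that round-trips through fp32
print(f"native bf16:          {bench(lambda: ln_bf16(x)):.3f} ms")
print(f"fp32 cast round-trip: {bench(lambda: ln_fp32(x.float()).to(torch.bfloat16)):.3f} ms")
```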

jaredcasper · Oct 13 '21

Yes, I think the only difference between the Apex layernorm and Megatron's fused layernorm is the type casting for bfloat16. I'll benchmark torch.nn.LayerNorm in bfloat16 against the Megatron layernorm and check the speed. Thanks.

hyunwoongko · Oct 14 '21

And let's start including the actual benchmark code in these comments so that:

  1. others can validate it - it's very easy to make subtle mistakes when writing benchmarks
  2. we can re-run these in the future
  3. it's easier to extend with more comparisons than to write from scratch (see the sketch below)

I know that in my post https://github.com/huggingface/transformers/issues/9377 I shared the outcome but not the benchmark; I will try to do better in the future.
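
For example, a self-contained sketch along these lines would cover all three points (assumptions: the Megatron import path below, that the fused kernels have been built, and the arbitrary shapes; adjust to taste):

```python
import time
import torch

# assumed import path in this repo; requires the fused kernels to be compiled
from megatron.model.fused_layer_norm import MixedFusedLayerNorm

def bench_fwd_bwd(ln, x, iters=100):
    for _ in range(10):          # warmup
        ln(x).sum().backward()
        x.grad = None
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        ln(x).sum().backward()
        x.grad = None
    torch.cuda.synchronize()
    return (time.time() - t0) / iters * 1e3

hidden = 4096
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    x = torch.randn(4096, hidden, device="cuda", dtype=dtype, requires_grad=True)
    torch_ln = torch.nn.LayerNorm(hidden).to("cuda", dtype)
    fused_ln = MixedFusedLayerNorm(hidden).to("cuda", dtype)
    print(f"{dtype}: torch {bench_fwd_bwd(torch_ln, x):.3f} ms/iter, "
          f"fused {bench_fwd_bwd(fused_ln, x):.3f} ms/iter")
```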

stas00 · Oct 14 '21

Reposting from chat for documentation:

EleutherAI found the same and removed fused layernorm from GPT-NeoX: https://github.com/EleutherAI/gpt-neox/pull/428

StellaAthena · Oct 21 '21