
batch=1: why is adapter latency so much higher than LoRA in the paper?

Open · macqueen09 opened this issue · 4 comments

In Section 3 of the LoRA paper:

Adapter Layers Introduce Inference Latency: There are many variants of adapters. We focus on the original design by Houlsby et al. (2019), which has two adapter layers per Transformer block, and a more recent one by Lin et al. (2020), which has only one per block but with an additional LayerNorm (Ba et al., 2016). While one can reduce the overall latency by pruning layers or exploiting multi-task settings, there is no direct way to bypass the extra compute in adapter layers. This seems like a non-issue since adapter layers are designed to have few parameters (sometimes <1% of the original model) by having a small bottleneck dimension, which limits the FLOPs they can add. However, large neural networks rely on hardware parallelism to keep the latency low, and adapter layers have to be processed sequentially. This makes a difference in the online inference setting where the batch size is typically as small as one.

So why are adapter layers so special that they have to be processed sequentially? Aren't other parts of an LLM, such as the regular Transformer blocks or the MLP and LayerNorm inside each block, also processed sequentially? Why are adapters any different? They don't seem unlike an SE module (Squeeze-and-Excitation Networks).
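For reference, here is a minimal sketch of the bottleneck adapter the quoted passage is describing (PyTorch; the layer names and bottleneck size are illustrative, not taken from the paper's code). The point is that the adapter sits after a sub-layer in the forward path, so its two small matmuls cannot start until that sub-layer has finished:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # few parameters...
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ...but two extra matmuls that can only run after the sub-layer
        # output x is available, i.e. they add depth, not width.
        return x + self.up(self.act(self.down(x)))

# Inside a Transformer block the forward pass becomes, schematically:
#   h = attention(x);  h = adapter_1(h)
#   h = mlp(h);        h = adapter_2(h)
# Each adapter_i is an extra sequential step on the critical path.
```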

macqueen09 · Sep 27, 2023

In the LoRA paper we can see that, at batch size 1, the adapter can add on the order of 20% inference latency. Why? How can so few parameters cause such a large slowdown? (A rough timing sketch follows below the screenshot.)

[image: inference latency table from the LoRA paper]
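As a rough illustration only (not the paper's measurement setup; the layer sizes and iteration counts below are made up), one can time a single linear layer with and without a bottleneck adapter at batch size 1. Absolute numbers depend entirely on the hardware, and on a GPU you would also need to synchronize around the timers; the gap mostly reflects the extra sequential work rather than the added FLOPs:

```python
import time
import torch
import torch.nn as nn

d_model, bottleneck, iters = 1024, 64, 200
x = torch.randn(1, 128, d_model)  # batch size 1, sequence length 128

base = nn.Linear(d_model, d_model)
down = nn.Linear(d_model, bottleneck)
up = nn.Linear(bottleneck, d_model)

def bench(fn, warmup=20):
    with torch.no_grad():
        for _ in range(warmup):      # warm-up iterations
            fn(x)
        start = time.perf_counter()
        for _ in range(iters):
            fn(x)
        return (time.perf_counter() - start) / iters * 1e3  # ms per forward

t_base = bench(lambda t: base(t))
t_adapted = bench(lambda t: base(t) + up(torch.relu(down(t))))
print(f"base: {t_base:.3f} ms  with adapter: {t_adapted:.3f} ms")
```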

macqueen09 · Sep 27, 2023

This is because when the batch size is small, we need to parallelize over width to gain the best hardware efficiency. Adapters add to the depth, which has to be processed sequentially.
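In other words (a minimal sketch with illustrative shapes): LoRA's low-rank update can be folded into the frozen weight before deployment, so inference runs the same single matmul as the original model, while an adapter leaves extra layers in the forward path that must wait for the previous output.

```python
import torch

d, r = 1024, 8
W = torch.randn(d, d)      # frozen pretrained weight
B = torch.zeros(d, r)      # LoRA factors (delta_W = B @ A)
A = torch.randn(r, d)

# Merge once before serving: no extra modules remain in the model.
W_merged = W + B @ A

x = torch.randn(1, d)
y = x @ W_merged.T         # same depth and kernel count as the base model

# An adapter instead computes something like
#   h = sublayer(x); h = h + up(act(down(h)))
# at inference time: two additional matmuls that cannot overlap with
# the sub-layer because they depend on its output.
```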

edwardjhu · Oct 29, 2023

This is because when the batch size is small, we need to parallelize over width to gain the best hardware efficiency. Adapters add to the depth, which has to be processed sequentially.

@edwardjhu Can you please explain this in layman's terms?

barvin04 · Feb 1, 2024