turboderp

180 comments

Is it possible they're taking so long to load because of the datatype? If Torch doesn't have an efficient bfloat16->float16 function, it might end up in some super-inefficient fallback routine...
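For illustration, the kind of explicit cast that avoids any per-element fallback (a rough sketch; the shapes are made up):

```python
import torch

# Toy example: one explicit, vectorized bfloat16 -> float16 cast. In Torch
# this dispatches to a single native conversion kernel instead of whatever
# slow fallback path an implicit per-tensor conversion might hit.
w_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp16 = w_bf16.to(torch.float16)
```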

Well, if it works with 7b and 13b it's most likely related to GQA. Everything up until that 70b release has assumed that the number of heads is the same...
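To make the mismatch concrete, a toy sketch of the GQA shapes (numbers from Llama 2 70b: 64 query heads, 8 key/value heads; the variable names are illustrative, not ExLlama's):

```python
import torch

num_heads, num_kv_heads, head_dim, seq_len = 64, 8, 128, 16

q = torch.randn(1, num_heads, seq_len, head_dim)
k = torch.randn(1, num_kv_heads, seq_len, head_dim)

# Code written for 7b/13b assumes k has num_heads heads and breaks here.
# One common fix is to repeat each k/v head across its group of queries:
k = k.repeat_interleave(num_heads // num_kv_heads, dim=1)
assert k.shape[1] == q.shape[1]
```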

> I can directly use the weights generated by qlora?

If the weights are saved in float16, then yes, it doesn't have to match the model. And it should be possible...
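If the adapter was saved in some other dtype, casting it first is straightforward; a hypothetical snippet (the filenames are illustrative, not a fixed convention):

```python
import torch

# Cast every tensor in a saved LoRA adapter to float16 before use.
adapter = torch.load("adapter_model.bin", map_location="cpu")
adapter = {name: t.to(torch.float16) for name, t in adapter.items()}
torch.save(adapter, "adapter_model.fp16.bin")
```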

Well, it's not exactly a fix, because it should really work with fused attn, but I'll get to that. What I need, though, is an example 70b LoRA I can...

I'm still not sure what "dynamic" positional encodings actually means, and how you would use them with cached keys.

> Unsupported tensor Dtype

Have you updated ExLlama to the latest version? I only added bfloat16 very recently; it probably hasn't made it into the library yet.

A LoRA does add some overhead, especially when it's targeting all layers with rank-64 adapters. I really would caution everyone training these adapters not to crank up the rank thinking...
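Some back-of-envelope numbers on why the rank matters (hidden size 8192 assumed, as in 70b; this is arithmetic, not a profile):

```python
hidden = 8192
for rank in (8, 16, 64):
    # A LoRA adapter adds two matmuls per adapted matrix, (hidden x rank)
    # and (rank x hidden), so the extra weights, and roughly the extra
    # per-token work, grow linearly with the rank.
    extra = 2 * hidden * rank
    print(f"rank {rank:>2}: {extra:,} extra params per adapted matrix")
```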

> The correct implementation should cache the kv-embeddings before applying RoPE, as the RoPE embedding of every token changes when s changes.

This is the part that doesn't make sense to...
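To make the point at issue concrete, a toy sketch of the two caching orders (my own minimal RoPE below, nothing from ExLlama):

```python
import torch

def rope(x, positions, base=10000.0):
    # Minimal interleaved RoPE; x has shape (seq_len, head_dim).
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    ang = torch.outer(positions.float(), inv_freq)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

k_raw = torch.randn(8, 128)
pos = torch.arange(8)

# (a) Cache keys after RoPE: cheap at attention time, but the rotation
#     angles are baked in and can't be changed later.
cache_post = rope(k_raw, pos)

# (b) Cache keys before RoPE and rotate at attention time: a scaling
#     factor can change between steps, at the cost of re-rotating the
#     whole cache every step.
cache_pre = k_raw.clone()
k_used = rope(cache_pre, pos)
```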

I'm not sure what those do exactly, especially since the default RoPE implementation already adapts to the hidden dimension of the model. But the hidden dimension of the model is...
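For reference, the standard RoPE frequency table (as in the Llama reference code): the exponent is normalized by the rotary dimension (the head dimension in Llama), which is the sense in which the default implementation already adapts to the model's dimensions:

```python
import torch

head_dim, base = 128, 10000.0
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# Angle table for up to 2048 positions, shape (seq_len, head_dim / 2).
angles = torch.outer(torch.arange(2048).float(), inv_freq)
cos, sin = angles.cos(), angles.sin()
```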

I haven't tested 70B on A100 before, but the speed is close to what I've seen for 65B on A100, so I think this is about what's expected, yes.