Support for RWKV
So this is a pretty immense task and I'd start with #45, but...
RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode.
So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).
It's entirely open-source, so not legally burdened like LLaMA, and (from what I've seen) is more powerful than BLOOM at the same parameter count.
I asked the RWKV Discord which implementation would be worth looking at, and this is what I was told:
RWKV-LM/RWKV-v4neo/src/model.py is the implementation that's actually used to train the large models; it's CUDA-only and has tons of features you probably don't need. rwkv_pip_package only implements inference, but is a good implementation and worth a look; it recently got a lot more complex due to supporting more and more strategies and including various optimizations. ChatRWKV/src/model_run is an older version, but I haven't played with it so I'm not sure how good it is. Might be worth a look since it's basically an older version of the one in rwkv_pip_package. RWKV_in_150_lines.py I still haven't fully checked out, but I know it doesn't support GPT mode, so that may or may not be less useful. Also worth a look is RWKV-v4neo/src/model_run.py, which is a small inference-only impl capable of loading the large RWKV checkpoints; I'm not sure if it has GPT mode, though.
So it sounds like rwkv_pip_package is the way to go as source material:
https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py
The following articles are very useful for understanding how RWKV works:
- https://johanwind.github.io/2023/03/23/rwkv_overview.html
- https://johanwind.github.io/2023/03/23/rwkv_details.html
An interesting detail from the latter is the following:
The largest number a 16-bit floating point number (float16) can represent is 65 504, anything above that overflows, which is bad. Most of the code has no problems with this, partially because the Layer Normalizations keep values in a reasonable range. However, the RWKV attention contains exponentially large numbers (exp(bonus + k)). In practice, the RWKV attention is implemented in a way where we factor out an exponential factor from num and den to keep everything within float16 range. See for example the time_mixing function in RWKV in 150 lines.
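Concretely, that factoring trick boils down to something like the following, shown per channel as a scalar sketch in Rust (my own paraphrase of the time_mixing step in "RWKV in 150 lines", not code from any of the repos linked here):

```rust
/// Numerically-stable form of the RWKV "wkv" update for a single channel.
/// A running maximum exponent `p` is factored out of the numerator `num` and
/// the denominator `den`, so no exp() call ever sees a huge argument.
fn wkv_step(
    k: f32,
    v: f32,
    bonus: f32,    // time_first ("bonus")
    decay: f32,    // time_decay, already negated/exponentiated (a negative number)
    num: &mut f32, // state: numerator, implicitly scaled by exp(p)
    den: &mut f32, // state: denominator, implicitly scaled by exp(p)
    p: &mut f32,   // state: the factored-out exponent
) -> f32 {
    // Output: conceptually (num*e^p + e^(bonus+k)*v) / (den*e^p + e^(bonus+k)),
    // but with max(p, bonus + k) divided out of both exponentials first.
    let q = (*p).max(bonus + k);
    let e1 = (*p - q).exp();
    let e2 = (bonus + k - q).exp();
    let wkv = (e1 * *num + e2 * v) / (e1 * *den + e2);

    // State update: decay the old state, mix in the new k/v, re-factor the max.
    let q = (*p + decay).max(k);
    let e1 = (*p + decay - q).exp();
    let e2 = (k - q).exp();
    *num = e1 * *num + e2 * v;
    *den = e1 * *den + e2;
    *p = q;

    wkv
}
```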
Those exponentially large values may pose issues for the GGML 4-bit quantisation format, which is already less than optimal; we would likely want GPTQ quantisation instead.
RWKV is really interesting, but the Python code is absolutely horrendous. I made an attempt to clean it up a little but the author wasn't interested.
I'd look at this one if you were interested in trying to make the Rust code just a frontend to a Python backend: https://github.com/harrisonvanderbyl/rwkvstic
But I sort of don't know what the benefit of doing that would be for a CLI app, because there's already stuff like Oobabooga's text-generation-webui that handles a lot of different model formats with a frontend.
I believe the rwkvstic repo I linked is on a path to get added to the actual Transformers repo as one of the supported models. If you wanted to do a Python frontend to stuff, maybe it would make more sense to look at making the frontend for Transformers rather than RWKV specifically. Then you'll pretty much get that for free once support is merged.
https://github.com/huggingface/transformers/issues/20737
https://github.com/huggingface/transformers/pull/20809
Sort of related, but the TextSynth server (closed source) actually supports/publishes some 4bit RWKV models: https://bellard.org/ts_server/
It can do pretty fast inference on CPU. I contacted the author and asked if he'd be willing to disclose the model format for those files, and he said he'd probably post a Python converter soon, which would document the format as well.
Nah, we'd want to port the actual inference code to Rust, similarly to what we did for LLaMA itself. The less Python we have in the codebase, the better 🦀
After a whole lot of struggling because I have essentially no idea of what I'm doing with NN or math stuff: https://github.com/KerfuffleV2/smolrsrwkv
Obviously that's not useful as anything more than an example of the basics, but it does work and generates tokens not all that much slower than the simple Torch version.
Oh wow, nice one! That's an excellent start - do you want to try porting it to ggml-rs? (#81 exposes it as a separate library, so you should be able to use a Git dependency)
(By the way, are you in the Discord? It'd be great to chat synchronously if need be)
Thanks, I didn't really do much more than port the Python example here though: https://johanwind.github.io/2023/03/23/rwkv_details.html
do you want to try porting it to ggml-rs?
I'd have to say probably no. I don't think I'm qualified to write a version that's actually worth using in real software. So if you or anyone else wants to use my code as an example or starting point, you're certainly very welcome.
I may mess around with trying to get it to work on backends other than ndarray but I still don't think it'll be suitable for production. Also, sad to say, I have a very short attention span. Just generally speaking, you're probably better off not depending on me for anything long term.
I don't think I'm qualified to write a version that's actually worth using in real software.
Eh, none of us are really LLM experts - it's more a matter of engineering and refinement. I think you're just as qualified as the rest of us. Happy to help any efforts get across the line.
Also, sad to say, I have a very short attention span. Just generally speaking, you're probably better off not depending on me for anything long term.
Don't worry, I know the feeling very well. No worries at all; a driveby contribution is fine as long as it's maintainable, and from what I've seen of your implementation, you're doing fine in that regard.
I'd encourage you to give it a try, but no pressure at all - even if you don't do it, your port will still be a great resource for those coming after (especially with those sweet static types!)
@philpax
I think you're just as qualified as the rest of us.
I don't know about that! I also really don't know any math stuff either. So I think I'd probably focus more on trying to make it a good example for understanding how the model works in its basic form, rather than something that could be used directly.
I'd encourage you to give it a try, but no pressure at all
I did look, but one thing I noticed is that GGML doesn't seem to support the operations that are required (or I don't know how to work around/refactor to avoid them). For example, how do you do things like subtract a tensor from a scalar value, or map a function like exp over a tensor's elements? It doesn't seem obvious how to do that with the API it provides.
It could be that it provides a higher-level way to accomplish the same thing, but unfortunately I don't have a high-level understanding of how it works, so that's kind of beyond my ability.
Anyway, maybe if you or someone showed me how to do that, porting it to GGML would be more practical. Here's an example of the kind of thing I mean: https://github.com/KerfuffleV2/smolrsrwkv/blob/c21cb8008b51aa10fb2c0eaa2a81714e8c27f76f/src/model.rs#LL125C70-L125C70 (that's elementwise map).
even if you don't do it, your port will still be a great resource for those coming after (especially with those sweet static types!)
Haha, yeah, I honestly found the Python example very hard to follow since there were no types and it was using weird NumPy shortcuts like `sometensors[a > b] = 0` — indexing with a boolean expression, what? Figuring out silly stuff like that actually took as much time as writing the actual logic.
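For anyone else puzzled by that idiom, here's roughly what it does, spelled out against ndarray (hypothetical helper, not code from the repo):

```rust
use ndarray::{Array1, Zip};

/// What the NumPy one-liner `t[a > b] = 0` actually does: zero every element
/// of `t` whose corresponding position in the boolean mask `a > b` is true.
fn zero_where_greater(t: &mut Array1<f32>, a: &Array1<f32>, b: &Array1<f32>) {
    Zip::from(t).and(a).and(b).for_each(|ti, &ai, &bi| {
        if ai > bi {
            *ti = 0.0;
        }
    });
}
```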
I cleaned up/reorganized the Rust version a bit, so hopefully it's easier to understand now. The state part was particularly bad in the initial version.
I also really don't know any math stuff either.
You know your way around a type system, that's good enough for me :P
I did look, but one thing I noticed is that GGML doesn't seem to support the operations that are required (or I don't know how to work around/refactor to avoid them). For example, how do you do things like subtract a tensor from a scalar value, or map a function like exp over a tensor's elements? It doesn't seem obvious how to do that with the API it provides.
Oh... yeah. Just had some free time to look at this, and yeah, ggml doesn't support some critical operators (it doesn't seem to support any form of exponentiation...).
For subtracting tensors with a value, you'd use op_sub, potentially with new_f32 (if I'm interpreting what you're asking correctly). That being said, there are definitely non-zero challenges here - I was hoping it was just a case of operations we'd left out of the bindings, but it seems like there are actual holes here.
ggml's API is based around building up a computation graph and then executing all of the computation at once, which means that if you need an operation it doesn't support, you'll need to add it yourself.
weird NumPy shortcuts like `sometensors[a > b] = 0` — indexing with a boolean expression, what?
Unfortunately, this is something I recognize from my MATLAB days - it's convenient, but it is absolutely bewildering the first time you see it 😅
I cleaned up/reorganized the Rust version a bit, so hopefully it's easier to understand now. The state part was particularly bad in the initial version.
Yeah, nice, appreciate it! Given that you can't actually implement a ggml version of this, I don't have any immediate suggestions for you, but you can explore the other Rust ML libraries if you're curious.
Still, tremendously cool to have a RWKV implementation in easy-to-understand Rust - you should be proud!
After profiling it, 80-90% or more of the time is just spent doing the matrix multiplication; the rest isn't really significant. How crazy would it be to just use ggml for that one operation? Even having to copy the data every time, it would probably still be worth it.
It might be a lot faster, but that still wouldn't really make the code suitable for general use, since it does everything in f32 and there isn't any easy way around that with the current approach.
I'm not convinced that ggml would be faster for the f32 case; I think most of its benefits come through with quantized values, which would be hard to test with RWKV.
Well, the big deal is currently it only uses a single thread. Supposedly ndarray supports threading but I've failed to get it to actually use more than one.
I also messed around with Rayon just running the 3 and 2 groups of .dot() operations in parallel, but it didn't really make a difference because there's one place it's forced to a single thread and one place where it can only run two in parallel. I only got like a 20-30% speed increase at best, when I have 6 cores and 12 threads available.
So just running those matmul ops in parallel would make a huge, huge difference I think.
Hmm, yeah, makes sense. Do the matmuls happen at the end or do they need to be copied back and forth?
Probably have to be copied back and forth. You can look here: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/src/model.rs
- `channel_mixing` (3) — `k` and `r` are independent, but `vk` depends on `k`.
- `time_mixing` (3) — all independent.
- `evaluate_layer` (1) — only happens once per step rather than per layer.
The matmuls in each group are basically right next to each other, but they need data from other calculations, and further calculations depend on their results as well.
(By the way, I'm getting reasonably happy with how readable it is now that it's all split into logical modules. The main function is looking pretty clean. If you have time to skim it, what do you think of the current state as an example for other projects?)
Maybe it's possible to construct ggml tensors directly from the ndarray representation? Not sure, though - the reason ggml can parallelize is that the computational graph lets it spread work across threads without having to wait for the main thread to dispatch more work. It might be possible to set up contexts/graphs to execute just the matmuls, but I'm not sure if the overhead makes it worth it.
Setzer's OK with adding more operations to ggml, so we'd ideally do it all there, but it'll require some work to wire up (cf #45).
Your definition of the model looks really clean! Nice work - it's definitely much easier to understand when it's split up like that. I wonder if we can split up llama.rs code like that - it might be a little difficult because of the aforementioned graph, but I don't see it being too troublesome.
@philpax
I found another way: https://github.com/KerfuffleV2/smolrsrwkv/blob/ab3df0c706ff786a943a3896d7d96b597b45fc98/src/util.rs#L121
It actually runs at acceptable speed now (relatively speaking), even on the 3B model.
Pretty simple now, but getting here was quite the struggle. ndarray claims to use the matrixmultiply crate, which supports threading and even has a feature flag to turn it on, but it does absolutely nothing - because ndarray simply never calls matrixmultiply for matrix-vector multiplication.
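For anyone curious, the general shape of the workaround is just fanning the per-row dot products out with rayon; this is a minimal sketch of the idea (hypothetical helper — the linked util.rs may differ in the details):

```rust
use ndarray::{Array1, Array2};
use rayon::prelude::*;

/// Compute y = M · x. ndarray won't thread the matrix–vector product itself,
/// so hand one dot product per output row to rayon instead.
fn par_matvec(m: &Array2<f32>, x: &Array1<f32>) -> Array1<f32> {
    let out: Vec<f32> = (0..m.nrows())
        .into_par_iter()
        .map(|i| m.row(i).dot(x)) // each row's dot product runs on the thread pool
        .collect();
    Array1::from(out)
}
```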
The big issue now is that 32bit models use an insanely impractical amount of memory. I think I may have taken this as far as it can go, because stuff like quantization or using non-float types looks to be a huge undertaking.
Fantastic! Yeah, it doesn't look like ndarray supports f16 or lower types - I'm not sure what the best plan of attack is after that. As far as I can tell, none of the other libraries support <f32 inference.
Once #85 lands, we will have opened ggml up to extension on our side, so we can investigate adding the functionality needed for RWKV. I don't have any other suggestions until then 😅
Looks like someone's going for it https://github.com/saharNooby/rwkv.cpp
Oh, nice. I look forward to ripping off their ide... I mean collaborating in the spirit of open source.
This is probably poor etiquette but there's no ability to create issues in that repo:
@saharNooby Just in case it's useful, tagging you so you're aware of the efforts/discussion here. Also (in even more poor taste) I'll plug my own little Rust implementation: https://github.com/KerfuffleV2/smolrsrwkv
It's possible a C/C++ programmer could look at that and get something out of it (good chance you already know more than I do though).
Hi all! I've completely missed that Issues were not enabled for my repo, but I've enabled them now, thanks for pointing it out!
Looks like our work is closely related in that we're both extending ggml, but it diverges at the actual implementation of the model -- you do it in Rust, I do it in C++. I'll read the thread and code of both repositories and write a more proper response later.
For the context, here's what I know so far:
- `ggml` is missing element-wise `max(x, y)` and `exp(x)`, which are required for RWKV
- it is also missing `sigmoid(x)`, but it has `silu(x)`, so I think I can approximate it with `ggml_silu(x) / x` (see the note below)
- `layer_norm(x, w, b)` works great as `ggml_norm(x) * w + b`
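For reference (just standard definitions, nothing specific to ggml): the sigmoid trick works because silu(x) = x * sigmoid(x), so sigmoid(x) = silu(x) / x for x != 0 (x = 0 would need special handling, since sigmoid(0) = 0.5). Likewise, layer_norm(x, w, b) = norm(x) * w + b works because the mean/variance normalisation is what ggml_norm already does; the learned weight and bias are just an element-wise scale and shift on top.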
My next goal is to implement max and exp. I see that this PR would be a great showcase for me of how to add a new operator to ggml!
@saharNooby
Looks like our work is closely related by extending ggml, but diverges at actual implementation of the model -- you do it in Rust, I do it in C++.
What I linked is a separate project (the only relation is that llama-rs wants to implement RWKV, and I've also contributed a little bit of code to it). My project just uses a general math library to implement a simple version of RWKV by doing the calculations directly on matrices/vectors.
So basically it's just useful for understanding how RWKV fits together. From what you've said just now, it seems like you're already past the point where something like that would be helpful.
@KerfuffleV2
What I linked is a separate project
I see. I meant your work here in repo llama-rs :)
Anyway, you may be interested in the newly added exp, max, 1_minus_x and sigmoid operators for ggml: commit. For now, only the forward pass and FP32. I've also added a simple test suite that validates the results against PyTorch.
With new operators, I managed to get non-NaN logits, yay! Will check their correctness against RWKV reference implementation soon.
Interesting. Is there a reason to implement those elementwise operations all separately instead of adding a generic elementwise map operation?
The matrix multiplications matter so much with this, it's crazy. I'm thinking about just setting up a GGML graph for those operations only. The rest can be done in a naive/simple way and still seems to achieve 95% of the maximum performance.
Is there a reason to implement those elementwise operations all separately instead of adding a generic elementwise map operation?
I guess it was simpler for me to just add new operations, copy-pasting the code of existing operations, than to invent a new (for ggml) way of doing things... But I agree that a generic map operation would work too, if someone implemented it. Not sure how to do it in C/C++.
I can't really help you with the C++ part. Come over to the Rust side!
In seriousness though, you may well end up doing less work overall if you take that approach, because there's just one place to worry about changes. You can have the operation take a function pointer and then apply that function to each element.
Basically the same as what you're doing in ggml_compute_forward_exp_f32, except you'd call the user-supplied function instead of ggml_vec_element_wise_exp_f32. You'd also want one for applying a binary operation between two tensors, like ggml_compute_forward_max_f32 — the idea would be the same, though.
// Apply `fun` to each of the n elements of x, writing the results to y.
inline static void ggml_vec_mapv_f32(const int n, float * y, const float * x,
                                     float (*fun)(float)) {
    for (int i = 0; i < n; ++i) y[i] = fun(x[i]);
}

static void ggml_compute_forward_mapv_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    assert(params->ith == 0);
    assert(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    const int n  = ggml_nrows(src0);
    const int nc = src0->ne[0];

    assert( dst->nb[0] == sizeof(float));
    assert(src0->nb[0] == sizeof(float));

    // Placeholder: however the user-supplied function ends up being stored on
    // the tensor (see the note below), fetch it here.
    float (*fun)(float) = GET_FUNCTION_DEFINITION_SOMEHOW();

    // Apply the function row by row.
    for (int i = 0; i < n; i++) {
        ggml_vec_mapv_f32(nc,
            (float *) ((char *)  dst->data + i*( dst->nb[1])),
            (float *) ((char *) src0->data + i*(src0->nb[1])),
            fun);
    }
}
It looks like storing metadata like a function pointer could be pretty annoying. From looking at other functions, the approach that's been used is to put values in the opt field as a 1D tensor holding a single value. However, that only seems to support i32 as an integer type. The horrible hacky way to use that for a function pointer would be to store the top and bottom halves of the pointer in two of those 32-bit integers and reassemble them when you need the function pointer. It's annoying to have to do, but it shouldn't really make a difference for performance.
Note: Many years ago I was a C programmer but I never got into C++. Consider the above pseudocode.
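If it helps to picture it, the two-i32 hack would look roughly like this on the Rust side (hypothetical helpers, assuming pointers fit in 64 bits):

```rust
/// Stash a function pointer's address as two 32-bit halves so it fits in
/// ggml's i32-only opt tensors, then rebuild it when the op runs.
type MapFn = unsafe extern "C" fn(f32) -> f32;

fn split_fn_ptr(f: MapFn) -> (i32, i32) {
    let addr = f as usize as u64;
    ((addr >> 32) as i32, (addr & 0xffff_ffff) as i32)
}

/// Safety: `hi`/`lo` must have come from `split_fn_ptr` in the same process.
unsafe fn rebuild_fn_ptr(hi: i32, lo: i32) -> MapFn {
    let addr = ((hi as u32 as u64) << 32) | (lo as u32 as u64);
    std::mem::transmute::<usize, MapFn>(addr as usize)
}
```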
It might be getting annoying, me writing so many comments here, but:
I've been working on my Rust RWKV implementation and got 8bit quantization working. I also managed to split it into a library so it would be pretty easy to interface with: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/smolrwkv-cli/src/main.rs
Interestingly, it seems about as fast as the official Python version when running on CPU; however, it's basically still too slow to be practical. The 3B model runs about as fast as a 13B model on llama-rs or llama.cpp. According to profiling, as expected, it's just the matrix multiplication that's taking all the time. It could possibly reach acceptable performance if it could interface with a higher-performance matrix multiplication library.
I don't know if this is something that llama-rs would want to depend on even if performance wasn't a problem. It would require a completely separate non-GGML codepath to use.
@KerfuffleV2 Do I understand quantization algo correctly, that for each matrix row, you determine min and max value, and then represent each element as (uint8) ((e - min) / (max - min))?
@saharNooby
Uhhh... I basically cargo culted it from the official version so I don't know that I can give you a good answer here. See:
- https://github.com/BlinkDL/ChatRWKV/blob/0d0abf181356c6f27501274cad18bdf28c83a45b/rwkv_pip_package/src/rwkv/model.py#L237
- https://github.com/BlinkDL/ChatRWKV/blob/0d0abf181356c6f27501274cad18bdf28c83a45b/rwkv_pip_package/src/rwkv/model.py#L335
The second one is the CPU-based matrix multiplication function for when 8bit quantization is in effect. So those extra values are required at the point you do the MM.
Also worth pointing out is my implementation is based on the example code here: https://johanwind.github.io/2023/03/23/rwkv_details.html
You probably noticed, but only some of the tensors are (or can be) quantized. The smaller and 1D ones are left as 32bit.
I think there are some differences between that and the official version (although the 8bit quantization stuff still seemed to fit in — with some changes). For one thing, the official version transposes some of the tensors while mine doesn't. It also has an extra state item.
quick edit: The Rust 8bit MM function is here, by the way: https://github.com/KerfuffleV2/smolrsrwkv/blob/076d14882be2ca471796c555d3c967d8e4d2585d/smolrwkv/src/util.rs#L159
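If it helps, the gist of the scheme is per-row min/max affine quantisation to u8, roughly as you described — here's a minimal sketch (hypothetical helper; the real version also keeps extra per-row/per-column correction factors that get applied during the matrix multiplication):

```rust
use ndarray::{Array1, Array2, Axis};

/// Per-row affine u8 quantisation: each element is stored as
/// round((e - row_min) / row_scale), so dequantising is q * row_scale + row_min.
struct QuantizedMatrix {
    data: Array2<u8>,
    row_min: Array1<f32>,
    row_scale: Array1<f32>, // (max - min) / 255 per row
}

fn quantize_rows(m: &Array2<f32>) -> QuantizedMatrix {
    let mut data = Array2::<u8>::zeros(m.raw_dim());
    let mut row_min = Array1::<f32>::zeros(m.nrows());
    let mut row_scale = Array1::<f32>::zeros(m.nrows());
    for (i, row) in m.axis_iter(Axis(0)).enumerate() {
        let lo = row.fold(f32::INFINITY, |a, &b| a.min(b));
        let hi = row.fold(f32::NEG_INFINITY, |a, &b| a.max(b));
        let scale = (hi - lo).max(f32::EPSILON) / 255.0;
        row_min[i] = lo;
        row_scale[i] = scale;
        for (j, &e) in row.iter().enumerate() {
            data[(i, j)] = ((e - lo) / scale).round() as u8;
        }
    }
    QuantizedMatrix { data, row_min, row_scale }
}
```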
I've been messing around trying to allow GGML to map arbitrary operations: https://github.com/KerfuffleV2/llama-rs/blob/5fd882035e95501d4127e30c84a838afbffcc95e/ggml/src/lib.rs#L207
This what it looks like in use: https://github.com/KerfuffleV2/llama-rs/blob/5fd882035e95501d4127e30c84a838afbffcc95e/llama-rs/src/lib.rs#L1310
The first one is just replacing the ggml_add operation, the second one is a nop (it does have to copy the data to the destination though).
Those two ops should make it possible to support other models like RWKV where GGML doesn't currently have the required operations. I'm going to see if I can get this (or something similar) added to GGML, but in the meantime, or if that fails, we could potentially use a patched version of GGML that enables this.
The obvious downside is that it would make merging GGML changes more of a pain. (Also, I'm not really a C developer anymore, so while it appears to work there may be other issues with the current implementation.)
Interesting! Yes, we're open to having our own fork of GGML - we'll just have to manage the patches ourselves.
The primary reason I think they'd shy away from it is that a native implementation would almost always be faster (especially with vectorised operations); that being said, it may be sufficient for operations that aren't on the hot path.
I'm not sure if it'd be better to go with @saharNooby's custom implemented operations or to use your solution (or both). How much of RWKV's unsupported operations are on the hot path?