fairydreaming
@Vedapani0402 You can set the number of threads in the llama_context_params structure - there are n_threads and n_threads_batch fields for this. In the example this will be in the cparams variable before...
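For illustration, a minimal sketch using the llama-cpp-python low-level bindings (not the code from the thread; the field names come from llama_context_params in llama.h, and the exact context-creation call depends on your llama.cpp version):
```
import llama_cpp

# default context parameters, then override the thread counts
ctx_params = llama_cpp.llama_context_default_params()
ctx_params.n_threads = 8        # threads used during single-token generation
ctx_params.n_threads_batch = 8  # threads used during prompt / batch processing

# ctx = llama_cpp.llama_new_context_with_model(model, ctx_params)
# (name of the creation function may differ between versions)
```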
@Vedapani0402 When I implemented T5 support in llama.cpp I tested inference of multiple sequences at once and it worked. The ticket you mentioned is for the high-level API. If you use...
@Vedapani0402 Here's a C++ example that I used to test batched inference. I updated it to the current llama.cpp API:
```
#include "arg.h"
#include "common.h"
#include "log.h"
#include "llama.h"
#include...
```
@Vedapani0402 I think you should move:
```
llama_cpp.llama_batch_free(batch)
llama_cpp.llama_kv_cache_clear(ctx)
```
from `test_model_gguf()` to the end of your `batch_inference()`.
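Roughly like this (a sketch only; `batch_inference()` and `test_model_gguf()` are your own functions, and `run_decoding` below is just a hypothetical placeholder for whatever your loop already does):
```
import llama_cpp

def batch_inference(ctx, batch, prompts):
    # ... your existing code: fill the batch, call llama_cpp.llama_decode(ctx, batch),
    # sample tokens and collect the generated text ...
    outputs = run_decoding(ctx, batch, prompts)  # hypothetical placeholder

    # free the batch and clear the KV cache at the end of each call,
    # instead of doing it once in test_model_gguf()
    llama_cpp.llama_batch_free(batch)
    llama_cpp.llama_kv_cache_clear(ctx)
    return outputs
```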
@Vedapani0402 I really don't know why; the output looks good for me when I run your code:
```
load_model("/mnt/md0/models/t5-base.gguf")
texts=[
    "translate English to German: The house is wonderful.",
    "translate English...
```
@Vedapani0402 I added one more batch and now the problem appeared.
@Vedapani0402 I found the cause; it's actually my fault. 😱 The fix for this problem is here: https://github.com/ggml-org/llama.cpp/pull/12470
@Vedapani0402 There is no Python counterpart for this; you simply have to rebuild llama-cpp-python with the current llama.cpp source code that contains the fix.
Unfortunately the finetune util (later renamed to llama-finetune) was removed from the project several months ago in PR #8669. What you can do is use an older release that...