fairydreaming
@kyteinsky lol, didn't notice that all that time, sorry
@Sadeghi85 One thing that is missing in your code is the preparation of the input to the `decode()` call. Check how it's done in the llama-cli source code: https://github.com/ggerganov/llama.cpp/blob/b841d0740855c5af1344a81f261139a45a2b39ee/examples/main/main.cpp#L536-L552 So before calling...
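In rough terms, the missing step looks something like this (a C++ sketch against the 2024-era llama.cpp C API from the commit linked above; signatures have changed since, so treat it as an outline rather than drop-in code):

```cpp
#include "llama.h"

#include <string>
#include <vector>

// Tokenize a prompt into llama tokens (sketch; llama_tokenize signature
// as in the 2024-era API, where it still took a llama_model pointer).
static std::vector<llama_token> tokenize(llama_model * model, const std::string & text) {
    // upper bound: one token per byte plus room for special tokens
    std::vector<llama_token> tokens(text.size() + 8);
    const int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                                 tokens.data(), (int) tokens.size(),
                                 /*add_special=*/true, /*parse_special=*/false);
    tokens.resize(n);
    return tokens;
}

void run_prompt(llama_model * model, llama_context * ctx, const std::string & prompt) {
    std::vector<llama_token> tokens = tokenize(model, prompt);
    // pos 0, seq_id 0: the whole prompt goes in as a single sequence
    llama_batch batch = llama_batch_get_one(tokens.data(), (int) tokens.size(), 0, 0);
    if (llama_decode(ctx, batch) != 0) {
        // handle the error
    }
}
```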
I see there are serious problems with using T5, so I added a branch with a high-level example of inference with a T5 model: https://github.com/fairydreaming/llama-cpp-python/tree/t5 There is also a second branch...
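For context, T5 inference in llama.cpp is a two-stage process, which is roughly what those examples do: run the encoder once over the input, then decode autoregressively starting from the model's decoder start token. A simplified sketch, reusing the `tokenize` helper from the sketch above (greedy sampling, no error handling, same API era):

```cpp
#include <algorithm>

void run_t5(llama_model * model, llama_context * ctx, const std::string & prompt) {
    std::vector<llama_token> inp = tokenize(model, prompt);

    // 1. run the encoder over the full input sequence
    llama_batch batch = llama_batch_get_one(inp.data(), (int) inp.size(), 0, 0);
    llama_encode(ctx, batch);

    // 2. decode token by token, starting from the decoder start token
    llama_token tok = llama_model_decoder_start_token(model);
    for (int pos = 0; pos < 512; pos++) { // arbitrary output length cap
        batch = llama_batch_get_one(&tok, 1, pos, 0);
        llama_decode(ctx, batch);

        // greedy sampling: pick the highest-logit token
        const float * logits = llama_get_logits(ctx);
        tok = (llama_token) (std::max_element(logits, logits + llama_n_vocab(model)) - logits);
        if (tok == llama_token_eos(model)) {
            break;
        }
        // ... detokenize and print tok ...
    }
}
```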
@yugaljain1999 Yes, you can pass multiple prompts. I don't know how it works in the llama-cpp-python high-level API, but in llama.cpp (low-level API) you do it by creating a batch containing...
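Roughly, such a batch can be built like this (a sketch using the llama.cpp C API's `llama_batch` fields and the `tokenize` helper from the earlier sketch; two hard-coded prompts, no error handling):

```cpp
void decode_two_prompts(llama_model * model, llama_context * ctx) {
    std::vector<std::vector<llama_token>> prompts = {
        tokenize(model, "first prompt"),
        tokenize(model, "second prompt"),
    };

    int n_total = 0;
    for (const auto & p : prompts) n_total += (int) p.size();

    // one slot per token, no embeddings, at most one seq id per token
    llama_batch batch = llama_batch_init(n_total, 0, 1);
    for (size_t s = 0; s < prompts.size(); s++) {
        for (size_t i = 0; i < prompts[s].size(); i++) {
            const int j = batch.n_tokens++;
            batch.token[j]     = prompts[s][i];
            batch.pos[j]       = (llama_pos) i;    // position within its own sequence
            batch.n_seq_id[j]  = 1;
            batch.seq_id[j][0] = (llama_seq_id) s; // sequence id = which prompt
            batch.logits[j]    = i == prompts[s].size() - 1; // logits for last token only
        }
    }

    llama_decode(ctx, batch); // both prompts are processed in a single call
    llama_batch_free(batch);
}
```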
@yugaljain1999 T5 models are still not supported in llama-server.
@Vedapani0402 Some time ago I created these two llama-cpp-python branches with working high-level and low-level T5 examples: https://github.com/fairydreaming/llama-cpp-python/tree/t5 https://github.com/fairydreaming/llama-cpp-python/tree/fix-low-level-examples If you follow them, everything should work just fine. Note that...
@Vedapani0402 You are right, in the meantime the llama.cpp API changed so much that it was impossible for this to still work correctly. I updated the low-level examples in the [fix-low-level-examples](https://github.com/fairydreaming/llama-cpp-python/tree/fix-low-level-examples) branch to...
> Hi [@fairydreaming](https://github.com/fairydreaming), thank you for updating the code. > > But there are two issues I am currently facing: > > 1. The updated code is working as expected...
@Vedapani0402 As far as I remember, I based the code on an existing llama.cpp example, and I have no idea if it's ready for handling multiple inputs. But maybe instead of freeing/recreating...
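One way to read that suggestion: keep a single context and clear its KV cache between inputs instead of recreating it. A minimal sketch, assuming `llama_kv_cache_clear` from the llama.cpp versions of that period (newer versions renamed this function) and the `run_t5` helper from the earlier sketch:

```cpp
void run_many(llama_model * model, llama_context * ctx,
              const std::vector<std::string> & inputs) {
    for (const std::string & prompt : inputs) {
        llama_kv_cache_clear(ctx);  // drop all state left over from the previous input
        run_t5(model, ctx, prompt); // encode + decode as in the earlier sketch
    }
}
```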
@Vedapani0402 Sorry, I have no experience with running small quantized models (I usually run them at f16), so I can't help there.