Inference is slow on both GPU and CPU, plus some general questions related to that.
I started out by experimenting a bit with CTransformers. The device I have been using is:
- ASUS laptop
- 16 GB RAM
- 6 GB NVIDIA RTX 3060
I tried to load the MPT-7B Chat GGML file, and this was the code:
from time import time
from ctransformers import AutoModelForCausalLM

# Local path to the downloaded MPT-7B Chat GGML file
path = "/home/anindyadeep/.cache/huggingface/hub/models--TheBloke--mpt-7b-chat-GGML/snapshots/c625e25385a2af9a8e93c77d069d78f7c5105687/mpt-7b-chat.ggmlv0.q8_0.bin"

# Load the model, asking for 50 layers to be offloaded to the GPU
llm = AutoModelForCausalLM.from_pretrained(path, model_type='mpt', gpu_layers=50)

start_time = time()
print("started ...")
output = llm('Write neural network code example in c++')
end_time = time()

print("Output:", output)
print("Total time taken:", end_time - start_time, "seconds")
First of all, I am not sure whether it is using the full GPU capacity. The time taken was around 70 seconds, and the generation quality was also not very good. Here is an example of the model response:
====================== Model response ======================
Learn c++
Write neural network code example in c++
To write a simple neural network program, you will need to define the following:
1. The input layer size (number of inputs)
2. The hidden layer size(s)
3. The output layer size (number of outputs)
4. The activation function for each neuron in the hidden and output layers
5. The training data set with corresponding target values, including labels if applicable
6. A stop criterion to determine when training is complete
Here's an example code that implements a simple neural network using backpropagation algorithm:
#include <iostream>
#include <vector>
using namespace std;
// Define the input layer size and activation function
const int INPUT_SIZE = 10; // number of inputs
const string ACTIVATION_FUNC("sigmoid");
// Define the hidden layers and output layer sizes, activation functions and stop criterion
const int HIDDEN1_SIZE = 5; const double LEARNING_RATE = 0.01;
double BETA = 0.5; // regularization parameter for L2 loss function
====================== Model response end ======================
The model response is not perfect here, but even setting that aside, the time required for every generation is in the range of 60-80 seconds. So I have the following questions:
- I have installed CTransformers with GPU support as mentioned in the instructions.
- Now how can I be sure whether the model is actually using the GPU through CTransformers or not?
- It is mentioned that CTransformers supports Llama and Falcon on GPU. I thought it might throw an error for MPT, but it did not. So is MPT supported on GPU now?
- What does gpu_layers exactly mean? I have given values like 1, 3, and 50 but am not seeing much of a difference.
- Is token streaming supported for this?
- MPT models don't have GPU support. It doesn't throw an error and simply runs on the CPU. Maybe I should print a warning when someone tries to use gpu_layers for non-llama/falcon models.
- gpu_layers means the number of layers to run on the GPU. Depending on how much GPU memory is available, you can increase gpu_layers. Start with a larger value like gpu_layers=100 and if it runs out of memory, try smaller values.
- Yes, you can pass stream=True:
for text in llm(prompt, stream=True):
    print(text, end="", flush=True)
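Putting those two answers together, here is a minimal sketch of how the loading and streaming calls combine. The model path, model_type, and prompt below are placeholders (not the MPT file from the original question), chosen as an example of a model type with GPU support:

from time import time
from ctransformers import AutoModelForCausalLM

# Placeholder path to a local GGML file for a GPU-supported model type (e.g. llama)
path = "path/to/llama-2-7b-chat.ggmlv3.q8_0.bin"

# Start with a large gpu_layers value and lower it if the GPU runs out of memory
llm = AutoModelForCausalLM.from_pretrained(path, model_type="llama", gpu_layers=100)

start_time = time()
# stream=True yields tokens as they are generated instead of returning the full text at the end
for text in llm("Write a neural network code example in C++", stream=True):
    print(text, end="", flush=True)
print()
print("Total time taken:", time() - start_time, "seconds")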
That is awesome. Also, @marella, I have one last question, after which we can close this issue: can we convert a Hugging Face model to GGML through CTransformers? What I mean is, suppose I took a Llama 2 model and fine-tuned it using PEFT, so now I have the PEFT weights attached to my Hugging Face model. If I now want to convert it to GGML, can I do that through the CTransformers interface?
Thanks
Taking the liberty of linking this since your question is in two different issues: yes, cf https://github.com/marella/ctransformers/issues/67#issuecomment-1665217227
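For anyone landing here later, a rough sketch of one commonly used route for this (not necessarily the exact steps from the linked comment): merge the PEFT adapter into the base model first, then convert the merged checkpoint with llama.cpp's conversion script. The model names, paths, and converter invocation below are assumptions:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and attach the fine-tuned PEFT/LoRA adapter (paths are placeholders)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/peft-adapter")

# Merge the adapter weights into the base weights and save a plain Hugging Face checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("llama-2-7b-merged")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("llama-2-7b-merged")

# The merged checkpoint can then be converted to GGML with llama.cpp's convert script,
# e.g. something like: python convert.py llama-2-7b-merged --outtype q8_0
# (the script name and flags depend on the llama.cpp version)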
Yeah, thanks for clarifying; we can close this issue. Thanks!
Recently CUDA support (experimental) was added for MPT models. Please update, try it out and let me know if the performance has improved.
Sure, let me check that out
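A simple way to test that would be to rerun the earlier benchmark after upgrading to a CUDA-enabled build of CTransformers. The install extra and the file path below are assumptions; check the README for the current instructions:

# Reinstall/upgrade with CUDA support first, e.g. something like:
#   pip install --upgrade 'ctransformers[cuda]'
from time import time
from ctransformers import AutoModelForCausalLM

path = "path/to/mpt-7b-chat.ggmlv0.q8_0.bin"  # placeholder for the local GGML file

# With the experimental CUDA support for MPT, gpu_layers should now offload layers to the GPU
llm = AutoModelForCausalLM.from_pretrained(path, model_type="mpt", gpu_layers=50)

start_time = time()
output = llm("Write neural network code example in c++")
print("Output:", output)
print("Total time taken:", time() - start_time, "seconds")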
@marella is there any way to find out how many gpu layers a specific GPU can handle?