Inference is slow on both GPU and CPU, plus some general questions related to that.
I started out by experimenting a bit with CTransformers. The device I have been using is:
- ASUS laptop
- 16 GB RAM
- 6 GB NVIDIA RTX 3060
I tried to load the MPT-7B Chat GGML file, and this was the code:
from time import time
from ctransformers import AutoModelForCausalLM

# Local path to the downloaded MPT-7B Chat GGML file
path = "/home/anindyadeep/.cache/huggingface/hub/models--TheBloke--mpt-7b-chat-GGML/snapshots/c625e25385a2af9a8e93c77d069d78f7c5105687/mpt-7b-chat.ggmlv0.q8_0.bin"

# Load the model, asking for 50 layers to be offloaded to the GPU
llm = AutoModelForCausalLM.from_pretrained(path, model_type='mpt', gpu_layers=50)

start_time = time()
print("started ...")
output = llm('Write neural network code example in c++')
end_time = time()

print("Output:", output)
print("Total time taken:", end_time - start_time, "seconds")
First of all, I am not sure whether it is using the full GPU capacity. The time taken was around 70 seconds, and the generation quality was also not very good. Here is an example of the model response:
====================== Model response ======================
Learn c++
Write neural network code example in c++
To write a simple neural network program, you will need to define the following:
1. The input layer size (number of inputs)
2. The hidden layer size(s)
3. The output layer size (number of outputs)
4. The activation function for each neuron in the hidden and output layers
5. The training data set with corresponding target values, including labels if applicable
6. A stop criterion to determine when training is complete
Here's an example code that implements a simple neural network using backpropagation algorithm:
#include <iostream>
#include <vector>
using namespace std;
// Define the input layer size and activation function
const int INPUT_SIZE = 10; // number of inputs
const string ACTIVATION_FUNC("sigmoid");
// Define the hidden layers and output layer sizes, activation functions and stop criterion
const int HIDDEN1_SIZE = 5; const double LEARNING_RATE = 0.01;
double BETA = 0.5; // regularization parameter for L2 loss function
====================== Model response end ======================
The model response is not perfect here, but even setting that aside, the time required for every generation is in the range of 60-80 seconds. So I have the following questions:
- I have installed CTransformers with GPU support as mentioned in the instructions.
- Now how can I be sure whether the model is actually using the GPU through CTransformers or not?
- It is mentioned that CTransformers supports Llama and Falcon on GPU. I thought it might throw an error for MPT, but it did not. So is MPT supported on GPU now?
- What does gpu_layers exactly mean? I have given values like 1, 3, and 50 but am not seeing much of a difference.
- Is token streaming supported for this?
- MPT models don't have GPU support. It doesn't throw an error and simply runs on the CPU. Maybe I should print a warning when someone tries to use gpu_layers for non-llama/falcon models.
- gpu_layers means the number of layers to run on the GPU. Depending on how much GPU memory is available, you can increase gpu_layers. Start with a larger value like gpu_layers=100 and if it runs out of memory, try smaller values.
- Yes, you can pass stream=True:
for text in llm(prompt, stream=True):
    print(text, end="", flush=True)
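Putting those two answers together, here is a minimal sketch of how the loading and streaming calls combine. The model path, model_type, and prompt below are placeholders (not the MPT file from the original question), chosen as an example of a model type with GPU support:

from time import time
from ctransformers import AutoModelForCausalLM

# Placeholder path to a local GGML file for a GPU-supported model type (e.g. llama)
path = "path/to/llama-2-7b-chat.ggmlv3.q8_0.bin"

# Start with a large gpu_layers value and lower it if the GPU runs out of memory
llm = AutoModelForCausalLM.from_pretrained(path, model_type="llama", gpu_layers=100)

start_time = time()
# stream=True yields tokens as they are generated instead of returning the full text at the end
for text in llm("Write a neural network code example in C++", stream=True):
    print(text, end="", flush=True)
print()
print("Total time taken:", time() - start_time, "seconds")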
That is awesome. Also, @marella, I have one last question, after which we can close this issue: can we convert a Hugging Face model to GGML through CTransformers? What I mean is, suppose I took a Llama 2 model and fine-tuned it using PEFT, so now I have the PEFT weights attached to my Hugging Face model. If I now want to convert it to GGML, can I do that through the CTransformers interface?
Thanks
Taking the liberty of linking this since your question is in two different issues: yes, cf https://github.com/marella/ctransformers/issues/67#issuecomment-1665217227
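For anyone landing here later, a rough sketch of one commonly used route for this (not necessarily the exact steps from the linked comment): merge the PEFT adapter into the base model first, then convert the merged checkpoint with llama.cpp's conversion script. The model names, paths, and converter invocation below are assumptions:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and attach the fine-tuned PEFT/LoRA adapter (paths are placeholders)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/peft-adapter")

# Merge the adapter weights into the base weights and save a plain Hugging Face checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("llama-2-7b-merged")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("llama-2-7b-merged")

# The merged checkpoint can then be converted to GGML with llama.cpp's convert script,
# e.g. something like: python convert.py llama-2-7b-merged --outtype q8_0
# (the script name and flags depend on the llama.cpp version)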
Yeah, thanks for clarifying; we can close this issue. Thanks!
Recently CUDA support (experimental) was added for MPT models. Please update, try it out and let me know if the performance has improved.
Sure, let me check that out
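A simple way to test that would be to rerun the earlier benchmark after upgrading to a CUDA-enabled build of CTransformers. The install extra and the file path below are assumptions; check the README for the current instructions:

# Reinstall/upgrade with CUDA support first, e.g. something like:
#   pip install --upgrade 'ctransformers[cuda]'
from time import time
from ctransformers import AutoModelForCausalLM

path = "path/to/mpt-7b-chat.ggmlv0.q8_0.bin"  # placeholder for the local GGML file

# With the experimental CUDA support for MPT, gpu_layers should now offload layers to the GPU
llm = AutoModelForCausalLM.from_pretrained(path, model_type="mpt", gpu_layers=50)

start_time = time()
output = llm("Write neural network code example in c++")
print("Output:", output)
print("Total time taken:", time() - start_time, "seconds")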
@marella is there any way to find out how many gpu layers a specific GPU can handle?