How can I compute the logits for all tokens of the input sequence in parallel?
This is the code I run:
import torch
from ctransformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("TheBloke/LLaMa-7B-GGML", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
tokens = torch.tensor(tokenizer.encode("Hello world! What's up?"))
output = model(tokens[None,:])
I got output[0].shape = torch.Size([1, 32000])
I was expecting to get output.logits.shape = torch.Size([1, 7, 32000])
Which is what I get when I run the model through the Hugging Face transformers library.
I don't want exactly the same behavior as Hugging Face; I just want to get the logits for all the tokens at once, instead of only the logits for the next token of the sequence. Is there any way to do that with this library?
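For reference, this is roughly the transformers behavior I'm comparing against (a minimal sketch; gpt2 stands in for the actual model, the exact checkpoint doesn't matter here):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = hf_tokenizer("Hello world! What's up?", return_tensors="pt").input_ids
with torch.no_grad():
    output = hf_model(input_ids)
print(output.logits.shape)  # torch.Size([1, seq_len, vocab_size]): one logit vector per position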
There is at least one hacky way:
import torch
from transformers import AutoTokenizer
from ctransformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "a b c d" # exactly 4 tokens
input_ids = tokenizer.encode(text, return_tensors="pt")[0]
logits = []
for i in range(1, 1 + len(input_ids)):
    # Feed a growing prefix; each call only gives the logits for the last token of that prefix.
    tokens = input_ids[:i]
    logits.append(model(tokens[None, :], return_dict=True).logits)
# Concatenate along the sequence dimension to recover per-position logits.
logits = torch.cat(logits, dim=1)
assert logits.shape == (1, len(input_ids), tokenizer.vocab_size)
It's super slow because it re-computes the whole prefix for every call. Maybe there's a way to avoid that? In llama-cpp-python you'd pass logits_all=True when instantiating the model, and you control the cache via model.eval() and model.n_tokens.
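For example, a rough sketch of that llama-cpp-python approach (the GGUF path is just a placeholder):
from llama_cpp import Llama
llm = Llama(model_path="path/to/llama-7b.gguf", logits_all=True)  # placeholder path
tokens = llm.tokenize(b"Hello world! What's up?")
llm.eval(tokens)  # fills the internal cache; llm.n_tokens tracks how many tokens were evaluated
logits = llm.scores[: llm.n_tokens]  # one row of logits per input token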
Noted with thanks, @kddubey.