How can I compute the logits for all tokens of the input sequence in parallel?
This is the code I run:
import torch
from ctransformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("TheBloke/LLaMa-7B-GGML", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
tokens = torch.tensor(tokenizer.encode("Hello world! What's up?"))
output = model(tokens[None,:])
I got output[0].shape = torch.Size([1, 32000])
I was expecting to get output.logits.shape = torch.Size([1, 7, 32000])
Which is what I get when I run the model through the Hugging Face transformers library.
I don't want exactly the same behavior as Hugging Face; I just want to get the logits for all the tokens at once, instead of only the logits for the next token of the sequence. Is there any way to do that with this library?
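For reference, this is roughly the transformers behavior I'm comparing against (a minimal sketch; gpt2 stands in for the actual model, the exact checkpoint doesn't matter here):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = hf_tokenizer("Hello world! What's up?", return_tensors="pt").input_ids
with torch.no_grad():
    output = hf_model(input_ids)
print(output.logits.shape)  # torch.Size([1, seq_len, vocab_size]): one logit vector per position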
There is at least one hacky way:
import torch
from transformers import AutoTokenizer
from ctransformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "a b c d" # exactly 4 tokens
input_ids = tokenizer.encode(text, return_tensors="pt")[0]
logits = []
for i in range(1, 1 + len(input_ids)):
    # Feed a growing prefix; each call only gives the logits for the last token of that prefix.
    tokens = input_ids[:i]
    logits.append(model(tokens[None, :], return_dict=True).logits)
# Concatenate along the sequence dimension to recover per-position logits.
logits = torch.cat(logits, dim=1)
assert logits.shape == (1, len(input_ids), tokenizer.vocab_size)
It's super slow because it re-computes the whole prefix for every call. Maybe there's a way to avoid that? In llama-cpp-python you'd pass logits_all=True when instantiating the model, and you control the cache via model.eval() and model.n_tokens.
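For example, a rough sketch of that llama-cpp-python approach (the GGUF path is just a placeholder):
from llama_cpp import Llama
llm = Llama(model_path="path/to/llama-7b.gguf", logits_all=True)  # placeholder path
tokens = llm.tokenize(b"Hello world! What's up?")
llm.eval(tokens)  # fills the internal cache; llm.n_tokens tracks how many tokens were evaluated
logits = llm.scores[: llm.n_tokens]  # one row of logits per input token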
Noted with thanks, @kddubey.