Using `inputs_embeds` for generation creates different results (and gives a warning)
I'm trying to use the `inputs_embeds` parameter to run the LLaMA model. This is part of my code:
# INPUT = ...embedding of a sequence, ensuring that there are no pad tokens
output_sequences = LLaMA.generate(
    inputs_embeds=INPUT.to(device),
    pad_token_id=tokenizer.pad_token_id,
    # ... generation parameters, top_p, top_k, etc.
)
I keep getting this warning, and the results are complete gibberish. I know this exact model performs well if I pass input_ids.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer.
After a lot of debugging, I found that this issue is caused by the transformers library itself. The generate function checks whether the last token ID in each batch item is the pad token ID; if it is, it displays this warning.
https://github.com/huggingface/transformers/blob/a0e733283930bdb9ae2b1afdc53ec5f2daefb033/src/transformers/generation/utils.py#L1308-L1315
The generate function expects the shape (Batch, Sequence), for which this logic works:
inputs_tensor[:, -1] == generation_config.pad_token_id
Now the problem is that I am passing inputs_embeds, not IDs. My shape is (Batch, Sequence, EmbeddingSize), so the comparison above runs elementwise over the embedding of the last token and is True wherever that embedding contains a zero. This is obviously incorrect.
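A dimension-aware guard along these lines would avoid the false positive (a hypothetical sketch, not the actual library code; `should_warn_right_padding` is a name I made up for illustration):

```python
import torch

def should_warn_right_padding(inputs_tensor: torch.Tensor, pad_token_id: int) -> bool:
    # The right-padding check only makes sense for 2D (Batch, Sequence)
    # tensors of token IDs. For 3D (Batch, Sequence, EmbeddingSize)
    # tensors of embeddings, comparing float values against a token ID
    # is meaningless, so skip the check entirely.
    if inputs_tensor.dim() != 2:
        return False
    return bool(torch.sum(inputs_tensor[:, -1] == pad_token_id) > 0)

ids = torch.tensor([[5, 6, 0]])  # (Batch, Sequence); last token is pad id 0
embeds = torch.zeros(1, 3, 8)    # (Batch, Sequence, EmbeddingSize)
print(should_warn_right_padding(ids, 0))     # True: genuine right padding
print(should_warn_right_padding(embeds, 0))  # False: embeddings, check skipped
```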
That explains the warning but not the incorrect generation.
Environment
- transformers==4.28.0
- Python 3.10.11
cc @gante
Hey @zrthxn 👋 Splitting my reply into two parts: the warning and the generation from input embeds.
Warning: agreed, it should check e.g. whether the input tensor has 3 or more dims (and not emit the warning in that case). Would you like to open a PR to fix it? :) (I think the same issue is present in TF and FLAX as well)
Generation: I've double-checked generation with input embeddings, and everything seems fine. Have a look at the example below
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
text = "Hello world"
input_ids = tokenizer.encode(text, return_tensors="pt")
# Traditional way of generating text
outputs = model.generate(input_ids)
print("\ngenerate + input_ids:", tokenizer.decode(outputs[0], skip_special_tokens=True))
# From inputs_embeds -- exact same output if you also pass `input_ids`. If you don't
# pass `input_ids`, you will get the same generated content but without the prompt
inputs_embeds = model.model.embed_tokens(input_ids)
outputs = model.generate(input_ids, inputs_embeds=inputs_embeds)
print("\ngenerate + inputs_embeds:", tokenizer.decode(outputs[0], skip_special_tokens=True))
@gante I confirmed once again and found that the input_embeds works. The problem was something I was doing with my embeddings. And yes, I'll create a PR for the warning.
I've tested your example @gante and everything works fine. However, when I switch the model to lmsys/vicuna-13b-v1.3, I get an error. Do you know what the difference is? I assume both models share the same implementation in transformers.models.llama.modeling_llama.LlamaForCausalLM.
My code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.3",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-v1.3")
text = "Hello world"
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)
inputs_embeds = model.model.embed_tokens(input_ids)
outputs = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=10)
print(
    "\ngenerate + inputs_embeds:",
    tokenizer.decode(outputs[0], skip_special_tokens=True),
)
Stack trace
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[3], line 5
2 input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)
4 inputs_embeds = model.model.embed_tokens(input_ids)
----> 5 outputs = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=10)
6 print("\ngenerate + inputs_embeds:", tokenizer.decode(outputs[0], skip_special_tokens=True))
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/generation/utils.py:1522, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
1516 raise ValueError(
1517 "num_return_sequences has to be 1 when doing greedy search, "
1518 f"but is {generation_config.num_return_sequences}."
1519 )
1521 # 11. run greedy search
-> 1522 return self.greedy_search(
1523 input_ids,
1524 logits_processor=logits_processor,
1525 stopping_criteria=stopping_criteria,
1526 pad_token_id=generation_config.pad_token_id,
1527 eos_token_id=generation_config.eos_token_id,
1528 output_scores=generation_config.output_scores,
1529 return_dict_in_generate=generation_config.return_dict_in_generate,
1530 synced_gpus=synced_gpus,
1531 streamer=streamer,
1532 **model_kwargs,
1533 )
1535 elif is_contrastive_search_gen_mode:
1536 if generation_config.num_return_sequences > 1:
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/generation/utils.py:2339, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2336 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2338 # forward pass to get next token
-> 2339 outputs = self(
2340 **model_inputs,
2341 return_dict=True,
2342 output_attentions=output_attentions,
2343 output_hidden_states=output_hidden_states,
2344 )
2346 if synced_gpus and this_peer_finished:
2347 continue # don't waste resources running the code we don't need
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:688, in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
685 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
687 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 688 outputs = self.model(
689 input_ids=input_ids,
690 attention_mask=attention_mask,
691 position_ids=position_ids,
692 past_key_values=past_key_values,
693 inputs_embeds=inputs_embeds,
694 use_cache=use_cache,
695 output_attentions=output_attentions,
696 output_hidden_states=output_hidden_states,
697 return_dict=return_dict,
698 )
700 hidden_states = outputs[0]
701 logits = self.lm_head(hidden_states)
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:528, in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
526 position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
527 else:
--> 528 position_ids = position_ids.view(-1, seq_length).long()
530 if inputs_embeds is None:
531 inputs_embeds = self.embed_tokens(input_ids)
RuntimeError: shape '[-1, 3]' is invalid for input of size 4
@NorbertRop The issue is fixed in #24639 (see the PR if you're curious about why it was breaking :) )
@NorbertRop should be fixed if you install from main
I also encountered this issue using version 4.39.0:
Exception in thread Thread-6:
Traceback (most recent call last):
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/transformers/generation/utils.py", line 1383, in generate
and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0
RuntimeError:
The code I ran is
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread
torch_device = "cuda:0"
tok = AutoTokenizer.from_pretrained("NeuralExperiment-7b-MagicCoder-v7.5")
model = AutoModelForCausalLM.from_pretrained("NeuralExperiment-7b-MagicCoder-v7.5").to(torch_device)
inputs = tok(["Edit the following Python code to write a heap sort function"], return_tensors="pt").to(torch_device)
streamer = TextIteratorStreamer(tok)
# Run the generation in a separate thread, so that we can fetch the generated text in a non-blocking way.
generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=512)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
generated_text = ""
for new_text in streamer:
generated_text += new_text
print(generated_text)
I am using an NPU; I tested that the same code runs successfully on a GPU.
@AnitaSherry 👋 NPUs are trickier than GPUs, so I will need a few more pointers!
- What's the full RuntimeError?
- Do you get the same error before v4.39?
@gante ok, the full RuntimeError is
Loading checkpoint shards: 100%|██████████| 3/3 [00:09<00:00, 3.03s/it]
user:Edit the following Python code to write a heap sort function
Exception in thread Thread-6:
Traceback (most recent call last):
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/transformers/generation/utils.py", line 1383, in generate
and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0
RuntimeError: getDevice:torch_npu/csrc/aten/common/CopyKernel.cpp:41 NPU error, error code is 107002
[Error]: The context is empty.
Check whether acl.rt.set_context or acl.rt.set_device is called.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4686]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Then I commented out lines 1377-1388 of the file
/root/anaconda3/envs/sakura/lib/python3.9/site-packages/transformers/generation/utils.py
The error displayed then becomes
user:Edit the following Python code to write a heap sort function
Exception in thread Thread-6:
Traceback (most recent call last):
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/transformers/generation/utils.py", line 1411, in generate
streamer.put(input_ids.cpu())
RuntimeError: getDevice:torch_npu/csrc/aten/common/CopyKernel.cpp:41 NPU error, error code is 107002
[Error]: The context is empty.
Check whether acl.rt.set_context or acl.rt.set_device is called.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4686]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
The code I am running is
import torch
import torch_npu
import torchvision
import torchvision_npu
from torch_npu.contrib import transfer_to_npu
torch_device = "npu:3" # 0~7
torch.npu.set_device(torch.device(torch_device))
torch.npu.set_compile_mode(jit_compile=False)
option = {}
option["NPU_FUZZY_COMPILE_BLACKLIST"] = "Tril"
torch.npu.set_option(option)
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from transformers import TextStreamer,TextIteratorStreamer
import sys
from threading import Thread
model_id = "/my/model/path/gemma-2b-coder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(torch_device)
def generate(
    instruction,
    max_new_tokens=512,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=2,
    **kwargs,
):
    torch.npu.set_device(torch.device(torch_device))
    system = f"<bos><|system|>\nYou are a helpful coding assistant.<eos>\n"
    prompt = f"{system}<|user|>\n{instruction}<eos>\n<|assistant|>\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(torch_device)
    attention_mask = inputs["attention_mask"].to(torch_device)
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
            return_dict_in_generate=True,
            max_new_tokens=max_new_tokens,
            early_stopping=True,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s, skip_special_tokens=True)
    return output.split("<|assistant|>")[1]

def streamer_generate(
    instruction,
    max_new_tokens=512,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=1,
    **kwargs,
):
    # system = f"<bos><|system|>\nYou are a helpful coding assistant.<eos>\n"
    # prompt = f"{system}<|user|>\n{instruction}<eos>\n<|assistant|>\n"
    prompt = instruction
    inputs = tokenizer([prompt], return_tensors="pt").to(torch_device)
    streamer = TextIteratorStreamer(tokenizer)
    generation_kwargs = dict(
        inputs,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        **kwargs,
    )
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    generated_text, position = "", 0
    for new_text in streamer:
        generated_text += new_text
        print(generated_text[position:], end='', flush=True)
        position = len(generated_text)

if __name__ == "__main__":
    while True:
        user_input = input("user:")
        if not user_input:
            continue
        if user_input == "exit":
            print("Task is over.")
            sys.exit()
        streamer_generate(user_input)
generate() runs successfully.
I will test other versions of the transformers package right away.
@AnitaSherry that seems to be an NPU-related issue with our code. Sadly I am not familiar with NPUs, so I'm not sure how I can help :)
@zrthxn Hello, do you know how to do batch generation with Llama 2 using inputs_embeds?
@gnehcoul To the best of my knowledge, you can batch your sequences of embeddings and it should just work. The shape of the input should be (Batch, Sequence, EmbeddingSize), so batch should be the first dimension.
You will also need to pad the sequences to the same length to form batches.
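To make the padding concrete, here is a minimal sketch (hypothetical shapes, no real model loaded) of left-padding two embedded sequences to a common length and building the matching attention_mask, which you would then pass to model.generate(inputs_embeds=..., attention_mask=...):

```python
import torch

hidden = 8  # hypothetical embedding size; for LLaMA-7B it would be 4096
# Two prompts of different lengths, already embedded, e.g. via
# model.model.embed_tokens(input_ids). Shapes: (seq_len, hidden).
e1, e2 = torch.randn(3, hidden), torch.randn(5, hidden)

def left_pad(e, max_len):
    # Pad on the left with zero embeddings so that, for a decoder-only
    # model, the real tokens sit at the right edge of the sequence.
    pad = torch.zeros(max_len - e.size(0), e.size(1))
    return torch.cat([pad, e], dim=0)

max_len = max(e1.size(0), e2.size(0))
inputs_embeds = torch.stack([left_pad(e1, max_len), left_pad(e2, max_len)])
attention_mask = torch.stack([
    torch.cat([torch.zeros(max_len - n, dtype=torch.long),
               torch.ones(n, dtype=torch.long)])
    for n in (e1.size(0), e2.size(0))
])
print(inputs_embeds.shape)   # (Batch, Sequence, EmbeddingSize) = (2, 5, 8)
print(attention_mask.shape)  # (Batch, Sequence) = (2, 5)
# outputs = model.generate(inputs_embeds=inputs_embeds,
#                          attention_mask=attention_mask, ...)
```

The attention mask tells the model to ignore the zero-embedding pad positions, which is what makes batching sequences of different lengths work.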