Using `inputs_embeds` for generation creates different results (and gives a warning)
I'm trying to use the `inputs_embeds` parameter to run the LLaMA model. This is part of my code:
# INPUT = ...embedding of a sequence, ensuring that there are no pad tokens
output_sequences = LLaMA.generate(
    inputs_embeds=INPUT.to(device),
    pad_token_id=tokenizer.pad_token_id,
    # ... generation parameters, top_p, top_k, etc.
)
I keep getting this warning, and the results are complete gibberish. I know this exact model performs well if I pass input_ids.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer.
After a lot of debugging, I found that this issue is caused by the transformers library itself. The generate function checks whether the last token ID in each batch item is the pad token ID; if it is, it displays this warning.
https://github.com/huggingface/transformers/blob/a0e733283930bdb9ae2b1afdc53ec5f2daefb033/src/transformers/generation/utils.py#L1308-L1315
The generate function expects the shape (Batch, Sequence), for which this logic works:
inputs_tensor[:, -1] == generation_config.pad_token_id
Now the problem is that I am passing inputs_embeds, not IDs. My shape is (Batch, Sequence, EmbeddingSize), so the comparison above runs elementwise over the embedding of the last token and is True wherever that embedding contains a zero. This is obviously incorrect.
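A dimension-aware guard along these lines would avoid the false positive (a hypothetical sketch, not the actual library code; `should_warn_right_padding` is a name I made up for illustration):

```python
import torch

def should_warn_right_padding(inputs_tensor: torch.Tensor, pad_token_id: int) -> bool:
    # The right-padding check only makes sense for 2D (Batch, Sequence)
    # tensors of token IDs. For 3D (Batch, Sequence, EmbeddingSize)
    # tensors of embeddings, comparing float values against a token ID
    # is meaningless, so skip the check entirely.
    if inputs_tensor.dim() != 2:
        return False
    return bool(torch.sum(inputs_tensor[:, -1] == pad_token_id) > 0)

ids = torch.tensor([[5, 6, 0]])  # (Batch, Sequence); last token is pad id 0
embeds = torch.zeros(1, 3, 8)    # (Batch, Sequence, EmbeddingSize)
print(should_warn_right_padding(ids, 0))     # True: genuine right padding
print(should_warn_right_padding(embeds, 0))  # False: embeddings, check skipped
```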
That explains the warning but not the incorrect generation.
Environment
- transformers==4.28.0
- Python 3.10.11
cc @gante
Hey @zrthxn 👋 Splitting my reply into two parts: the warning and the generation from input embeds.
Warning: agreed, it should check e.g. whether the input tensor has 3 or more dims (and not emit the warning in that case). Would you like to open a PR to fix it? :) (I think the same issue is present in TF and FLAX as well)
Generation: I've double-checked generation with input embeddings, and everything seems fine. Have a look at the example below
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
text = "Hello world"
input_ids = tokenizer.encode(text, return_tensors="pt")
# Traditional way of generating text
outputs = model.generate(input_ids)
print("\ngenerate + input_ids:", tokenizer.decode(outputs[0], skip_special_tokens=True))
# From inputs_embeds -- exact same output if you also pass `input_ids`. If you don't
# pass `input_ids`, you will get the same generated content but without the prompt
inputs_embeds = model.model.embed_tokens(input_ids)
outputs = model.generate(input_ids, inputs_embeds=inputs_embeds)
print("\ngenerate + inputs_embeds:", tokenizer.decode(outputs[0], skip_special_tokens=True))
@gante I confirmed once again and found that the input_embeds works. The problem was something I was doing with my embeddings. And yes, I'll create a PR for the warning.
I've tested your example @gante and everything works fine. However, when I switch the model to lmsys/vicuna-13b-v1.3, I get an error. Do you know what the difference is? I assume both models share the same implementation in transformers.models.llama.modeling_llama.LlamaForCausalLM.
My code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.3",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-v1.3")
text = "Hello world"
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)
inputs_embeds = model.model.embed_tokens(input_ids)
outputs = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=10)
print(
    "\ngenerate + inputs_embeds:",
    tokenizer.decode(outputs[0], skip_special_tokens=True),
)
Stack trace
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[3], line 5
2 input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)
4 inputs_embeds = model.model.embed_tokens(input_ids)
----> 5 outputs = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=10)
6 print("\ngenerate + inputs_embeds:", tokenizer.decode(outputs[0], skip_special_tokens=True))
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/generation/utils.py:1522, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
1516 raise ValueError(
1517 "num_return_sequences has to be 1 when doing greedy search, "
1518 f"but is {generation_config.num_return_sequences}."
1519 )
1521 # 11. run greedy search
-> 1522 return self.greedy_search(
1523 input_ids,
1524 logits_processor=logits_processor,
1525 stopping_criteria=stopping_criteria,
1526 pad_token_id=generation_config.pad_token_id,
1527 eos_token_id=generation_config.eos_token_id,
1528 output_scores=generation_config.output_scores,
1529 return_dict_in_generate=generation_config.return_dict_in_generate,
1530 synced_gpus=synced_gpus,
1531 streamer=streamer,
1532 **model_kwargs,
1533 )
1535 elif is_contrastive_search_gen_mode:
1536 if generation_config.num_return_sequences > 1:
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/generation/utils.py:2339, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2336 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2338 # forward pass to get next token
-> 2339 outputs = self(
2340 **model_inputs,
2341 return_dict=True,
2342 output_attentions=output_attentions,
2343 output_hidden_states=output_hidden_states,
2344 )
2346 if synced_gpus and this_peer_finished:
2347 continue # don't waste resources running the code we don't need
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:688, in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
685 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
687 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 688 outputs = self.model(
689 input_ids=input_ids,
690 attention_mask=attention_mask,
691 position_ids=position_ids,
692 past_key_values=past_key_values,
693 inputs_embeds=inputs_embeds,
694 use_cache=use_cache,
695 output_attentions=output_attentions,
696 output_hidden_states=output_hidden_states,
697 return_dict=return_dict,
698 )
700 hidden_states = outputs[0]
701 logits = self.lm_head(hidden_states)
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File ~/miniconda3/envs/InstructZero/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:528, in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
526 position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
527 else:
--> 528 position_ids = position_ids.view(-1, seq_length).long()
530 if inputs_embeds is None:
531 inputs_embeds = self.embed_tokens(input_ids)
RuntimeError: shape '[-1, 3]' is invalid for input of size 4
@NorbertRop The issue is fixed in #24639 (see the PR if you're curious about why it was breaking :) )
@NorbertRop should be fixed if you install from main
I also encountered this issue using version 4.39.0:
Exception in thread Thread-6:
Traceback (most recent call last):
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/transformers/generation/utils.py", line 1383, in generate
and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0
RuntimeError:
The code I ran is
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread
torch_device = "cuda:0"
tok = AutoTokenizer.from_pretrained("NeuralExperiment-7b-MagicCoder-v7.5")
model = AutoModelForCausalLM.from_pretrained("NeuralExperiment-7b-MagicCoder-v7.5").to(torch_device)
inputs = tok(["Edit the following Python code to write a heap sort function"], return_tensors="pt").to(torch_device)
streamer = TextIteratorStreamer(tok)
# Run the generation in a separate thread, so that we can fetch the generated text in a non-blocking way.
generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=512)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
generated_text = ""
for new_text in streamer:
generated_text += new_text
print(generated_text)
I am using an NPU; I tested that the same code runs successfully on a GPU.
@AnitaSherry 👋 NPUs are trickier than GPUs, so I will need a few more pointers!
- What's the full RuntimeError?
- Do you get the same error before v4.39?
@gante ok, the full RuntimeError is
Loading checkpoint shards: 100%|██████████| 3/3 [00:09<00:00, 3.03s/it]
user:Edit the following Python code to write a heap sort function
Exception in thread Thread-6:
Traceback (most recent call last):
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/transformers/generation/utils.py", line 1383, in generate
and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0
RuntimeError: getDevice:torch_npu/csrc/aten/common/CopyKernel.cpp:41 NPU error, error code is 107002
[Error]: The context is empty.
Check whether acl.rt.set_context or acl.rt.set_device is called.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4686]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Then I commented out lines 1377-1388 of the file
/root/anaconda3/envs/sakura/lib/python3.9/site-packages/transformers/generation/utils.py
The error displayed then becomes
user:Edit the following Python code to write a heap sort function
Exception in thread Thread-6:
Traceback (most recent call last):
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/root/anaconda3/envs/sakura/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/sakura/lib/python3.9/site-packages/transformers/generation/utils.py", line 1411, in generate
streamer.put(input_ids.cpu())
RuntimeError: getDevice:torch_npu/csrc/aten/common/CopyKernel.cpp:41 NPU error, error code is 107002
[Error]: The context is empty.
Check whether acl.rt.set_context or acl.rt.set_device is called.
EE1001: The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4686]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
The code I am running is
import torch
import torch_npu
import torchvision
import torchvision_npu
from torch_npu.contrib import transfer_to_npu
torch_device = "npu:3" # 0~7
torch.npu.set_device(torch.device(torch_device))
torch.npu.set_compile_mode(jit_compile=False)
option = {}
option["NPU_FUZZY_COMPILE_BLACKLIST"] = "Tril"
torch.npu.set_option(option)
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from transformers import TextStreamer,TextIteratorStreamer
import sys
from threading import Thread
model_id = "/my/model/path/gemma-2b-coder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(torch_device)
def generate(
    instruction,
    max_new_tokens=512,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=2,
    **kwargs,
):
    torch.npu.set_device(torch.device(torch_device))
    system = f"<bos><|system|>\nYou are a helpful coding assistant.<eos>\n"
    prompt = f"{system}<|user|>\n{instruction}<eos>\n<|assistant|>\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(torch_device)
    attention_mask = inputs["attention_mask"].to(torch_device)
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
            return_dict_in_generate=True,
            max_new_tokens=max_new_tokens,
            early_stopping=True,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s, skip_special_tokens=True)
    return output.split("<|assistant|>")[1]

def streamer_generate(
    instruction,
    max_new_tokens=512,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=1,
    **kwargs,
):
    # system = f"<bos><|system|>\nYou are a helpful coding assistant.<eos>\n"
    # prompt = f"{system}<|user|>\n{instruction}<eos>\n<|assistant|>\n"
    prompt = instruction
    inputs = tokenizer([prompt], return_tensors="pt").to(torch_device)
    streamer = TextIteratorStreamer(tokenizer)
    generation_kwargs = dict(
        inputs,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        **kwargs,
    )
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    generated_text, position = "", 0
    for new_text in streamer:
        generated_text += new_text
        print(generated_text[position:], end='', flush=True)
        position = len(generated_text)

if __name__ == "__main__":
    while True:
        user_input = input("user:")
        if not user_input:
            continue
        if user_input == "exit":
            print("Task is over.")
            sys.exit()
        streamer_generate(user_input)
generate() runs successfully.
I will test other versions of the transformers package right away.
@AnitaSherry that seems to be an NPU-related issue with our code. Sadly I am not familiar with NPUs, so I'm not sure how I can help :)
@zrthxn Hello, do you know how to do batch generation with Llama 2 using inputs_embeds?
@gnehcoul To the best of my knowledge, you can batch your sequences of embeddings and it should just work. The shape of the input should be (Batch, Sequence, EmbeddingSize), so batch should be the first dimension.
You will also need to pad the sequences to the same length to form batches.
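To make the padding concrete, here is a minimal sketch (hypothetical shapes, no real model loaded) of left-padding two embedded sequences to a common length and building the matching attention_mask, which you would then pass to model.generate(inputs_embeds=..., attention_mask=...):

```python
import torch

hidden = 8  # hypothetical embedding size; for LLaMA-7B it would be 4096
# Two prompts of different lengths, already embedded, e.g. via
# model.model.embed_tokens(input_ids). Shapes: (seq_len, hidden).
e1, e2 = torch.randn(3, hidden), torch.randn(5, hidden)

def left_pad(e, max_len):
    # Pad on the left with zero embeddings so that, for a decoder-only
    # model, the real tokens sit at the right edge of the sequence.
    pad = torch.zeros(max_len - e.size(0), e.size(1))
    return torch.cat([pad, e], dim=0)

max_len = max(e1.size(0), e2.size(0))
inputs_embeds = torch.stack([left_pad(e1, max_len), left_pad(e2, max_len)])
attention_mask = torch.stack([
    torch.cat([torch.zeros(max_len - n, dtype=torch.long),
               torch.ones(n, dtype=torch.long)])
    for n in (e1.size(0), e2.size(0))
])
print(inputs_embeds.shape)   # (Batch, Sequence, EmbeddingSize) = (2, 5, 8)
print(attention_mask.shape)  # (Batch, Sequence) = (2, 5)
# outputs = model.generate(inputs_embeds=inputs_embeds,
#                          attention_mask=attention_mask, ...)
```

The attention mask tells the model to ignore the zero-embedding pad positions, which is what makes batching sequences of different lengths work.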