
MllamaForCausalLM not returning past_key_values even with use_cache=True

Open Zhangyanbo opened this issue 1 year ago • 3 comments

System Info

  • transformers version: 4.45.2
  • Platform: Linux-5.4.0-187-generic-x86_64-with-glibc2.31
  • Python version: 3.11.5
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA RTX 6000 Ada Generation

Who can help?

@amyeroberts @ArthurZucker

Information

  • [ ] The official example scripts
  • [x] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [x] My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import torch


model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct" # https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct


tokenizer = AutoTokenizer.from_pretrained(model_id)

# AutoModelForCausalLM resolves this checkpoint to MllamaForCausalLM (the text decoder only)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='auto',
    )

processor = AutoProcessor.from_pretrained(model_id)

input_ids = tokenizer.encode('Hi, tell me a story of frog.', add_special_tokens=False, return_tensors='pt').to(model.device)

with torch.no_grad():
    output = model(input_ids=input_ids, use_cache=True)

print(output.past_key_values)  # expected a cache, but this prints None

Expected behavior

I expect to see past_key_values in the output. However, output.past_key_values is None.

Zhangyanbo avatar Oct 17 '24 00:10 Zhangyanbo

Hmm, we actually implemented Mllama quite similarly to Idefics, so the cache is not initialized by default when use_cache=True. And yes, I think it makes sense to init an empty cache for models that don't need a special cache class the way Gemma does.

Until the fix is there you can get past-kv by passing model(**inputs, past_key_values=DynamicCache(), use_cache=True). But I see that the model weights will not be loaded the proper way for the CausalLM model. In fact the ConditionalGeneration model can handle text-only input, so for proper logits computation I'd recommend using the ConditionalGeneration model :)
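
A minimal sketch of that workaround (editor's illustration, not part of the original comment; it assumes transformers 4.45, where DynamicCache is exported at the top level, and uses the ConditionalGeneration model as recommended above):

import torch
from transformers import AutoTokenizer, MllamaForConditionalGeneration, DynamicCache

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the full conditional-generation model so all checkpoint weights are used;
# it also accepts text-only input.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hi, tell me a story of frog.", return_tensors="pt").to(model.device)

with torch.no_grad():
    # Passing an explicit DynamicCache makes the forward pass return it, populated.
    output = model(**inputs, past_key_values=DynamicCache(), use_cache=True)

print(type(output.past_key_values))  # DynamicCache, filled with key/value states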

zucchini-nlp avatar Oct 17 '24 13:10 zucchini-nlp

With use_cache=True we should probably just init a default cache for the user, or else force users to pass a cache object
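
A hypothetical sketch of what that default-cache initialization could look like in the model's forward (illustrative only, not the actual PR; the helper name is made up):

from transformers import DynamicCache

def _maybe_init_cache(past_key_values, use_cache):
    # If the caller asked for a cache but didn't provide one, start from an empty DynamicCache.
    if use_cache and past_key_values is None:
        past_key_values = DynamicCache()
    return past_key_values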

ArthurZucker avatar Oct 17 '24 15:10 ArthurZucker

Yes, exactly. I can make a PR for that

zucchini-nlp avatar Oct 17 '24 15:10 zucchini-nlp

Not stale, PR in progress

zucchini-nlp avatar Nov 18 '24 10:11 zucchini-nlp