InstructBlipProcessor not working with load_in_4bit and load_in_8bit
System Info
transformers @ git+https://github.com/huggingface/transformers@68c92981ff2b804979d2e6107eeefe298d1e5183
Python 3.11.4

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    50W / 400W |    845MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     12365      C   ...nda/envs/myenv/bin/python      843MiB |
+-----------------------------------------------------------------------------+
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Currently trying to run the following script:
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
torch.cuda.empty_cache()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.bfloat16)
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.bfloat16)
image = Image.open('examples/test1.jpeg')
inputs = processor(images=image, text='', return_tensors="pt").to(device)
outputs = model.generate(
**inputs,
do_sample=False,
num_beams=5,
max_length=256,
min_length=1,
top_p=0.9,
repetition_penalty=1.5,
length_penalty=1.0,
temperature=1,
)
But I am getting the following error (see below). Is it possible that InstructBlipForConditionalGeneration does not yet support load_in_4bit?
Error logs:
RuntimeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 outputs = model.generate(
2 **inputs,
3 do_sample=False,
4 num_beams=5,
5 max_length=256,
6 min_length=1,
7 top_p=0.9,
8 repetition_penalty=1.5,
9 length_penalty=1.0,
10 temperature=1,
11 )
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/transformers/models/instructblip/modeling_instructblip.py:1517, in InstructBlipForConditionalGeneration.generate(self, pixel_values, qformer_input_ids, qformer_attention_mask, input_ids, attention_mask, **generate_kwargs)
1514 self._preprocess_accelerate()
1516 batch_size = pixel_values.shape[0]
-> 1517 image_embeds = self.vision_model(pixel_values, return_dict=True).last_hidden_state
1519 image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device)
1521 query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/envs/myenv/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/transformers/models/instructblip/modeling_instructblip.py:538, in InstructBlipVisionModel.forward(self, pixel_values, output_attentions, output_hidden_states, return_dict)
535 if pixel_values is None:
536 raise ValueError("You have to specify pixel_values")
--> 538 hidden_states = self.embeddings(pixel_values)
540 encoder_outputs = self.encoder(
541 inputs_embeds=hidden_states,
542 output_attentions=output_attentions,
543 output_hidden_states=output_hidden_states,
544 return_dict=return_dict,
545 )
547 last_hidden_state = encoder_outputs[0]
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/envs/myenv/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/transformers/models/instructblip/modeling_instructblip.py:113, in InstructBlipVisionEmbeddings.forward(self, pixel_values)
111 batch_size = pixel_values.shape[0]
112 target_dtype = self.patch_embedding.weight.dtype
--> 113 patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid]
114 patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
116 class_embeds = self.class_embedding.expand(batch_size, 1, -1).to(target_dtype)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/envs/myenv/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/conv.py:463, in Conv2d.forward(self, input)
462 def forward(self, input: Tensor) -> Tensor:
--> 463 return self._conv_forward(input, self.weight, self.bias)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/conv.py:459, in Conv2d._conv_forward(self, input, weight, bias)
455 if self.padding_mode != 'zeros':
456 return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
457 weight, bias, self.stride,
458 _pair(0), self.dilation, self.groups)
--> 459 return F.conv2d(input, weight, bias, self.stride,
460 self.padding, self.dilation, self.groups)
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
Expected behavior
Produce output string as expected
cc @younesbelkada
Hi @fraferra, in https://github.com/huggingface/transformers/pull/24555 I have fixed a silent issue with processors that you are currently facing. Can you try installing transformers from source and running:
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
torch.cuda.empty_cache()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.bfloat16)
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.bfloat16)
image = Image.open('examples/test1.jpeg')
inputs = processor(images=image, text='', return_tensors="pt").to(device, torch.bfloat16)
outputs = model.generate(
**inputs,
do_sample=False,
num_beams=5,
max_length=256,
min_length=1,
top_p=0.9,
repetition_penalty=1.5,
length_penalty=1.0,
temperature=1,
)
Check a more concrete example here: https://github.com/huggingface/transformers/blob/66954ea25e342fd451c26ec1c295da0b8692086b/tests/models/instructblip/test_modeling_instructblip.py#L524
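For completeness, a minimal sketch of decoding the generated ids back into text, reusing the processor and outputs objects from the snippet above (batch_decode delegates to the underlying tokenizer):

# Turn the generated token ids into a readable string.
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)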
Thank you @younesbelkada for looking into it! Is it possible that BatchEncoding.to() needs to be updated?
I can see in the source code that BatchEncoding.to() (https://github.com/huggingface/transformers/blob/9e28750287df57942d716083ae53bb4e766104c2/src/transformers/tokenization_utils_base.py#L756) only takes a single device argument.
I am getting the following error when trying to run your code snippet:
1 device = "cuda" if torch.cuda.is_available() else "cpu"
2 image = Image.open('examples/test1.jpeg')
----> 3 inputs = processor(images=image, text='', return_tensors="pt").to(device, torch.bfloat16)
4 outputs = model.generate(
5 **inputs,
6 do_sample=False,
(...)
13 temperature=1,
14 )
TypeError: BatchEncoding.to() takes 2 positional arguments but 3 were given
Pretty weird, since InstructBlipProcessor.__call__ should return a BatchFeature, whose to method accepts *args and **kwargs, unlike the one from BatchEncoding, which only takes device as an argument.
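For reference, a quick sketch (reusing the processor and image objects above) to check which class the processor returns and why the dtype cast only works on one of them:

inputs = processor(images=image, text='', return_tensors="pt")
print(type(inputs).__name__)  # BatchFeature on current main, BatchEncoding on older builds
# BatchFeature.to(*args, **kwargs) forwards both a device and a dtype, while
# BatchEncoding.to(device) only accepts a device, which explains the TypeError above.
inputs = inputs.to("cuda", torch.bfloat16)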
Hi @fraferra Can you install transformers from the main branch and try again?
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers.git
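To verify that the reinstall actually took effect in the active environment (a quick check, not something from the thread):

import transformers
print(transformers.__version__)  # a trailing ".dev0" indicates a build installed from source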
@younesbelkada it worked, thank you! For some reason it wouldn't update to the latest transformers version in the conda env. After uninstalling and reinstalling it, the processor returned a BatchFeature.
Thank you @fraferra, feel free to close the issue! Let us know if you have more questions.
Since the vision encoder is its own model and it uses an LLM (Vicuna, OPT, etc.), would there be a way to use the already loaded model to ask a text question and get a text answer unrelated to the image? In other words, just use the LLM as an LLM?
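Not confirmed anywhere in this thread, but as a rough sketch under the assumption that the loaded InstructBlipForConditionalGeneration exposes the LLM as model.language_model and the processor wraps its tokenizer as processor.tokenizer, text-only generation might look like:

# Tokenize a text-only prompt with the LLM tokenizer bundled in the processor.
text_inputs = processor.tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
# Bypass the vision encoder and Q-Former entirely and call the language model directly.
text_outputs = model.language_model.generate(**text_inputs, max_new_tokens=64)
print(processor.tokenizer.decode(text_outputs[0], skip_special_tokens=True))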