
InstructBlipProcessor not working with load_in_4bit and load_in_8bit

Open fraferra opened this issue 2 years ago • 4 comments

System Info

transformers @ git+https://github.com/huggingface/transformers@68c92981ff2b804979d2e6107eeefe298d1e5183
Python 3.11.4

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    50W / 400W |    845MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     12365      C   ...nda/envs/myenv/bin/python      843MiB |
+-----------------------------------------------------------------------------+

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Currently trying to run the following script:

from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image

torch.cuda.empty_cache()
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True,  torch_dtype=torch.bfloat16)
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.bfloat16)

image = Image.open('examples/test1.jpeg')
device = "cuda" if torch.cuda.is_available() else "cpu"  # defined earlier in the original notebook (see the second traceback below)
inputs = processor(images=image, text='', return_tensors="pt").to(device)
outputs = model.generate(
        **inputs,
        do_sample=False,
        num_beams=5,
        max_length=256,
        min_length=1,
        top_p=0.9,
        repetition_penalty=1.5,
        length_penalty=1.0,
        temperature=1,
)

But I am getting the following error (see below). Is it possible that InstructBlipForConditionalGeneration does not yet support load_in_4bit?

Error logs:

RuntimeError                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 outputs = model.generate(
      2         **inputs,
      3         do_sample=False,
      4         num_beams=5,
      5         max_length=256,
      6         min_length=1,
      7         top_p=0.9,
      8         repetition_penalty=1.5,
      9         length_penalty=1.0,
     10         temperature=1,
     11 )

File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /opt/conda/envs/myenv/lib/python3.11/site-packages/transformers/models/instructblip/modeling_instructblip.py:1517, in InstructBlipForConditionalGeneration.generate(self, pixel_values, qformer_input_ids, qformer_attention_mask, input_ids, attention_mask, **generate_kwargs)
   1514     self._preprocess_accelerate()
   1516 batch_size = pixel_values.shape[0]
-> 1517 image_embeds = self.vision_model(pixel_values, return_dict=True).last_hidden_state
   1519 image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device)
   1521 query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)

File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/envs/myenv/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File /opt/conda/envs/myenv/lib/python3.11/site-packages/transformers/models/instructblip/modeling_instructblip.py:538, in InstructBlipVisionModel.forward(self, pixel_values, output_attentions, output_hidden_states, return_dict)
    535 if pixel_values is None:
    536     raise ValueError("You have to specify pixel_values")
--> 538 hidden_states = self.embeddings(pixel_values)
    540 encoder_outputs = self.encoder(
    541     inputs_embeds=hidden_states,
    542     output_attentions=output_attentions,
    543     output_hidden_states=output_hidden_states,
    544     return_dict=return_dict,
    545 )
    547 last_hidden_state = encoder_outputs[0]

File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/envs/myenv/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File /opt/conda/envs/myenv/lib/python3.11/site-packages/transformers/models/instructblip/modeling_instructblip.py:113, in InstructBlipVisionEmbeddings.forward(self, pixel_values)
    111 batch_size = pixel_values.shape[0]
    112 target_dtype = self.patch_embedding.weight.dtype
--> 113 patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
    114 patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
    116 class_embeds = self.class_embedding.expand(batch_size, 1, -1).to(target_dtype)

File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/envs/myenv/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/conv.py:463, in Conv2d.forward(self, input)
    462 def forward(self, input: Tensor) -> Tensor:
--> 463     return self._conv_forward(input, self.weight, self.bias)

File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/conv.py:459, in Conv2d._conv_forward(self, input, weight, bias)
    455 if self.padding_mode != 'zeros':
    456     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
    457                     weight, bias, self.stride,
    458                     _pair(0), self.dilation, self.groups)
--> 459 return F.conv2d(input, weight, bias, self.stride,
    460                 self.padding, self.dilation, self.groups)

RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
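
The last frame shows the mismatch: the processor returns float32 pixel values while the 4-bit model's vision embedding weights were cast to bfloat16. One possible workaround sketch, assuming the model and processor objects above, is to cast only the floating-point inputs to the model's compute dtype before calling generate:

# Sketch: cast floating-point processor outputs (pixel_values) to bfloat16;
# integer tensors such as input_ids keep their dtype.
device = "cuda" if torch.cuda.is_available() else "cpu"
raw_inputs = processor(images=image, text='', return_tensors="pt")
inputs = {k: v.to(device, torch.bfloat16) if v.is_floating_point() else v.to(device)
          for k, v in raw_inputs.items()}
outputs = model.generate(**inputs, num_beams=5, max_length=256)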

Expected behavior

Produce output string as expected

fraferra avatar Jun 29 '23 00:06 fraferra

cc @younesbelkada

sgugger avatar Jun 29 '23 00:06 sgugger

Hi @fraferra, in https://github.com/huggingface/transformers/pull/24555 I have fixed a silent issue with processors that you are currently facing. Can you try to install transformers from source and run:

from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image

torch.cuda.empty_cache()
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True,  torch_dtype=torch.bfloat16)
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.bfloat16)

image = Image.open('examples/test1.jpeg')
device = "cuda" if torch.cuda.is_available() else "cpu"  # device setup assumed, as in the original script
inputs = processor(images=image, text='', return_tensors="pt").to(device, torch.bfloat16)
outputs = model.generate(
        **inputs,
        do_sample=False,
        num_beams=5,
        max_length=256,
        min_length=1,
        top_p=0.9,
        repetition_penalty=1.5,
        length_penalty=1.0,
        temperature=1,
)

Check a more concrete example here: https://github.com/huggingface/transformers/blob/66954ea25e342fd451c26ec1c295da0b8692086b/tests/models/instructblip/test_modeling_instructblip.py#L524

younesbelkada avatar Jun 29 '23 07:06 younesbelkada

Thank you @younesbelkada for looking into it! Is it possible that BatchEncoding.to() needs to be updated? I can see in the source code that BatchEncoding.to() (https://github.com/huggingface/transformers/blob/9e28750287df57942d716083ae53bb4e766104c2/src/transformers/tokenization_utils_base.py#L756) only takes a single device argument.

I am getting the following error when trying to run your code snippet:

      1 device = "cuda" if torch.cuda.is_available() else "cpu"
      2 image = Image.open('examples/test1.jpeg')
----> 3 inputs = processor(images=image, text='', return_tensors="pt").to(device, torch.bfloat16)
      4 outputs = model.generate(
      5         **inputs,
      6         do_sample=False,
   (...)
     13         temperature=1,
     14 )

TypeError: BatchEncoding.to() takes 2 positional arguments but 3 were given

fraferra avatar Jun 29 '23 19:06 fraferra

Pretty weird, since InstructBlipProcessor.__call__ should return a BatchFeature, whose to method can take *args, **kwargs, unlike the one from BatchEncoding, which only takes device as an argument.
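
A small sketch of that difference, assuming a recent transformers install with torch available:

from transformers import BatchFeature
import torch

# BatchFeature.to forwards *args/**kwargs and casts only floating-point tensors,
# so passing both a device and a dtype works.
feat = BatchFeature({"pixel_values": torch.zeros(1, 3, 224, 224)})
feat = feat.to("cpu", torch.bfloat16)

# A BatchEncoding returned by a tokenizer only accepts a device, so
# encoding.to("cpu", torch.bfloat16) raises
# "BatchEncoding.to() takes 2 positional arguments but 3 were given".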

Bearnardd avatar Jun 30 '23 17:06 Bearnardd

Hi @fraferra Can you install transformers from the main branch and try again?

pip uninstall transformers
pip install git+https://github.com/huggingface/transformers.git
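
To confirm that the environment actually picked up the source install, an optional quick check:

import transformers
print(transformers.__version__)  # a source install reports a ".dev0" version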

younesbelkada avatar Jul 06 '23 07:07 younesbelkada

@younesbelkada it worked, thank you! For some reason it wouldn't update to the latest transformers version in the conda env. After uninstalling and reinstalling, the processor returned a BatchFeature.

fraferra avatar Jul 06 '23 14:07 fraferra

Thank you @fraferra, feel free to close the issue! Let us know if you have more questions.

younesbelkada avatar Jul 06 '23 14:07 younesbelkada

As the vision model is paired with an LLM (Vicuna, OPT, etc.), would there be a way to use the already loaded model to ask a text question and get a text answer, irrelevant of the image? That is, just use the LLM as an LLM?

smc-git avatar Aug 01 '23 12:08 smc-git
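
For reference, a rough sketch of how that might be done with the already-loaded checkpoint, assuming the model and processor objects from earlier in the thread (the language_model submodule wraps the underlying LLM and processor.tokenizer is its tokenizer; behaviour under 4-bit loading is assumed, not verified):

# Rough sketch: run a text-only prompt through the wrapped LLM, skipping the
# vision tower and Q-Former entirely.
text_inputs = processor.tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
text_outputs = model.language_model.generate(**text_inputs, max_new_tokens=64)
print(processor.tokenizer.decode(text_outputs[0], skip_special_tokens=True))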