InstructBlipProcessor not working with load_in_4bit and load_in_8bit
System Info
transformers @ git+https://github.com/huggingface/transformers@68c92981ff2b804979d2e6107eeefe298d1e5183
Python 3.11.4

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    50W / 400W |    845MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     12365      C   ...nda/envs/myenv/bin/python      843MiB |
+-----------------------------------------------------------------------------+
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Currently trying to run the following script:
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
torch.cuda.empty_cache()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.bfloat16)
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.bfloat16)
image = Image.open('examples/test1.jpeg')
inputs = processor(images=image, text='', return_tensors="pt").to(device)
outputs = model.generate(
**inputs,
do_sample=False,
num_beams=5,
max_length=256,
min_length=1,
top_p=0.9,
repetition_penalty=1.5,
length_penalty=1.0,
temperature=1,
)
But I am getting the following error (see below). Is it possible that InstructBlipForConditionalGeneration does not yet support load_in_4bit?
Error logs:
RuntimeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 outputs = model.generate(
2 **inputs,
3 do_sample=False,
4 num_beams=5,
5 max_length=256,
6 min_length=1,
7 top_p=0.9,
8 repetition_penalty=1.5,
9 length_penalty=1.0,
10 temperature=1,
11 )
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/transformers/models/instructblip/modeling_instructblip.py:1517, in InstructBlipForConditionalGeneration.generate(self, pixel_values, qformer_input_ids, qformer_attention_mask, input_ids, attention_mask, **generate_kwargs)
1514 self._preprocess_accelerate()
1516 batch_size = pixel_values.shape[0]
-> 1517 image_embeds = self.vision_model(pixel_values, return_dict=True).last_hidden_state
1519 image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device)
1521 query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/envs/myenv/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/transformers/models/instructblip/modeling_instructblip.py:538, in InstructBlipVisionModel.forward(self, pixel_values, output_attentions, output_hidden_states, return_dict)
535 if pixel_values is None:
536 raise ValueError("You have to specify pixel_values")
--> 538 hidden_states = self.embeddings(pixel_values)
540 encoder_outputs = self.encoder(
541 inputs_embeds=hidden_states,
542 output_attentions=output_attentions,
543 output_hidden_states=output_hidden_states,
544 return_dict=return_dict,
545 )
547 last_hidden_state = encoder_outputs[0]
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/envs/myenv/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/transformers/models/instructblip/modeling_instructblip.py:113, in InstructBlipVisionEmbeddings.forward(self, pixel_values)
111 batch_size = pixel_values.shape[0]
112 target_dtype = self.patch_embedding.weight.dtype
--> 113 patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid]
114 patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
116 class_embeds = self.class_embedding.expand(batch_size, 1, -1).to(target_dtype)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /opt/conda/envs/myenv/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
163 output = old_forward(*args, **kwargs)
164 else:
--> 165 output = old_forward(*args, **kwargs)
166 return module._hf_hook.post_forward(module, output)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/conv.py:463, in Conv2d.forward(self, input)
462 def forward(self, input: Tensor) -> Tensor:
--> 463 return self._conv_forward(input, self.weight, self.bias)
File /opt/conda/envs/myenv/lib/python3.11/site-packages/torch/nn/modules/conv.py:459, in Conv2d._conv_forward(self, input, weight, bias)
455 if self.padding_mode != 'zeros':
456 return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
457 weight, bias, self.stride,
458 _pair(0), self.dilation, self.groups)
--> 459 return F.conv2d(input, weight, bias, self.stride,
460 self.padding, self.dilation, self.groups)
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
Expected behavior
Produce output string as expected
cc @younesbelkada
Hi @fraferra, in https://github.com/huggingface/transformers/pull/24555 I have fixed a silent issue with processors that you are currently facing. Can you try installing transformers from source and running:
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
torch.cuda.empty_cache()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.bfloat16)
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b", load_in_4bit=True, torch_dtype=torch.bfloat16)
image = Image.open('examples/test1.jpeg')
inputs = processor(images=image, text='', return_tensors="pt").to(device, torch.bfloat16)
outputs = model.generate(
**inputs,
do_sample=False,
num_beams=5,
max_length=256,
min_length=1,
top_p=0.9,
repetition_penalty=1.5,
length_penalty=1.0,
temperature=1,
)
Check a more concrete example here: https://github.com/huggingface/transformers/blob/66954ea25e342fd451c26ec1c295da0b8692086b/tests/models/instructblip/test_modeling_instructblip.py#L524
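For completeness, a minimal sketch of decoding the generated ids back into text, reusing the processor and outputs objects from the snippet above (batch_decode delegates to the underlying tokenizer):

# Turn the generated token ids into a readable string.
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)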
Thank you @younesbelkada for looking into it! Is it possible that BatchEncoding.to() needs to be updated?
I can see in the source code that BatchEncoding.to() (https://github.com/huggingface/transformers/blob/9e28750287df57942d716083ae53bb4e766104c2/src/transformers/tokenization_utils_base.py#L756) only takes a single device argument.
I am getting the following error when trying to run your code snippet:
1 device = "cuda" if torch.cuda.is_available() else "cpu"
2 image = Image.open('examples/test1.jpeg')
----> 3 inputs = processor(images=image, text='', return_tensors="pt").to(device, torch.bfloat16)
4 outputs = model.generate(
5 **inputs,
6 do_sample=False,
(...)
13 temperature=1,
14 )
TypeError: BatchEncoding.to() takes 2 positional arguments but 3 were given
Pretty weird, since InstructBlipProcessor.__call__ should return a BatchFeature, whose to method accepts *args and **kwargs, unlike the one from BatchEncoding, which only takes device as an argument.
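For reference, a quick sketch (reusing the processor and image objects above) to check which class the processor returns and why the dtype cast only works on one of them:

inputs = processor(images=image, text='', return_tensors="pt")
print(type(inputs).__name__)  # BatchFeature on current main, BatchEncoding on older builds
# BatchFeature.to(*args, **kwargs) forwards both a device and a dtype, while
# BatchEncoding.to(device) only accepts a device, which explains the TypeError above.
inputs = inputs.to("cuda", torch.bfloat16)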
Hi @fraferra Can you install transformers from the main branch and try again?
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers.git
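To verify that the reinstall actually took effect in the active environment (a quick check, not something from the thread):

import transformers
print(transformers.__version__)  # a trailing ".dev0" indicates a build installed from source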
@younesbelkada it worked, thank you! For some reason it wouldn't update to the latest transformers version in the conda env. After uninstalling and reinstalling it, the processor returned a BatchFeature.
Thank you @fraferra, feel free to close the issue! Let us know if you have more questions.
Since the vision encoder is its own model and it uses an LLM (Vicuna, OPT, etc.), would there be a way to use the already loaded model to ask a text question and get a text answer unrelated to the image? In other words, just use the LLM as an LLM?
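Not confirmed anywhere in this thread, but as a rough sketch under the assumption that the loaded InstructBlipForConditionalGeneration exposes the LLM as model.language_model and the processor wraps its tokenizer as processor.tokenizer, text-only generation might look like:

# Tokenize a text-only prompt with the LLM tokenizer bundled in the processor.
text_inputs = processor.tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
# Bypass the vision encoder and Q-Former entirely and call the language model directly.
text_outputs = model.language_model.generate(**text_inputs, max_new_tokens=64)
print(processor.tokenizer.decode(text_outputs[0], skip_special_tokens=True))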