Unable to use load_in_8bit when the model is shared between GPU and CPU
It seems like bitsandbytes can't be used if the model is shared between GPU and CPU.
I could not find any information saying that the entire model must be loaded on the GPU in order to use bitsandbytes, so I'm not sure whether this is a bug or the expected behavior.
The environment setup:
pip install --extra-index-url https://download.pytorch.org/whl/cu116 torch==1.12.1+cu116
pip install transformers==4.22.1
pip install accelerate==0.12.0
pip install bitsandbytes==0.33.1
The main.py script:
from transformers import pipeline

auto_map = False
load_in_8bit = True

if auto_map:
    device_map = "auto"
else:
    device_map = {
        "transformer.wte": 0,
        "transformer.wpe": 0,
        "transformer.ln_f": "cpu",
        "lm_head": 0,
        "transformer.h.0": 0,
        "transformer.h.1": "cpu",
        "transformer.h.2": "cpu",
        "transformer.h.3": "cpu",
        "transformer.h.4": "cpu",
        "transformer.h.5": "cpu",
        "transformer.h.6": "cpu",
        "transformer.h.7": "cpu",
        "transformer.h.8": "cpu",
        "transformer.h.9": "cpu",
        "transformer.h.10": "cpu",
        "transformer.h.11": "cpu"
    }

pipe = pipeline(
    model="EleutherAI/gpt-neo-125M",
    max_length=32,
    model_kwargs={
        "device_map": device_map,
        "load_in_8bit": load_in_8bit
    }
)

print("\n", pipe("It was")[0]["generated_text"])
The auto_map and load_in_8bit variables at the top control the script's behavior.
When the script is run with auto_map = False and load_in_8bit = True, it crashes with this error:
❯ python main.py
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1.44k/1.44k [00:00<00:00, 634kB/s]
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/home/user/.gtkrc'), PosixPath('/etc/gtk/gtkrc')}
[... lots of similar warnings about non-existent paths ...]
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Traceback (most recent call last):
File "/home/user/test/bnb-test/main.py", line 37, in <module>
print("\n", pipe("It was")[0]["generated_text"])
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 176, in __call__
return super().__call__(text_inputs, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1074, in __call__
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1081, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 990, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 218, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/generation_utils.py", line 1319, in generate
return self.greedy_search(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/generation_utils.py", line 1713, in greedy_search
outputs = self(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 744, in forward
transformer_outputs = self.transformer(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 623, in forward
outputs = block(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 328, in forward
attn_outputs = self.attn(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 280, in forward
return self.attention(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 224, in forward
query = self.q_proj(hidden_states)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 256, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 391, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 254, in forward
state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1604, in transform
prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'
All other combinations of auto_map and load_in_8bit run without errors and print the generated text.
Hey!
Thanks for your message.
Currently I don't think the CPU is supported for mixed 8-bit matrix multiplication (cc @TimDettmers), and 8-bit models in Hugging Face transformers are only supported with device_map="auto" (in other words, you cannot provide a custom device_map as in your snippet). However, I think this could be an interesting improvement: modules placed on the CPU should stay native (i.e., in their original dtype), and only modules placed on the GPU should be quantized.
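For reference, the path that currently works is to quantize the whole model on the GPU and let accelerate build the device map itself. A minimal sketch (the exact loading and generation calls below are an illustration, not code from this thread, and assume the model fits on GPU 0):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" lets accelerate place every module; with enough GPU memory
# everything lands on GPU 0 and the linear layers (except lm_head) are quantized.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
)
inputs = tokenizer("It was", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_length=32)[0]))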
I have opened an issue on Hugging Face transformers and will see what I can do:
https://github.com/huggingface/transformers/issues/19090
There is a PR (https://github.com/huggingface/transformers/pull/20281) that adds support for a custom device_map when load_in_8bit=True. However, it was decided that it will not be merged until bitsandbytes supports offloading weights to the CPU in 8-bit.
Is it technically possible for bitsandbytes to support 8-bit on the CPU? If it is not possible at all, then the transformers library may need to provide a way to offload weights to the CPU in their original dtype while still converting the GPU weights to 8-bit (if the developers ever agree to that).
Does this also apply when the model is shared between GPU and disk? My device map looks like this:
{'shared': 0,
'decoder.embed_tokens': 0,
'encoder.embed_tokens': 0,
'encoder.block.0': 0,
'encoder.block.1': 0,
'encoder.block.2': 0,
'encoder.block.3': 0,
'encoder.block.4': 0,
'encoder.block.5': 0,
'encoder.block.6': 0,
'encoder.block.7': 0,
'encoder.block.8': 0,
'encoder.block.10': 'disk',
'encoder.block.11': 'disk',
'encoder.block.12': 'disk',
'encoder.block.13': 'disk',
'encoder.block.14': 'disk',
'encoder.block.15': 'disk',
'encoder.block.16': 'disk',
'encoder.block.17': 'disk',
'encoder.block.18': 'disk',
'encoder.block.19': 'disk',
'encoder.block.20': 'disk',
'encoder.block.21': 'disk',
'encoder.block.22': 'disk',
'encoder.block.23': 'disk',
'encoder.final_layer_norm': 'disk',
'encoder.dropout': 'disk',
'decoder.block': 'disk',
'decoder.final_layer_norm': 'disk',
'decoder.dropout': 'disk',
'lm_head': 'disk',
'encoder.block.9': 'disk'}
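(For context, a GPU-plus-disk map like the one above is normally produced by accelerate rather than written by hand. The sketch below shows roughly how; the checkpoint name and memory limits are assumptions, not my exact setup.)

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Build an empty (meta-device) model so the map can be computed without loading weights.
config = AutoConfig.from_pretrained("google/flan-t5-xl")  # assumed checkpoint
with init_empty_weights():
    empty_model = AutoModelForSeq2SeqLM.from_config(config)

# With no CPU budget, whatever does not fit on GPU 0 is assigned to "disk".
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "10GiB", "cpu": "0GiB"},
    no_split_module_classes=["T5Block"],
)
print(device_map)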
Running the last line of the following code
input_text = "translate English to German: How old are you?"
input_ids = flan_t5_tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = flan_t5.generate(input_ids)
raises this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-9-de7998043414> in <module>
2 input_ids = flan_t5_tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
3
----> 4 outputs = flan_t5.generate(input_ids)
5 print(tokenizer.decode(outputs[0]))
19 frames
/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
28 return cast(F, decorate_context)
29
/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py in generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
1253 # if model is encoder decoder encoder_outputs are created
1254 # and added to `model_kwargs`
-> 1255 model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
1256 inputs_tensor, model_kwargs, model_input_name
1257 )
/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py in _prepare_encoder_decoder_kwargs_for_generation(self, inputs_tensor, model_kwargs, model_input_name)
615 encoder_kwargs["return_dict"] = True
616 encoder_kwargs[model_input_name] = inputs_tensor
--> 617 model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)
618
619 return model_kwargs
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask, inputs_embeds, head_mask, cross_attn_head_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
1053 )
1054 else:
-> 1055 layer_outputs = layer_module(
1056 hidden_states,
1057 attention_mask=extended_attention_mask,
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
154 output = old_forward(*args, **kwargs)
155 else:
--> 156 output = old_forward(*args, **kwargs)
157 return module._hf_hook.post_forward(module, output)
158
/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, hidden_states, attention_mask, position_bias, encoder_hidden_states, encoder_attention_mask, encoder_decoder_position_bias, layer_head_mask, cross_attn_layer_head_mask, past_key_value, use_cache, output_attentions, return_dict)
685 self_attn_past_key_value, cross_attn_past_key_value = None, None
686
--> 687 self_attention_outputs = self.layer[0](
688 hidden_states,
689 attention_mask=attention_mask,
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
154 output = old_forward(*args, **kwargs)
155 else:
--> 156 output = old_forward(*args, **kwargs)
157 return module._hf_hook.post_forward(module, output)
158
/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, hidden_states, attention_mask, position_bias, layer_head_mask, past_key_value, use_cache, output_attentions)
591 ):
592 normed_hidden_states = self.layer_norm(hidden_states)
--> 593 attention_output = self.SelfAttention(
594 normed_hidden_states,
595 mask=attention_mask,
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
154 output = old_forward(*args, **kwargs)
155 else:
--> 156 output = old_forward(*args, **kwargs)
157 return module._hf_hook.post_forward(module, output)
158
/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, hidden_states, mask, key_value_states, position_bias, past_key_value, layer_head_mask, query_length, use_cache, output_attentions)
510
511 # get query states
--> 512 query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, dim_per_head)
513
514 # get key/value states
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
154 output = old_forward(*args, **kwargs)
155 else:
--> 156 output = old_forward(*args, **kwargs)
157 return module._hf_hook.post_forward(module, output)
158
/usr/local/lib/python3.8/dist-packages/bitsandbytes/nn/modules.py in forward(self, x)
252 self.bias.data = self.bias.data.half()
253
--> 254 out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
255
256 if not self.state.has_fp16_weights:
/usr/local/lib/python3.8/dist-packages/bitsandbytes/autograd/_functions.py in matmul(A, B, out, state, threshold, bias)
403 if threshold > 0.0:
404 state.threshold = threshold
--> 405 return MatMul8bitLt.apply(A, B, out, bias, state)
/usr/local/lib/python3.8/dist-packages/bitsandbytes/autograd/_functions.py in forward(ctx, A, B, out, bias, state)
255 else:
256 if not state.has_fp16_weights and state.CxB is None:
--> 257 state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
258 subA = None
259
/usr/local/lib/python3.8/dist-packages/bitsandbytes/functional.py in transform(A, to_order, from_order, out, transpose, state, ld)
1696
1697 def transform(A, to_order, from_order='row', out=None, transpose=False, state=None, ld=None):
-> 1698 prev_device = pre_call(A.device)
1699 if state is None: state = (A.shape, from_order)
1700 else: from_order = state[1]
AttributeError: 'NoneType' object has no attribute 'device'
This functionality can be achieved in transformers with a somewhat hacky workaround:
https://github.com/huggingface/transformers/pull/20281#issuecomment-1345605654
In the end it was implemented in transformers in https://github.com/huggingface/transformers/pull/21579.
Here's the working version:
import torch
from transformers import BitsAndBytesConfig, pipeline

device_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.ln_f": "cpu",
    "lm_head": 0,
    "transformer.h.0": 0,
    "transformer.h.1": "cpu",
    "transformer.h.2": "cpu",
    "transformer.h.3": "cpu",
    "transformer.h.4": "cpu",
    "transformer.h.5": "cpu",
    "transformer.h.6": "cpu",
    "transformer.h.7": "cpu",
    "transformer.h.8": "cpu",
    "transformer.h.9": "cpu",
    "transformer.h.10": "cpu",
    "transformer.h.11": "cpu"
}

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
    llm_int8_skip_modules=["lm_head"]
)

pipe = pipeline(
    model="EleutherAI/gpt-neo-125M",
    max_length=32,
    torch_dtype=torch.float16,
    model_kwargs={
        "device_map": device_map,
        "quantization_config": quantization_config
    }
)

print("\n", pipe("It was")[0]["generated_text"])
The key here is llm_int8_enable_fp32_cpu_offload=True, which enables CPU offloading (I'm not sure about disk offloading).
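For anyone loading the model directly instead of through pipeline, roughly the same setup can be expressed with from_pretrained. The snippet below is a sketch reusing the device_map defined above; the generation call is an addition for illustration, not part of the original comment:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,  # the custom GPU/CPU map defined above
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=True,  # keep CPU-mapped modules in fp32
        llm_int8_skip_modules=["lm_head"],
    ),
)
inputs = tokenizer("It was", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_length=32)[0]))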
Is it technically possible to make bitsandbytes support 8bit on CPU?
There was no answer to this question, but if it is impossible or not feasible, then I guess this issue can be closed.
I'm getting this error with device_map set to "auto" and load_in_8bit set to True. Any possible cause?
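One thing worth checking (a guess, not a confirmed diagnosis): even with device_map="auto", accelerate may place some modules on "cpu" or "disk" when GPU memory is tight, which triggers exactly this error unless llm_int8_enable_fp32_cpu_offload=True is set. The resolved placement can be inspected like this:

# `model` is whatever was returned by from_pretrained (or pipe.model for a pipeline);
# any "cpu" or "disk" entries in the map would explain the failure.
print(model.hf_device_map)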
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.