Unable to use load_in_8bit when the model is shared between GPU and CPU
It seems like bitsandbytes can't be used if the model is shared between GPU and CPU.
I could not find any information saying that the entire model must be loaded on the GPU in order to use bitsandbytes, so I'm not sure whether this is a bug or the expected behavior.
The environment setup:
pip install --extra-index-url https://download.pytorch.org/whl/cu116 torch==1.12.1+cu116
pip install transformers==4.22.1
pip install accelerate==0.12.0
pip install bitsandbytes==0.33.1
The main.py script:
from transformers import pipeline

auto_map = False
load_in_8bit = True

if auto_map:
    device_map = "auto"
else:
    device_map = {
        "transformer.wte": 0,
        "transformer.wpe": 0,
        "transformer.ln_f": "cpu",
        "lm_head": 0,
        "transformer.h.0": 0,
        "transformer.h.1": "cpu",
        "transformer.h.2": "cpu",
        "transformer.h.3": "cpu",
        "transformer.h.4": "cpu",
        "transformer.h.5": "cpu",
        "transformer.h.6": "cpu",
        "transformer.h.7": "cpu",
        "transformer.h.8": "cpu",
        "transformer.h.9": "cpu",
        "transformer.h.10": "cpu",
        "transformer.h.11": "cpu"
    }

pipe = pipeline(
    model="EleutherAI/gpt-neo-125M",
    max_length=32,
    model_kwargs={
        "device_map": device_map,
        "load_in_8bit": load_in_8bit
    }
)

print("\n", pipe("It was")[0]["generated_text"])
The auto_map and load_in_8bit variables at the top control the script's behavior.
When the script is run with auto_map = False and load_in_8bit = True, it crashes with this error:
❯ python main.py
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1.44k/1.44k [00:00<00:00, 634kB/s]
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/home/user/.gtkrc'), PosixPath('/etc/gtk/gtkrc')}
[... lots of similar warnings about non-existent paths ...]
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Traceback (most recent call last):
File "/home/user/test/bnb-test/main.py", line 37, in <module>
print("\n", pipe("It was")[0]["generated_text"])
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 176, in __call__
return super().__call__(text_inputs, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1074, in __call__
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1081, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 990, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 218, in _forward
generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/generation_utils.py", line 1319, in generate
return self.greedy_search(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/generation_utils.py", line 1713, in greedy_search
outputs = self(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 744, in forward
transformer_outputs = self.transformer(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 623, in forward
outputs = block(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 328, in forward
attn_outputs = self.attn(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 280, in forward
return self.attention(
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 224, in forward
query = self.q_proj(hidden_states)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
output = old_forward(*args, **kwargs)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 256, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 391, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 254, in forward
state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1604, in transform
prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'
All other combinations of auto_map and load_in_8bit run without errors and print the generated text.
Hey!
Thanks for your message.
Currently I don't think the CPU is supported for mixed 8-bit matrix multiplication (cc @TimDettmers), and 8-bit models in Hugging Face transformers are only supported with device_map="auto" (in other words, you cannot provide a custom device_map as in your snippet). However, I think this could be an interesting improvement: modules placed on the CPU should stay native (i.e., in their original dtype), and only modules placed on the GPU should be quantized.
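For reference, the path that currently works is to quantize the whole model on the GPU and let accelerate build the device map itself. A minimal sketch (the exact loading and generation calls below are an illustration, not code from this thread, and assume the model fits on GPU 0):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" lets accelerate place every module; with enough GPU memory
# everything lands on GPU 0 and the linear layers (except lm_head) are quantized.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
)
inputs = tokenizer("It was", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_length=32)[0]))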
I have opened an issue on Hugging Face transformers and will see what I can do:
https://github.com/huggingface/transformers/issues/19090
There is a PR (https://github.com/huggingface/transformers/pull/20281) that adds support for a custom device_map when load_in_8bit=True. However, it was decided that it will not be merged until bitsandbytes supports offloading weights to the CPU in 8-bit.
Is it technically possible for bitsandbytes to support 8-bit on the CPU? If it is not possible at all, then the transformers library may need to provide a way to offload weights to the CPU in their original dtype while still converting the GPU weights to 8-bit (if the developers ever agree to that).
Does this also apply when the model is shared between GPU and disk? My device map looks like this:
{'shared': 0,
'decoder.embed_tokens': 0,
'encoder.embed_tokens': 0,
'encoder.block.0': 0,
'encoder.block.1': 0,
'encoder.block.2': 0,
'encoder.block.3': 0,
'encoder.block.4': 0,
'encoder.block.5': 0,
'encoder.block.6': 0,
'encoder.block.7': 0,
'encoder.block.8': 0,
'encoder.block.10': 'disk',
'encoder.block.11': 'disk',
'encoder.block.12': 'disk',
'encoder.block.13': 'disk',
'encoder.block.14': 'disk',
'encoder.block.15': 'disk',
'encoder.block.16': 'disk',
'encoder.block.17': 'disk',
'encoder.block.18': 'disk',
'encoder.block.19': 'disk',
'encoder.block.20': 'disk',
'encoder.block.21': 'disk',
'encoder.block.22': 'disk',
'encoder.block.23': 'disk',
'encoder.final_layer_norm': 'disk',
'encoder.dropout': 'disk',
'decoder.block': 'disk',
'decoder.final_layer_norm': 'disk',
'decoder.dropout': 'disk',
'lm_head': 'disk',
'encoder.block.9': 'disk'}
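(For context, a GPU-plus-disk map like the one above is normally produced by accelerate rather than written by hand. The sketch below shows roughly how; the checkpoint name and memory limits are assumptions, not my exact setup.)

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Build an empty (meta-device) model so the map can be computed without loading weights.
config = AutoConfig.from_pretrained("google/flan-t5-xl")  # assumed checkpoint
with init_empty_weights():
    empty_model = AutoModelForSeq2SeqLM.from_config(config)

# With no CPU budget, whatever does not fit on GPU 0 is assigned to "disk".
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "10GiB", "cpu": "0GiB"},
    no_split_module_classes=["T5Block"],
)
print(device_map)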
Running the last line of the following code
input_text = "translate English to German: How old are you?"
input_ids = flan_t5_tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = flan_t5.generate(input_ids)
raises this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-9-de7998043414> in <module>
2 input_ids = flan_t5_tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
3
----> 4 outputs = flan_t5.generate(input_ids)
5 print(tokenizer.decode(outputs[0]))
19 frames
/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
28 return cast(F, decorate_context)
29
/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py in generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
1253 # if model is encoder decoder encoder_outputs are created
1254 # and added to `model_kwargs`
-> 1255 model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
1256 inputs_tensor, model_kwargs, model_input_name
1257 )
/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py in _prepare_encoder_decoder_kwargs_for_generation(self, inputs_tensor, model_kwargs, model_input_name)
615 encoder_kwargs["return_dict"] = True
616 encoder_kwargs[model_input_name] = inputs_tensor
--> 617 model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)
618
619 return model_kwargs
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask, inputs_embeds, head_mask, cross_attn_head_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
1053 )
1054 else:
-> 1055 layer_outputs = layer_module(
1056 hidden_states,
1057 attention_mask=extended_attention_mask,
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
154 output = old_forward(*args, **kwargs)
155 else:
--> 156 output = old_forward(*args, **kwargs)
157 return module._hf_hook.post_forward(module, output)
158
/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, hidden_states, attention_mask, position_bias, encoder_hidden_states, encoder_attention_mask, encoder_decoder_position_bias, layer_head_mask, cross_attn_layer_head_mask, past_key_value, use_cache, output_attentions, return_dict)
685 self_attn_past_key_value, cross_attn_past_key_value = None, None
686
--> 687 self_attention_outputs = self.layer[0](
688 hidden_states,
689 attention_mask=attention_mask,
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
154 output = old_forward(*args, **kwargs)
155 else:
--> 156 output = old_forward(*args, **kwargs)
157 return module._hf_hook.post_forward(module, output)
158
/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, hidden_states, attention_mask, position_bias, layer_head_mask, past_key_value, use_cache, output_attentions)
591 ):
592 normed_hidden_states = self.layer_norm(hidden_states)
--> 593 attention_output = self.SelfAttention(
594 normed_hidden_states,
595 mask=attention_mask,
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
154 output = old_forward(*args, **kwargs)
155 else:
--> 156 output = old_forward(*args, **kwargs)
157 return module._hf_hook.post_forward(module, output)
158
/usr/local/lib/python3.8/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, hidden_states, mask, key_value_states, position_bias, past_key_value, layer_head_mask, query_length, use_cache, output_attentions)
510
511 # get query states
--> 512 query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, dim_per_head)
513
514 # get key/value states
/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
154 output = old_forward(*args, **kwargs)
155 else:
--> 156 output = old_forward(*args, **kwargs)
157 return module._hf_hook.post_forward(module, output)
158
/usr/local/lib/python3.8/dist-packages/bitsandbytes/nn/modules.py in forward(self, x)
252 self.bias.data = self.bias.data.half()
253
--> 254 out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
255
256 if not self.state.has_fp16_weights:
/usr/local/lib/python3.8/dist-packages/bitsandbytes/autograd/_functions.py in matmul(A, B, out, state, threshold, bias)
403 if threshold > 0.0:
404 state.threshold = threshold
--> 405 return MatMul8bitLt.apply(A, B, out, bias, state)
/usr/local/lib/python3.8/dist-packages/bitsandbytes/autograd/_functions.py in forward(ctx, A, B, out, bias, state)
255 else:
256 if not state.has_fp16_weights and state.CxB is None:
--> 257 state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
258 subA = None
259
/usr/local/lib/python3.8/dist-packages/bitsandbytes/functional.py in transform(A, to_order, from_order, out, transpose, state, ld)
1696
1697 def transform(A, to_order, from_order='row', out=None, transpose=False, state=None, ld=None):
-> 1698 prev_device = pre_call(A.device)
1699 if state is None: state = (A.shape, from_order)
1700 else: from_order = state[1]
AttributeError: 'NoneType' object has no attribute 'device'
This functionality can be achieved in transformers with a somewhat hacky workaround:
https://github.com/huggingface/transformers/pull/20281#issuecomment-1345605654
In the end it was implemented in transformers in https://github.com/huggingface/transformers/pull/21579.
Here's the working version:
import torch
from transformers import BitsAndBytesConfig, pipeline

device_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.ln_f": "cpu",
    "lm_head": 0,
    "transformer.h.0": 0,
    "transformer.h.1": "cpu",
    "transformer.h.2": "cpu",
    "transformer.h.3": "cpu",
    "transformer.h.4": "cpu",
    "transformer.h.5": "cpu",
    "transformer.h.6": "cpu",
    "transformer.h.7": "cpu",
    "transformer.h.8": "cpu",
    "transformer.h.9": "cpu",
    "transformer.h.10": "cpu",
    "transformer.h.11": "cpu"
}

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
    llm_int8_skip_modules=["lm_head"]
)

pipe = pipeline(
    model="EleutherAI/gpt-neo-125M",
    max_length=32,
    torch_dtype=torch.float16,
    model_kwargs={
        "device_map": device_map,
        "quantization_config": quantization_config
    }
)

print("\n", pipe("It was")[0]["generated_text"])
The key here is llm_int8_enable_fp32_cpu_offload=True, which enables CPU offloading (I'm not sure about disk offloading).
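For anyone loading the model directly instead of through pipeline, roughly the same setup can be expressed with from_pretrained. The snippet below is a sketch reusing the device_map defined above; the generation call is an addition for illustration, not part of the original comment:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,  # the custom GPU/CPU map defined above
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=True,  # keep CPU-mapped modules in fp32
        llm_int8_skip_modules=["lm_head"],
    ),
)
inputs = tokenizer("It was", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_length=32)[0]))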
Is it technically possible to make bitsandbytes support 8bit on CPU?
There was no answer to this question, but if it is impossible or not feasible, then I guess this issue can be closed.
I'm getting this error with device_map set to "auto" and load_in_8bit set to True. Any possible cause?
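One thing worth checking (a guess, not a confirmed diagnosis): even with device_map="auto", accelerate may place some modules on "cpu" or "disk" when GPU memory is tight, which triggers exactly this error unless llm_int8_enable_fp32_cpu_offload=True is set. The resolved placement can be inspected like this:

# `model` is whatever was returned by from_pretrained (or pipe.model for a pipeline);
# any "cpu" or "disk" entries in the map would explain the failure.
print(model.hf_device_map)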
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.