
ValueError("8-bit operations on `bitsandbytes` are not supported under CPU!")

Open Tianwei-She opened this issue 3 years ago • 8 comments

Hi Tim,

Thanks for your awesome work!

I'm using your method to load the largest BLOOM model (the BLOOM model with 176b parameters) onto 1 node with 8 GPUs.

model = AutoModelForCausalLM.from_pretrained(
                "bloom", 
                device_map="auto", 
                load_in_8bit=True,
            )

This works for all the smaller BLOOM models, e.g. bloom-7b1. However, when loading bloom (176b), I got the error "8-bit operations on `bitsandbytes` are not supported under CPU!".

File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 463, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2182, in from_pretrained
    raise ValueError("8-bit operations on `bitsandbytes` are not supported under CPU!")
ValueError: 8-bit operations on `bitsandbytes` are not supported under CPU!

In my understanding, this is because some modules of the model are automatically loaded onto CPU, which didn't happen to the smaller models. Is there a way to force the model to be loaded to GPU only? or do you have any advice on how to bypass this error? Thanks!!

Tianwei

Tianwei-She avatar Aug 15 '22 21:08 Tianwei-She

From my testing, the following happens when not enough memory is available on the GPU: HF accelerate's automatic device selection sees `device_map="auto"` and puts some layers on CPU; that device map with CPU layers is passed onward; the bnb code in HF transformers then sees the CPU layers and raises this confusing error message. My guess is that you lack enough GPU memory for BLOOM.
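The chain of events above boils down to inspecting a plain dict: the device map accelerate builds maps module names to devices, and any non-GPU value is what trips the error. A minimal plain-Python sketch (the module names below are illustrative, not a real BLOOM map):

```python
# Illustrative device_map of the kind accelerate produces: integer values
# are GPU indices; "cpu" / "disk" mean the module was offloaded.
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.h.0": 0,
    "transformer.h.1": 1,
    "transformer.h.2": "cpu",   # accelerate fell back to CPU here
    "transformer.ln_f": "disk",
}

# Any value that is not an integer GPU index means the module was offloaded,
# which is exactly the condition the transformers 8-bit path rejects.
offloaded = {name: dev for name, dev in device_map.items()
             if not isinstance(dev, int)}
print(offloaded)  # {'transformer.h.2': 'cpu', 'transformer.ln_f': 'disk'}
```

If this dict is non-empty, 8-bit loading will raise the CPU error before anything runs.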

aninrusimha avatar Aug 16 '22 19:08 aninrusimha

Hi @aninrusimha @Tianwei-She, I second what @aninrusimha said: this error is thrown when you don't have enough GPU RAM to fit your quantized model before it is assigned to the correct GPU device.
Could you also tell us what type of GPU you are using?

younesbelkada avatar Aug 16 '22 22:08 younesbelkada

Thanks for the reply! I'm using an AWS g5.48xlarge instance, which has 192 GiB of GPU memory.

Tianwei-She avatar Aug 16 '22 23:08 Tianwei-She

Actually, I am a bit surprised it didn't fit on your GPUs. Since I don't have access to these machines, could you please try to install transformers in dev mode, i.e.:

git clone https://github.com/huggingface/transformers
cd transformers
pip install -e ".[dev]"

And then add

print(device_map)

Just before this line: https://github.com/huggingface/transformers/blob/6d175c1129538b27230be170fc1184e8490e95ef/src/transformers/modeling_utils.py#L2181

Also, could you point me to the exact commands you are using (or better, send me the full script)? Thanks!

younesbelkada avatar Aug 16 '22 23:08 younesbelkada

I believe the main issue here is that you need to pass the max_memory dictionary as an argument. By default, the dictionary can allocate too much memory to the model, so that the mini-batch no longer fits on the GPU, and layers spill over to the CPU. This then causes the CPU error.

Either decrease the mini-batch size and sequence length until it fits, or use a max_memory dictionary that leaves a couple of GB of memory free on each GPU. So if you have 24 GB of memory per GPU, you want to use only 22-23 GB. However, BLOOM-176B might not fit with 22 GB and you may need slightly more, something like 22.5 GB, but I am not sure whether floating-point values are supported in the max_memory dictionary. @younesbelkada, do you know more?
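The headroom suggestion above can be sketched in plain Python. The GPU count and the 24 GB capacity here are assumptions for illustration; string values like `"22GB"` are the form `from_pretrained(..., max_memory=...)` accepts:

```python
# Build a max_memory dict that leaves a few GB free on every GPU
# for activations. Numbers are illustrative assumptions.
n_gpus = 8
capacity_gb = 24
headroom_gb = 2  # reserve for the mini-batch / activations

max_memory = {i: f"{capacity_gb - headroom_gb}GB" for i in range(n_gpus)}
print(max_memory)
```

This dict is then passed alongside `device_map="auto"` so accelerate caps each GPU at the stated amount.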

TimDettmers avatar Aug 17 '22 03:08 TimDettmers

Thanks for replying!

@younesbelkada I printed out the device_map, and there are indeed some modules not on GPU: 'transformer.h.69': 'disk', 'transformer.ln_f': 'disk'

{'transformer.word_embeddings': 0, 'lm_head': 0, 'transformer.word_embeddings_layernorm': 0, 'transformer.h.0': 0, 'transformer.h.1': 0, 'transformer.h.2': 0, 'transformer.h.3': 0, 'transformer.h.4': 0, 'transformer.h.5': 0, 'transformer.h.6': 1, 'transformer.h.7': 1, 'transformer.h.8': 1, 'transformer.h.9': 1, 'transformer.h.10': 1, 'transformer.h.11': 1, 'transformer.h.12': 1, 'transformer.h.13': 1, 'transformer.h.14': 1, 'transformer.h.15': 2, 'transformer.h.16': 2, 'transformer.h.17': 2, 'transformer.h.18': 2, 'transformer.h.19': 2, 'transformer.h.20': 2, 'transformer.h.21': 2, 'transformer.h.22': 2, 'transformer.h.23': 2, 'transformer.h.24': 3, 'transformer.h.25': 3, 'transformer.h.26': 3, 'transformer.h.27': 3, 'transformer.h.28': 3, 'transformer.h.29': 3, 'transformer.h.30': 3, 'transformer.h.31': 3, 'transformer.h.32': 3, 'transformer.h.33': 4, 'transformer.h.34': 4, 'transformer.h.35': 4, 'transformer.h.36': 4, 'transformer.h.37': 4, 'transformer.h.38': 4, 'transformer.h.39': 4, 'transformer.h.40': 4, 'transformer.h.41': 4, 'transformer.h.42': 5, 'transformer.h.43': 5, 'transformer.h.44': 5, 'transformer.h.45': 5, 'transformer.h.46': 5, 'transformer.h.47': 5, 'transformer.h.48': 5, 'transformer.h.49': 5, 'transformer.h.50': 5, 'transformer.h.51': 6, 'transformer.h.52': 6, 'transformer.h.53': 6, 'transformer.h.54': 6, 'transformer.h.55': 6, 'transformer.h.56': 6, 'transformer.h.57': 6, 'transformer.h.58': 6, 'transformer.h.59': 6, 'transformer.h.60': 7, 'transformer.h.61': 7, 'transformer.h.62': 7, 'transformer.h.63': 7, 'transformer.h.64': 7, 'transformer.h.65': 7, 'transformer.h.66': 7, 'transformer.h.67': 7, 'transformer.h.68': 7, 'transformer.h.69': 'disk', 'transformer.ln_f': 'disk'}

@TimDettmers I've added max_memory as an argument, but even with 23GB max_memory I'm still getting the error. The code I ran is:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

free_in_GB = int(torch.cuda.mem_get_info()[0]/1024**3)
# max_memory = f'{free_in_GB-2}GB'
# max_memory = f'{free_in_GB}GB'
max_memory = f'23GB'
n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}
print(max_memory)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", load_in_8bit=True, max_memory=max_memory)

torch.cuda.mem_get_info()[0]/1024**3 is 21.5, and nvidia-smi shows each GPU has 23028 MiB of memory.

I understand this is most likely caused by insufficient GPU memory; however, I'm wondering how the BLOOM model was able to run on 8x RTX 3090 GPUs with 24 GB of memory, as shown in the paper.
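A back-of-the-envelope check (plain arithmetic with assumed numbers) shows how tight the margin is: in int8, BLOOM's ~176B parameters take roughly one byte each, i.e. ~176 GB of weights, while 8 GPUs each reporting ~21.5 GB free give only ~172 GB, so with default allocation some layers spill to CPU/disk:

```python
# Rough capacity check; all numbers are approximations for illustration.
params_b = 176              # billions of parameters
weights_gb = params_b * 1   # ~1 byte per parameter in int8
free_per_gpu_gb = 21.5      # what torch.cuda.mem_get_info() reported above
n_gpus = 8

total_free_gb = free_per_gpu_gb * n_gpus
print(weights_gb, total_free_gb, weights_gb > total_free_gb)  # 176 172.0 True
```

This ignores activation memory entirely, which makes the real situation even tighter.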

Tianwei-She avatar Aug 23 '22 06:08 Tianwei-She

@TimDettmers btw, I also tried tuning the int8_threshold parameter; with int8_threshold = 0, the memory usage is the same as with the default int8_threshold = 6.0. Just wanted to confirm, is this expected? Thanks again for your help!

Tianwei-She avatar Aug 23 '22 06:08 Tianwei-She

It is expected that thresholds 0 and 6 use close to the same memory with the current implementation. The difference should be on the order of a couple of megabytes.

If you are still receiving an error, you can try to tweak the exact amounts of memory reserved for the model and the activations. You might want to use something between max_memory='22016MB' (21.5 GB) and max_memory='22784MB' (22.25 GB), which leaves the rest of the memory for the activations.
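The MB figures quoted above are just GiB amounts expressed as whole MiB, which sidesteps the open question of whether floating-point values like "21.5GB" are supported. A small sketch of the conversion (helper name is my own, not a library function):

```python
def gb_to_mb_string(gb):
    """Express a (possibly fractional) GiB amount as a whole-MiB string
    suitable for a max_memory dict value."""
    return f"{int(gb * 1024)}MB"

print(gb_to_mb_string(21.5))   # 22016MB
print(gb_to_mb_string(22.25))  # 22784MB
```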

What is also important in this case is the maximum memory used for activations during inference. If your sequence dimension during inference is large, you might run out of memory at some point because the margins are so small.

In that case, you need to re-tweak the max_memory parameters. It could also help to disable caching in the model, but I am not sure how to do that.
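One way to disable the caching mentioned above, assuming the standard transformers `use_cache` flag applies (not verified on BLOOM-176B at this scale), is a config fragment like:

```python
# Turning off the KV cache trades generation speed for lower activation
# memory; `model` is an already-loaded transformers model.
model.config.use_cache = False
# or per call:
outputs = model.generate(input_ids, use_cache=False)
```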

TimDettmers avatar Sep 05 '22 22:09 TimDettmers

I am closing this, as the issue is related to part of the model being placed on the CPU, which is currently managed by the accelerate library. If this is still relevant, please open an issue there.

Regarding the BLOOM model, I will try to debug the situation and post examples to run BLOOM in a setup similar to yours.

TimDettmers avatar Oct 27 '22 14:10 TimDettmers