
RuntimeError: CUDA error: invalid device ordinal

elter-tef opened this issue 3 years ago · 13 comments

When I load the model, I get this error.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "test/env/lib/python3.9/site-packages/galai/__init__.py", line 39, in load_model
    model._load_checkpoint(checkpoint_path=get_checkpoint_path(name))
  File "test/env/lib/python3.9/site-packages/galai/model.py", line 63, in _load_checkpoint
    load_checkpoint_and_dispatch(
  File "test/env/lib/python3.9/site-packages/accelerate/big_modeling.py", line 366, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "test/env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 701, in load_checkpoint_in_model
    set_module_tensor_to_device(model, param_name, param_device, value=param)
  File "test/env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 124, in set_module_tensor_to_device
    new_value = value.to(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

elter-tef avatar Nov 15 '22 21:11 elter-tef

Trying this with

model = galai.load_model("base")

it looks like there is a device map that expects 8 GPUs, if I'm seeing this right:

{'decoder.embed_tokens': 0,
 'decoder.embed_positions': 0,
 'decoder.layer_norm': 0,
 'decoder.layers.0': 0,
 'decoder.layers.1': 0,
 'decoder.layers.2': 0,
 'decoder.layers.3': 1,
 'decoder.layers.4': 1,
 'decoder.layers.5': 1,
 'decoder.layers.6': 2,
 'decoder.layers.7': 2,
 'decoder.layers.8': 2,
 'decoder.layers.9': 3,
 'decoder.layers.10': 3,
 'decoder.layers.11': 3,
 'decoder.layers.12': 4,
 'decoder.layers.13': 4,
 'decoder.layers.14': 4,
 'decoder.layers.15': 5,
 'decoder.layers.16': 5,
 'decoder.layers.17': 5,
 'decoder.layers.18': 6,
 'decoder.layers.19': 6,
 'decoder.layers.20': 6,
 'decoder.layers.21': 7,
 'decoder.layers.22': 7,
 'decoder.layers.23': 7}
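Any ordinal in that map that is greater than or equal to torch.cuda.device_count() is an "invalid device ordinal", so on a single-GPU machine everything assigned to devices 1-7 above fails. A quick sanity check one could run before loading (hypothetical helper, not part of galai or accelerate):

import torch

def validate_device_map(device_map):
    # Hypothetical helper: fail fast when a device map references more
    # GPUs than are actually visible, instead of erroring deep inside
    # accelerate's set_module_tensor_to_device.
    available = torch.cuda.device_count()
    bad = {name: dev for name, dev in device_map.items()
           if isinstance(dev, int) and dev >= available}
    if bad:
        raise RuntimeError(
            f"device map uses GPU ordinal {max(bad.values())}, "
            f"but only {available} GPU(s) are visible"
        )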

dginev avatar Nov 15 '22 22:11 dginev

If you have fewer than the default number of GPUs (8), you have to specify how many when you load the model. Try: model = galai.load_model(name='base', num_gpus=1)
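A complete minimal sketch, assuming galai is installed and exactly one CUDA device is visible:

import galai

# Match num_gpus to the hardware actually present; the default
# device map at the time of this thread assumed 8 GPUs.
model = galai.load_model(name="base", num_gpus=1)

# Quick smoke test (the prompt is just an example).
print(model.generate("The Transformer architecture"))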

ZQ-Dev8 avatar Nov 15 '22 22:11 ZQ-Dev8

Thanks @dcruiz01, that worked like a charm. Unsure if it deserves a mention in the README, but thanks for letting us know! We can probably close this issue.

dginev avatar Nov 15 '22 22:11 dginev

Confirmed. I had the same error, and num_gpus = 1 resolved it.

metaphorz avatar Nov 16 '22 20:11 metaphorz

Please mention that in your documentation / readme.

KnutJaegersberg avatar Nov 17 '22 19:11 KnutJaegersberg

A model size between base and standard would be nice. I think standard just barely doesn't fit on my RTX 3090.

KnutJaegersberg avatar Nov 17 '22 19:11 KnutJaegersberg

Do you offer 8-bit versions/compatibility, like BLOOM?

KnutJaegersberg avatar Nov 17 '22 19:11 KnutJaegersberg

I see, dtype='float16' does the job, sorry. Please mention it in the README. Many folks will want to try this on a local GPU as well.
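For anyone else trying this, a sketch of what worked (assuming a single 24 GB GPU):

import galai

# Half precision roughly halves VRAM use, which is what lets the
# "standard" (6.7B) model fit on a 24 GB RTX 3090.
model = galai.load_model(name="standard", dtype="float16", num_gpus=1)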

KnutJaegersberg avatar Nov 17 '22 19:11 KnutJaegersberg

Hmm... 8-bit would still be handy for playing with the larger models. Is that possible?

KnutJaegersberg avatar Nov 17 '22 19:11 KnutJaegersberg

num_gpus defaults to None.

zzj0402 avatar Nov 18 '22 06:11 zzj0402

If you have fewer than the default number of GPUs (8)

Who has a default number of 8 GPUs?

Bachstelze avatar Nov 18 '22 16:11 Bachstelze

If you have fewer than the default number of GPUs (8)

Who has a default number of 8 GPUs?

people that work at Meta AI, probably XD

ZQ-Dev8 avatar Nov 18 '22 17:11 ZQ-Dev8

If you have fewer than the default number of GPUs (8), you have to specify how many when you load the model. Try: model = galai.load_model(name='base', num_gpus=1)

Why isn't this written on the main page?

FurkanGozukara avatar Nov 19 '22 13:11 FurkanGozukara

galai 1.1.0 uses all available GPUs by default, which should fix the issue. You can still manually specify the number of GPUs using the num_gpus parameter. Setting num_gpus=0 (or keeping the default None if no GPUs are available) will load the model into RAM. 8-bit inference is not supported yet. Please reopen if you still experience any issues.

mkardas avatar Dec 09 '22 10:12 mkardas
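For readers who want 8-bit in the meantime, the same checkpoints are also published on the Hugging Face Hub (e.g. facebook/galactica-6.7b corresponds to "standard"), so 8-bit inference can be done outside of galai through transformers with bitsandbytes. A sketch, assuming transformers, accelerate, and bitsandbytes of that era are installed:

from transformers import AutoTokenizer, OPTForCausalLM

# Galactica uses the OPT architecture; load_in_8bit quantizes the
# weights via bitsandbytes at load time.
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained(
    "facebook/galactica-6.7b",
    device_map="auto",
    load_in_8bit=True,
)

# Quick smoke test (the prompt is just an example).
inputs = tokenizer("The Transformer architecture", return_tensors="pt").to(model.device)
outputs = model.generate(inputs.input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))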