
RuntimeError: CUDA error: invalid device ordinal

elter-tef opened this issue 3 years ago · 13 comments

When I load the model, I get this error.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "test/env/lib/python3.9/site-packages/galai/__init__.py", line 39, in load_model
    model._load_checkpoint(checkpoint_path=get_checkpoint_path(name))
  File "test/env/lib/python3.9/site-packages/galai/model.py", line 63, in _load_checkpoint
    load_checkpoint_and_dispatch(
  File "test/env/lib/python3.9/site-packages/accelerate/big_modeling.py", line 366, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "test/env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 701, in load_checkpoint_in_model
    set_module_tensor_to_device(model, param_name, param_device, value=param)
  File "test/env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 124, in set_module_tensor_to_device
    new_value = value.to(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

elter-tef avatar Nov 15 '22 21:11 elter-tef

Trying this with

model = galai.load_model("base")

it looks like there is a device map that expects 8 GPUs, if I'm seeing this right:

{'decoder.embed_tokens': 0,
 'decoder.embed_positions': 0,
 'decoder.layer_norm': 0,
 'decoder.layers.0': 0,
 'decoder.layers.1': 0,
 'decoder.layers.2': 0,
 'decoder.layers.3': 1,
 'decoder.layers.4': 1,
 'decoder.layers.5': 1,
 'decoder.layers.6': 2,
 'decoder.layers.7': 2,
 'decoder.layers.8': 2,
 'decoder.layers.9': 3,
 'decoder.layers.10': 3,
 'decoder.layers.11': 3,
 'decoder.layers.12': 4,
 'decoder.layers.13': 4,
 'decoder.layers.14': 4,
 'decoder.layers.15': 5,
 'decoder.layers.16': 5,
 'decoder.layers.17': 5,
 'decoder.layers.18': 6,
 'decoder.layers.19': 6,
 'decoder.layers.20': 6,
 'decoder.layers.21': 7,
 'decoder.layers.22': 7,
 'decoder.layers.23': 7}
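Any ordinal in that map that is greater than or equal to torch.cuda.device_count() is an "invalid device ordinal", so on a single-GPU machine everything assigned to devices 1-7 above fails. A quick sanity check one could run before loading (hypothetical helper, not part of galai or accelerate):

import torch

def validate_device_map(device_map):
    # Hypothetical helper: fail fast when a device map references more
    # GPUs than are actually visible, instead of erroring deep inside
    # accelerate's set_module_tensor_to_device.
    available = torch.cuda.device_count()
    bad = {name: dev for name, dev in device_map.items()
           if isinstance(dev, int) and dev >= available}
    if bad:
        raise RuntimeError(
            f"device map uses GPU ordinal {max(bad.values())}, "
            f"but only {available} GPU(s) are visible"
        )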

dginev avatar Nov 15 '22 22:11 dginev

If you have fewer than the default number of GPUs (8), you have to specify how many when you load the model. Try: model = galai.load_model(name='base', num_gpus=1)
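A complete minimal sketch, assuming galai is installed and exactly one CUDA device is visible:

import galai

# Match num_gpus to the hardware actually present; the default
# device map at the time of this thread assumed 8 GPUs.
model = galai.load_model(name="base", num_gpus=1)

# Quick smoke test (the prompt is just an example).
print(model.generate("The Transformer architecture"))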

ZQ-Dev8 avatar Nov 15 '22 22:11 ZQ-Dev8

Thanks @dcruiz01, that worked like a charm. Unsure if it deserves a mention in the README, but thanks for letting us know! We can probably close this issue.

dginev avatar Nov 15 '22 22:11 dginev

Confirmed. I had the same error, and num_gpus = 1 resolved it.

metaphorz avatar Nov 16 '22 20:11 metaphorz

Please mention that in your documentation / readme.

KnutJaegersberg avatar Nov 17 '22 19:11 KnutJaegersberg

A model size between base and standard would be nice. I think standard just barely doesn't fit on my RTX 3090.

KnutJaegersberg avatar Nov 17 '22 19:11 KnutJaegersberg

Do you offer 8-bit versions/compatibility, like BLOOM?

KnutJaegersberg avatar Nov 17 '22 19:11 KnutJaegersberg

I see, dtype='float16' does the job, sorry. Please mention it in the README. Many folks will want to try this on a local GPU as well.
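For anyone else trying this, a sketch of what worked (assuming a single 24 GB GPU):

import galai

# Half precision roughly halves VRAM use, which is what lets the
# "standard" (6.7B) model fit on a 24 GB RTX 3090.
model = galai.load_model(name="standard", dtype="float16", num_gpus=1)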

KnutJaegersberg avatar Nov 17 '22 19:11 KnutJaegersberg

Hmm... 8-bit would still be handy for playing with the larger models. Is that possible?

KnutJaegersberg avatar Nov 17 '22 19:11 KnutJaegersberg

num_gpus defaults to None.

zzj0402 avatar Nov 18 '22 06:11 zzj0402

If you have fewer than the default number of GPUs (8)

Who has a default number of 8 GPUs?

Bachstelze avatar Nov 18 '22 16:11 Bachstelze

If you have fewer than the default number of GPUs (8)

Who has a default number of 8 GPUs?

people that work at Meta AI, probably XD

ZQ-Dev8 avatar Nov 18 '22 17:11 ZQ-Dev8

If you have fewer than the default number of GPUs (8), you have to specify how many when you load the model. Try: model = galai.load_model(name='base', num_gpus=1)

Why isn't this written on the main page?

FurkanGozukara avatar Nov 19 '22 13:11 FurkanGozukara

galai 1.1.0 uses all available GPUs by default, which should fix the issue. You can still manually specify the number of GPUs using the num_gpus parameter. Setting num_gpus=0 (or keeping the default None if no GPUs are available) will load the model into RAM. 8-bit inference is not supported yet. Please reopen if you still experience any issues.

mkardas avatar Dec 09 '22 10:12 mkardas
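For readers who want 8-bit in the meantime, the same checkpoints are also published on the Hugging Face Hub (e.g. facebook/galactica-6.7b corresponds to "standard"), so 8-bit inference can be done outside of galai through transformers with bitsandbytes. A sketch, assuming transformers, accelerate, and bitsandbytes of that era are installed:

from transformers import AutoTokenizer, OPTForCausalLM

# Galactica uses the OPT architecture; load_in_8bit quantizes the
# weights via bitsandbytes at load time.
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained(
    "facebook/galactica-6.7b",
    device_map="auto",
    load_in_8bit=True,
)

# Quick smoke test (the prompt is just an example).
inputs = tokenizer("The Transformer architecture", return_tensors="pt").to(model.device)
outputs = model.generate(inputs.input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))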