[bug]: RuntimeError: CUDA error: device-side assert triggered
Is there an existing issue for this?
- [X] I have searched the existing issues
OS
Windows
GPU
cuda
VRAM
4GB
What happened?
Running a new generation on a custom model, using the k_euler_a and k_dpmpp_2_a samplers, with a prompt of ~476 characters.
The error states the prompt is too long, but I have used this prompt before without problems.
I have updated to the latest InvokeAI version, 2.2.4; I did this using the manual git pull method and then running the reconfigure script.
Startup command: python scripts/invoke.py --web --no-nsfw_checker --model swpunk
>> Setting Sampler to k_euler_a
>> Prompt is 6 token(s) too long and has been truncated
>> Prompt is 2 token(s) too long and has been truncated
Generating: 0%| | 0/1 [00:00<?, ?it/s]>> Ksampler using model noise schedule (steps >= 30)
>> Sampling with k_euler_ancestral starting at step 0 of 32 (32 new sampling steps)
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\IndexKernel.cu:91: block: [6,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\IndexKernel.cu:91: block: [6,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...(the same assertion failure repeats for many other block/thread indices)...
These errors continue for 1308 lines in total, and then the following exception is thrown:
Traceback (most recent call last):
File "d:\ai\invokeai\ldm\generate.py", line 492, in prompt2image
results = generator.generate(
File "d:\ai\invokeai\ldm\invoke\generator\base.py", line 98, in generate
image = make_image(x_T)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "d:\ai\invokeai\ldm\invoke\generator\txt2img.py", line 42, in make_image
samples, _ = sampler.sample(
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "d:\ai\invokeai\ldm\models\diffusion\ksampler.py", line 226, in sample
K.sampling.__dict__[f'sample_{self.schedule}'](
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
denoised = model(x, sigmas[i] * s_in, **extra_args)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "d:\ai\invokeai\ldm\models\diffusion\ksampler.py", line 52, in forward
next_x = self.invokeai_diffuser.do_diffusion_step(x, sigma, uncond, cond, cond_scale)
File "d:\ai\invokeai\ldm\models\diffusion\shared_invokeai_diffusion.py", line 107, in do_diffusion_step
unconditioned_next_x, conditioned_next_x = self.apply_standard_conditioning(x, sigma, unconditioning, conditioning)
File "d:\ai\invokeai\ldm\models\diffusion\shared_invokeai_diffusion.py", line 123, in apply_standard_conditioning
unconditioned_next_x, conditioned_next_x = self.model_forward_callback(x_twice, sigma_twice,
File "d:\ai\invokeai\ldm\models\diffusion\ksampler.py", line 38, in <lambda>
model_forward_callback=lambda x, sigma, cond: self.inner_model(x, sigma, cond=cond))
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\k_diffusion\external.py", line 114, in forward
eps = self.get_eps(input * c_in, self.sigma_to_t(sigma), **kwargs)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\k_diffusion\external.py", line 140, in get_eps
return self.inner_model.apply_model(*args, **kwargs)
File "d:\ai\invokeai\ldm\models\diffusion\ddpm.py", line 1441, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "d:\ai\invokeai\ldm\models\diffusion\ddpm.py", line 2167, in forward
out = self.diffusion_model(x, t, context=cc)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "d:\ai\invokeai\ldm\modules\diffusionmodules\openaimodel.py", line 806, in forward
h = module(h, emb, context)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "d:\ai\invokeai\ldm\modules\diffusionmodules\openaimodel.py", line 88, in forward
x = layer(x, context)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "d:\ai\invokeai\ldm\modules\attention.py", line 271, in forward
x = block(x, context=context)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "d:\ai\invokeai\ldm\modules\attention.py", line 221, in forward
return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
File "d:\ai\invokeai\ldm\modules\diffusionmodules\util.py", line 159, in checkpoint
return func(*inputs)
File "d:\ai\invokeai\ldm\modules\attention.py", line 226, in _forward
x += self.attn2(self.norm2(x.clone()), context=context)
File "C:\Users\username\anaconda3\envs\invokeai\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "d:\ai\invokeai\ldm\modules\attention.py", line 199, in forward
r = self.get_invokeai_attention_mem_efficient(q, k, v)
File "d:\ai\invokeai\ldm\models\diffusion\cross_attention_control.py", line 291, in get_invokeai_attention_mem_efficient
return self.einsum_op_cuda(q, k, v)
File "d:\ai\invokeai\ldm\models\diffusion\cross_attention_control.py", line 285, in einsum_op_cuda
return self.einsum_op_tensor_mem(q, k, v, mem_free_total / 3.3 / (1 << 20))
File "d:\ai\invokeai\ldm\models\diffusion\cross_attention_control.py", line 264, in einsum_op_tensor_mem
return self.einsum_lowest_level(q, k, v, None, None, None)
File "d:\ai\invokeai\ldm\models\diffusion\cross_attention_control.py", line 229, in einsum_lowest_level
self.attention_slice_calculated_callback(attention_slice, dim, offset, slice_size)
File "d:\ai\invokeai\ldm\models\diffusion\shared_invokeai_diffusion.py", line 69, in <lambda>
lambda slice, dim, offset, slice_size, key=key: callback(slice, dim, offset, slice_size, key))
File "d:\ai\invokeai\ldm\models\diffusion\shared_invokeai_diffusion.py", line 61, in callback
saver.add_attention_maps(slice, key)
File "d:\ai\invokeai\ldm\models\diffusion\cross_attention_map_saving.py", line 39, in add_attention_maps
self.collated_maps[key_and_size] += maps.cpu()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
>> Could not generate image.
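As the CUDA message suggests, re-running with CUDA_LAUNCH_BLOCKING=1 makes the traceback point at the call that actually failed. A minimal sketch of setting it, assuming the variable only needs to be in the environment before torch initializes CUDA (equivalently, `set CUDA_LAUNCH_BLOCKING=1` in the Windows shell before launching invoke.py):

```python
import os

# Force synchronous CUDA kernel launches so the Python traceback points at
# the call that actually triggered the device-side assert.
# Must run before torch initializes CUDA.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```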
Screenshots
No response
Additional context
No response
Contact Details
No response
I have been testing it on the command line to debug the tokens; the prompt did go over (way over) 77 tokens, but apart from that I have never seen an error like the one reported above.
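For anyone who wants to check the token count themselves, here is a rough sketch using the Hugging Face CLIP tokenizer. It assumes the model uses the standard openai/clip-vit-large-patch14 tokenizer, so the count may differ slightly from what InvokeAI reports:

```python
from transformers import CLIPTokenizer

# Stable Diffusion 1.x models use the CLIP ViT-L/14 tokenizer; prompts are
# limited to 77 tokens including the begin/end-of-text markers.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "snthwve style nvinkpunk a drunk beautiful woman as delirium from sandman, ..."
token_ids = tokenizer(prompt).input_ids
print(f"{len(token_ids)} tokens (limit is 77, including BOS/EOS)")
```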

Here is where it went wrong:
"(snthwve style)+ (nvinkpunk)- a drunk beautiful woman as delirium from sandman, (hallucinating colorful soap bubbles)+, by jeremy mann, by sandra chevrier, by dave mckean and richard avedon and maciej kuciara, punk rock, tank girl, high detailed, 8k, sharp focus, natural lighting, subsurface scattering, F2, 35mm" -s 32 -W 512 -H 512 -C 7 -A k_euler_a --log_tokenization
>> Parsed prompt to FlattenedPrompt:[Fragment:'snthwve style'@1.1, Fragment:'nvinkpunk'@0.9, Fragment:'a drunk beautiful woman as delirium from sandman,'@1.0, Fragment:'hallucinating colorful soap bubbles'@1.1, Fragment:', by jeremy mann, by sandra chevrier, by dave mckean and richard avedon and maciej kuciara, punk rock, tank girl, high detailed, 8k, sharp focus, natural lighting, subsurface scattering, F2, 35mm'@1.0]
>> Parsed negative prompt to FlattenedPrompt:[Fragment:''@1.0]
>> Prompt is 3 token(s) too long and has been truncated
>> Tokens (prompt) (77):
snthwve style nvinkpunk a drunk beautiful woman as delirium from sandman , hallucinating colorful soap bubbles , by jeremy mann , by sandra chevrier , by dave mckean and richard avedon and maciej kuciara , punk rock , tank girl , high detailed , 8 k , sharp focus , natural lighting , subsurface scattering , f 2 , 3 5
>> Tokens Discarded (1):
mm
>> Tokens (unconditioning) (0):
Generating: 0%| | 0/1 [00:00<?, ?it/s]>> Ksampler using model noise schedule (steps >= 30)
>> Sampling with k_euler_ancestral starting at step 0 of 32 (32 new sampling steps)
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\IndexKernel.cu:91: block: [30,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...(same errors as shown before)...
...(stacktrace as shown before)...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
So when the prompt is too long, tokens are discarded as expected, but when you add multiple (...) weighting groups it goes off the rails completely.
This happens with all samplers, so if I had to guess, the problem resides in the tokenization/parsing process.
This is basically the same issue as the one mentioned in #1908; there is a pull request in for a "fix", which is basically just to exit before calling the function that is failing. The instructions in that issue do essentially the same thing, as sketched below, until the pull request is merged into the main repo.
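In case it helps anyone hitting this before that lands, the workaround amounts to guarding the accumulation that crashes in cross_attention_map_saving.py. The sketch below is only illustrative: the method and attribute names come from the traceback above, but how key_and_size is built (the _key_for helper here) is a hypothetical stand-in, and the real fix in the pull request may look different:

```python
# Hypothetical guard inside the attention map saver (add_attention_maps is the
# method from the traceback; _key_for is a made-up helper for illustration).
def add_attention_maps(self, maps, key):
    key_and_size = self._key_for(key, maps)
    existing = self.collated_maps.get(key_and_size)
    # Bail out instead of indexing out of bounds when the incoming slice no
    # longer matches the pre-allocated buffer (e.g. after prompt truncation).
    if existing is None or existing.shape != maps.shape:
        return
    self.collated_maps[key_and_size] += maps.cpu()
```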
Fixed in #1999