CUDA error: operation not permitted
The following error occasionally occurs when using CUDA for inference. GPU: NVIDIA L20.
2025-12-17 02:44:08,499 - __main__ - ERROR - [worker] Error while handling request: CUDA error: operation not permitted
Search for `cudaErrorNotPermitted' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/opt/voxcpm/main.py", line 148, in worker
for chunk in tts.generate_streaming(**kwargs):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/voxcpm/src/voxcpm/core.py", line 269, in _generate
for wav, _, _ in generate_result:
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
response = gen.send(None)
^^^^^^^^^^^^^^
File "/opt/voxcpm/src/voxcpm/model/voxcpm.py", line 670, in _generate_with_prompt_cache
for latent_pred, pred_audio_feat in inference_result:
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
response = gen.send(None)
^^^^^^^^^^^^^^
File "/opt/voxcpm/src/voxcpm/model/voxcpm.py", line 742, in _inference
feat_embed = self.feat_encoder(feat) # [b, t, h_feat]
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 414, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 832, in compile_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/voxcpm/src/voxcpm/modules/locenc/local_encoder.py", line 17, in forward
def forward(self, x):
File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1130, in forward
return compiled_fn(full_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
all_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 724, in inner_fn
outs = compiled_fn(args)
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
return compiled_fn(runtime_args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 613, in __call__
return self.current_callable(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/torchinductor_root/u7/cu7iuglfv4fiy7yukmpkycnjjtrlt6ttrgfzj3ssjb5o5l4av6w4.py", line 2190, in call
(buf182,) = self.partitions[0](partition0_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1772, in run
return compiled_fn(new_inputs) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 388, in deferred_cudagraphify
return fn(inputs)
^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3017, in run
out = model(new_inputs)
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 2012, in run
out = self._run(new_inputs, function_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 2182, in _run
return self.record_function(new_inputs, function_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 2219, in record_function
node = CUDAGraphNode(
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 1037, in __init__
self.recording_outputs: Optional[OutputType] = self._record(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 1268, in _record
torch.cuda.graph(
File "/usr/local/lib/python3.12/site-packages/torch/cuda/graphs.py", line 265, in __exit__
self.cuda_graph.capture_end()
File "/usr/local/lib/python3.12/site-packages/torch/cuda/graphs.py", line 128, in capture_end
super().capture_end()
torch.AcceleratorError: CUDA error: operation not permitted
Search for `cudaErrorNotPermitted' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
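As the error text suggests, CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the stack trace points at the actual failing call. A minimal sketch of one way to set it, assuming it is applied before the first CUDA call (for example before importing torch):

import os

# Force synchronous CUDA kernel launches so asynchronous errors surface at the
# call that caused them. Must be set before torch initializes its CUDA context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set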
Are you using multi-threading? Currently, models compiled with torch.compile do not support multi-threaded or multi-process calls. You can set optimize=False.
No, I didn't use multithreading; I simply executed the inference task in a single thread via threading.Thread.
Running in a sub-thread is also not supported. https://github.com/OpenBMB/VoxCPM/issues/107#issuecomment-3630057542
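For reference, a minimal sketch of running inference from a worker thread with compilation disabled. It assumes optimize is a keyword argument of VoxCPM.from_pretrained (as the reply above suggests) and that generate_streaming accepts a text argument; check the installed version for the exact names.

import threading

from voxcpm import VoxCPM  # import path assumed from the repository layout

# optimize=False skips torch.compile, avoiding the CUDA graph capture that
# fails when a compiled model is driven from a sub-thread.
tts = VoxCPM.from_pretrained(model="~/VoxCPM-0.5B", optimize=False)

def worker():
    # "text" is a placeholder keyword; pass whatever generate_streaming expects.
    for chunk in tts.generate_streaming(text="Hello from a worker thread."):
        pass  # consume / write the audio chunks here

t = threading.Thread(target=worker)
t.start()
t.join()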
Why does using nanovllm consume significantly more GPU memory than using PyTorch directly, by more than 10 GB?
Since nanovllm pre-allocates a fixed amount of GPU memory, you can control it by adjusting the parameters.
server = VoxCPM.from_pretrained(
    model="~/VoxCPM-0.5B",
    max_num_batched_tokens=8192,
    max_num_seqs=16,
    max_model_len=4096,
    gpu_memory_utilization=0.95,  # reduce these parameters to shrink the pre-allocated memory
    enforce_eager=False,
    devices=[0],
)
What impact will this parameter have on performance?
It generally has no impact on individual requests; it only affects the maximum concurrency.
This value cannot be set too small, though, otherwise an assertion error will be triggered.
@zengruizhao You can tune the values yourself to find what fits; just make sure at least 5 GB of GPU memory is allocated.
One of the parameters is temperature, which I don't seem to see in the torch version. Is it the same as the temperature parameter of large language models, controlling output randomness? What value do you recommend?
Currently this temperature parameter has no effect, so you can ignore it. As for memory, it depends on how much you have: the configuration above uses around 22 GB on a 24 GB 4090. You could try setting all of the values to a quarter of those and see whether it still runs.
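For reference, a sketch of that quarter-sized configuration, reusing the keyword arguments from the example above; the exact numbers are only an assumption based on the divide-by-four suggestion, and the earlier comment advises keeping at least about 5 GB allocated.

# Quarter-sized sketch of the configuration above: each capacity knob divided
# by four to shrink nanovllm's pre-allocated GPU memory pool.
server = VoxCPM.from_pretrained(
    model="~/VoxCPM-0.5B",
    max_num_batched_tokens=2048,  # 8192 / 4
    max_num_seqs=4,               # 16 / 4
    max_model_len=1024,           # 4096 / 4
    gpu_memory_utilization=0.25,  # roughly 0.95 / 4
    enforce_eager=False,
    devices=[0],
)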
The temperature parameter actually has a big impact: if it is set to 0.1, the generated audio is almost always followed by a long stretch of silence until max_generate_length is reached.
Currently this parameter has no positive effect; just set it to 1. Other values can cause inference to crash.
Thanks. Could you share your contact information? We are deploying VoxCPM for inference and have also worked through some technical issues during deployment, so we could report related problems to you directly in the future. I couldn't find your email on your profile page; feel free to contact me at [email protected].
You can join the official WeChat group and raise suggestions there.