CUDA error: operation not permitted

Open zengruizhao opened this issue 1 month ago • 15 comments

The following error occasionally occurs when using CUDA for inference. GPU: NVIDIA L20.

2025-12-17 02:44:08,499 - __main__ - ERROR - [worker] Error while handling request: CUDA error: operation not permitted
Search for `cudaErrorNotPermitted' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
  File "/opt/voxcpm/main.py", line 148, in worker
    for chunk in tts.generate_streaming(**kwargs):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/voxcpm/src/voxcpm/core.py", line 269, in _generate
    for wav, _, _ in generate_result:
                     ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "/opt/voxcpm/src/voxcpm/model/voxcpm.py", line 670, in _generate_with_prompt_cache
    for latent_pred, pred_audio_feat in inference_result:
                                        ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 38, in generator_context
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "/opt/voxcpm/src/voxcpm/model/voxcpm.py", line 742, in _inference
    feat_embed = self.feat_encoder(feat)  # [b, t, h_feat]
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 414, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 832, in compile_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/voxcpm/src/voxcpm/modules/locenc/local_encoder.py", line 17, in forward
    def forward(self, x):
  File "/usr/local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1130, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 724, in inner_fn
    outs = compiled_fn(args)
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
    return compiled_fn(runtime_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 613, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/torchinductor_root/u7/cu7iuglfv4fiy7yukmpkycnjjtrlt6ttrgfzj3ssjb5o5l4av6w4.py", line 2190, in call
    (buf182,) = self.partitions[0](partition0_args)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1772, in run
    return compiled_fn(new_inputs)  # type: ignore[arg-type]
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 388, in deferred_cudagraphify
    return fn(inputs)
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3017, in run
    out = model(new_inputs)
          ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 2012, in run
    out = self._run(new_inputs, function_id)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 2182, in _run
    return self.record_function(new_inputs, function_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 2219, in record_function
    node = CUDAGraphNode(
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 1037, in __init__
    self.recording_outputs: Optional[OutputType] = self._record(
                                                   ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/torch/_inductor/cudagraph_trees.py", line 1268, in _record
    torch.cuda.graph(
  File "/usr/local/lib/python3.12/site-packages/torch/cuda/graphs.py", line 265, in __exit__
    self.cuda_graph.capture_end()
  File "/usr/local/lib/python3.12/site-packages/torch/cuda/graphs.py", line 128, in capture_end
    super().capture_end()
torch.AcceleratorError: CUDA error: operation not permitted
Search for `cudaErrorNotPermitted' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

zengruizhao avatar Dec 17 '25 03:12 zengruizhao

Are you using multi-threading? Currently, models compiled with torch.compile do not support multi-threaded or multi-process calls. You can set optimize=False.
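
For reference, a minimal sketch of loading the model without compilation. This is an assumption-laden illustration: optimize is passed where the model is constructed (as the reply above implies), the checkpoint path matches the snippet later in this thread, and generate_streaming is the entry point shown in the traceback; adjust the names to your version.

from voxcpm import VoxCPM  # import path assumed from the repository layout

# load without torch.compile so the single-thread restriction on compiled models does not apply
tts = VoxCPM.from_pretrained(
    model="~/VoxCPM-0.5B",
    optimize=False,  # disable compilation (and the CUDA graph capture seen in the traceback)
)

# kwargs are illustrative; generate_streaming yields audio chunks as in the traceback above
for chunk in tts.generate_streaming(text="Hello from VoxCPM."):
    pass  # write each chunk to your audio sink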

liuxin99 avatar Dec 17 '25 03:12 liuxin99

Are you using multi-threading? Currently, models compiled with torch.compile do not support multi-threaded or multi-process calls. You can set optimize=False.

No, I didn't use multithreading; I simply ran the inference task in a single thread spawned with threading.Thread.
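
For context, the setup described is roughly the following (a hypothetical sketch that reproduces the pattern, not the actual service code):

import threading

from voxcpm import VoxCPM  # import path assumed

def infer_worker(tts, text):
    # all inference happens inside this one spawned thread, not the main thread
    for chunk in tts.generate_streaming(text=text):
        pass  # handle each audio chunk

tts = VoxCPM.from_pretrained(model="~/VoxCPM-0.5B")  # compiled model with default settings
t = threading.Thread(target=infer_worker, args=(tts, "hello"))
t.start()
t.join()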

zengruizhao avatar Dec 17 '25 03:12 zengruizhao

No, I didn't use multithreading; I simply ran the inference task in a single thread spawned with threading.Thread.

Running the compiled model from a sub-thread is also not supported; see https://github.com/OpenBMB/VoxCPM/issues/107#issuecomment-3630057542
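
Put differently, the compiled model should only be driven from the main thread. One hedged workaround (not something prescribed in this thread, purely an illustration) is to let request-handler threads enqueue work while the main thread owns the model and performs all inference:

import queue

requests = queue.Queue()  # request-handler threads put text here

def handle_request(text):
    # called from worker threads; never touches the compiled model
    requests.put(text)

def main_inference_loop(tts):
    # runs on the main thread, which owns the compiled model and does all inference
    while True:
        text = requests.get()
        for chunk in tts.generate_streaming(text=text):
            pass  # stream chunks back to the caller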

liuxin99 avatar Dec 17 '25 03:12 liuxin99

Why does using nanovllm consume significantly more GPU memory than using Torch directly, by more than 10 GB?

zengruizhao avatar Dec 19 '25 08:12 zengruizhao

Why does using nanovllm consume significantly more GPU memory than using Torch directly, by more than 10 GB?

Since nanovllm pre-allocates a fixed amount of GPU memory, you can control the usage by adjusting these parameters:

server = VoxCPM.from_pretrained(
    model="~/VoxCPM-0.5B",
    max_num_batched_tokens=8192,
    max_num_seqs=16,
    max_model_len=4096,
    gpu_memory_utilization=0.95,  # reduce these values to lower the pre-allocated memory
    enforce_eager=False,
    devices=[0],
)

liuxin99 avatar Dec 19 '25 08:12 liuxin99

What impact will reducing these parameters have on performance?

zengruizhao avatar Dec 19 '25 08:12 zengruizhao

What impact will reducing these parameters have on performance?

It generally has no impact on individual requests; it only affects the maximum concurrency.

liuxin99 avatar Dec 19 '25 09:12 liuxin99

It generally has no impact on individual requests; it only affects the maximum concurrency.

[screenshot of the assertion attached] This value cannot be set too small, otherwise an assertion error is triggered.

zengruizhao avatar Dec 19 '25 09:12 zengruizhao

@zengruizhao You can tune the values yourself until they fit; just make sure at least 5 GB of GPU memory is allocated.

liuxin99 avatar Dec 19 '25 09:12 liuxin99

Among the parameters there is a temperature that I don't recall seeing in the torch version. Is it the same as the temperature parameter of large language models, i.e. controlling the randomness of the output? What value would you recommend?

zengruizhao avatar Dec 19 '25 09:12 zengruizhao

The temperature currently has no effect, so you can ignore it. It depends on how much GPU memory you have: the configuration above uses about 22 GB on a 24 GB 4090. You could first try setting everything to a quarter of those values and see whether it runs.
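
A sketch of the quartered configuration suggested above (illustrative values only; leave enough room that at least about 5 GB can still be pre-allocated, as noted earlier):

server = VoxCPM.from_pretrained(
    model="~/VoxCPM-0.5B",
    max_num_batched_tokens=2048,  # 8192 / 4
    max_num_seqs=4,               # 16 / 4
    max_model_len=1024,           # 4096 / 4
    gpu_memory_utilization=0.95,  # or lower this fraction instead
    enforce_eager=False,
    devices=[0],
)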

liuxin99 avatar Dec 19 '25 09:12 liuxin99

The temperature currently has no effect, so you can ignore it. It depends on how much GPU memory you have: the configuration above uses about 22 GB on a 24 GB 4090. You could first try setting everything to a quarter of those values and see whether it runs.

Actually, this temperature parameter has a large effect: if it is set to 0.1, the generated audio is almost always followed by a long stretch of silence, all the way up to max_generate_length.

zengruizhao avatar Dec 25 '25 02:12 zengruizhao

Actually, this temperature parameter has a large effect: if it is set to 0.1, the generated audio is almost always followed by a long stretch of silence, all the way up to max_generate_length.

At the moment this parameter has no positive effect. Just set it to 1; other values can cause inference to break.

liuxin99 avatar Dec 25 '25 02:12 liuxin99

At the moment this parameter has no positive effect. Just set it to 1; other values can cause inference to break.

Thanks. Could you share your contact information? We are deploying VoxCPM for inference and have solved some technical issues along the way, so it would be handy to report related problems to you directly in the future. I couldn't find your email on your profile page; you can reach me at [email protected].

zengruizhao avatar Dec 25 '25 03:12 zengruizhao

Thanks. Could you share your contact information? We are deploying VoxCPM for inference and have solved some technical issues along the way, so it would be handy to report related problems to you directly in the future. I couldn't find your email on your profile page; you can reach me at [email protected].

You're welcome to join the official WeChat group and share suggestions there.

liuxin99 avatar Dec 30 '25 06:12 liuxin99