
[Bug] Inference with the DeepSeek-V3 q4km quantized model on Ascend 310P fails with "call hccl api failed" / "Failed to allocate memory"

Open · fanzetian opened this issue 6 months ago · 4 comments

Checklist

  • [ ] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [ ] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

Running `python /home/ktransformers-main/ktransformers/server/main.py` with the DeepSeek-V3 q4km GGUF model fails with:

```
2025-10-29 03:01:27,790 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:298: ImportWarning:
    The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
    The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
    The backend in torch.distributed.init_process_group set to hccl now..
    The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
    The device parameters have been replaced with npu in the function below:
    torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
  warnings.warn(msg, ImportWarning)
/usr/local/lib64/python3.11/site-packages/torch_npu/contrib/transfer_to_npu.py:255: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
  warnings.warn(msg, RuntimeWarning)
/usr/local/lib64/python3.11/site-packages/ktransformers/server/api/ollama/completions.py:257: PydanticDeprecatedSince20: Support for class-based config is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
  class OllamaShowResponse(BaseModel):
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
flash_attn not found, flashinfer unit test needed it. If you are using balance serve, ignore this.
set start method
Connected to server at tcp://localhost:41617
2025-10-29 03:01:36,526 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
[... the same transfer_to_npu and Pydantic warnings repeat for the second process ...]
flash_attn not found, flashinfer unit test needed it. If you are using balance serve, ignore this.
start method already set to spawn
2025-10-29 03:01:48,232 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
found flashinfer
[... the same transfer_to_npu and Pydantic warnings repeat for the third process ...]
start to init process group ------rank is 0, world_size is 1
[W1029 03:01:51.812052080 socket.cpp:752] [c10d] The client socket cannot be initialized to connect to [localhost]:31777 (errno: 97 - Address family not supported by protocol).
init process group success ------rank is 0, world_size is 1
Connected to server at tcp://localhost:41617
args.architectures: DeepSeek-Coder-V2-Instruct
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
/usr/lib64/python3.11/tempfile.py:904: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpf2t8y3jt'>
  _warnings.warn(warn_message, ResourceWarning)
sys:1: DeprecationWarning: builtin type swigvarlink has no module attribute
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib64/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 403, in run_engine
    engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 255, in __init__
    torch.distributed.barrier(group=tp_group)
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
    work = group.barrier(opts=opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: create_config:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:148 HCCL function error: hcclCommInitRootInfoConfig(numRanks, &rootInfo, rank, config, &(comm->hcclComm)), error code is 2
[ERROR] 2025-10-29-03:02:35 (PID:501951, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EL0004: [PID: 501951] 2025-10-29-03:02:35.297.548 Failed to allocate memory.
        Possible Cause: Available memory is insufficient.
        Solution: Close applications not in use.
TraceBack (most recent call last):
        Failed to allocate resource[DeviceMemory] with info [size:32]. Reason: Memory resources are exhausted.
/usr/lib64/python3.11/multiprocessing/process.py:330: ResourceWarning: Unclosed socket <zmq.Socket(zmq.PUB) at 0xfffd27733460>
  traceback.print_exc()
/usr/lib64/python3.11/multiprocessing/process.py:330: ResourceWarning: Unclosed context <zmq.Context() at 0xfffcea209c10>
  traceback.print_exc()
sys:1: DeprecationWarning: builtin type swigvarlink has no module attribute
```
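The traceback shows the crash happens at the `torch.distributed.barrier(group=tp_group)` call in `balance_serve.py`, i.e. during HCCL communicator creation rather than during model loading. A minimal sketch to isolate that step from ktransformers (a hypothetical standalone script; the rendezvous address and port are arbitrary) would be:

```python
# Minimal isolation test: initialize a single-rank HCCL process group and run
# the same barrier that fails in balance_serve.py. If this also raises
# hcclCommInitRootInfoConfig error code 2, the problem is in HCCL on the 310P
# rather than in ktransformers itself.
import os

import torch
import torch_npu  # noqa: F401  -- registers the "hccl" backend with torch.distributed

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # arbitrary local rendezvous
os.environ.setdefault("MASTER_PORT", "29500")

torch.npu.set_device(0)
torch.distributed.init_process_group(backend="hccl", rank=0, world_size=1)
print("init_process_group succeeded")

torch.distributed.barrier()  # the call that dies in the log above
print("barrier succeeded")

torch.distributed.destroy_process_group()
```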

Reproduction

```
python /home/ktransformers-main/ktransformers/server/main.py \
  --model_path /models/deepseek/deepseek-v3-config/ \
  --gguf_path /models/deepseek/deepseek-v3-gguf/ \
  --cpu_infer 120 \
  --optimize_config_path /home/ktransformers-main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-npu.yaml \
  --backend_type balance_serve \
  --port 31444 \
  --architectures KDeepseekV3ForCausalLM \
  --max_new_tokens 128 \
  --max_batch_size 4 \
  --use_cuda_graph \
  --tp 1
```
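Since the failure is device-memory exhaustion during HCCL setup, one triage step is to see whether the smallest possible configuration gets past the barrier. The variant below only changes values of flags already present above, and drops `--use_cuda_graph` on the assumption that graph capture reserves extra device memory; it is a sketch for narrowing down the failure, not a known-good configuration:

```
python /home/ktransformers-main/ktransformers/server/main.py \
  --model_path /models/deepseek/deepseek-v3-config/ \
  --gguf_path /models/deepseek/deepseek-v3-gguf/ \
  --cpu_infer 120 \
  --optimize_config_path /home/ktransformers-main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-npu.yaml \
  --backend_type balance_serve \
  --port 31444 \
  --architectures KDeepseekV3ForCausalLM \
  --max_new_tokens 32 \
  --max_batch_size 1 \
  --tp 1
```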

Environment

Installed RPMs:

```
Ascend-cann-toolkit-8.2.RC1-linux.aarch64
Ascend-cann-nnal-8.2.RC1-linux.aarch64
Ascend-cann-kernels-310p-8.2.RC1-linux.aarch64
```

Key pip package versions:

```
ktransformers  0.3.2+npu2.5.1.post1torch25aarch64
torch          2.5.1
torch-npu      2.5.1.post1
torchaudio     2.5.1
torchvision    0.20.1
transformers   4.57.1
```

NPU and driver info:

```
npu-smi info
+--------------------------------------------------------------------------------------------------------+
| npu-smi 25.2.2                   Version: 25.2.2                                                        |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU   Name                    | Health          | Power(W)   Temp(C)   Hugepages-Usage(page)            |
| Chip  Device                  | Bus-Id          | AICore(%)  Memory-Usage(MB)                           |
+===============================+=================+======================================================+
| 1     310P3                   | OK              | NA         47        0 / 0                            |
| 0     0                       | 0000:01:00.0    | 0          1872 / 23047                               |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU   Chip                    | Process id      | Process name         | Process memory(MB)             |
+===============================+=================+======================================================+
| No running processes found in NPU 1                                                                     |
+===============================+=================+======================================================+
```
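npu-smi reports only 1872 of 23047 MB used on device 0, so the device itself looks mostly free. A small check from inside the Python environment can confirm what the process actually sees; this is a sketch that assumes your torch_npu build exposes a `torch.cuda`-style `mem_get_info` under `torch.npu`:

```python
# Hypothetical pre-flight check: print free/total device memory as seen by
# torch_npu. Assumes torch.npu.mem_get_info mirrors torch.cuda.mem_get_info;
# if this build lacks it, npu-smi remains the source of truth.
import torch
import torch_npu  # noqa: F401  -- registers the NPU device backend

torch.npu.set_device(0)
free, total = torch.npu.mem_get_info(0)
print(f"NPU 0: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB total")
```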

fanzetian · Oct 29 '25

Hello, the Atlas 300I A2 inference card in the README uses the Ascend 910B4 32 GB chip; Ascend 310P hardware is not supported for now.

ShiyaNiu · Oct 30 '25

> Hello, the Atlas 300I A2 inference card in the README uses the Ascend 910B4 32 GB chip; Ascend 310P hardware is not supported for now.

Hello, thanks for the reply. If I want to do the adaptation myself, what work would be involved?

fanzetian · Oct 30 '25

> Hello, the Atlas 300I A2 inference card in the README uses the Ascend 910B4 32 GB chip; Ascend 310P hardware is not supported for now.

> Hello, thanks for the reply. If I want to do the adaptation myself, what work would be involved?

Hi, was this resolved in the end?

xwqtju · Nov 06 '25

> Hello, the Atlas 300I A2 inference card in the README uses the Ascend 910B4 32 GB chip; Ascend 310P hardware is not supported for now.

> Hello, thanks for the reply. If I want to do the adaptation myself, what work would be involved?

> Hi, was this resolved in the end?

Not yet. I also ran the hccl_test performance benchmark that ships with CANN, and it fails with the same memory-allocation error. I've reported it to the Ascend team and am waiting for their reply.

fanzetian · Nov 06 '25