No effect with AMD drivers released after July 2024; with 2025 NVIDIA drivers inference hangs; older GPU drivers work fine
detail
As described in the title.
Using a YOLOv8 model.
The latest NVIDIA driver reports:

```
vkWaitForFences failed -4
FATAL ERROR! reclaim_blob_allocator get wild allocator 000001515ADBB770
FATAL ERROR! reclaim_staging_allocator get wild allocator 000001515ADBB650
vkQueueSubmit failed -4
FATAL ERROR! reclaim_blob_allocator get wild allocator 000001515ADBB770
FATAL ERROR! reclaim_staging_allocator get wild allocator 000001515ADBB650
vkQueueSubmit failed -4
FATAL ERROR! reclaim_blob_allocator get wild allocator 000001515ADBB770
FATAL ERROR! reclaim_staging_allocator get wild allocator 000001515ADBB650
vkQueueSubmit failed -4
FATAL ERROR! reclaim_blob_allocator get wild allocator 000001515ADBB770
FATAL ERROR! reclaim_staging_allocator get wild allocator 000001515ADBB650
```
Can you reproduce it with the ncnn example?
Code: https://github.com/Tencent/ncnn/blob/master/examples/yolov8.cpp Model: https://github.com/nihui/ncnn-assets/tree/master/models
It was indeed a model problem. It was trained with ultralytics 8.3; which version should I use?
Switched to ultralytics 8.2.103. The official yolov8s and yolov8n models both work fine now. My retrained s model still fails with the error below, while the retrained n model works:

```
[0 NVIDIA GeForce RTX 3070 Ti Laptop GPU]  queueC=2[8]  queueG=0[16]  queueT=1[2]
[0 NVIDIA GeForce RTX 3070 Ti Laptop GPU]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[0 NVIDIA GeForce RTX 3070 Ti Laptop GPU]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1
[0 NVIDIA GeForce RTX 3070 Ti Laptop GPU]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1
[0 NVIDIA GeForce RTX 3070 Ti Laptop GPU]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/1/1/1
vkWaitForFences failed -4
vkQueueSubmit failed -4
```
Please try the code from this PR branch and see whether it fixes the hang: https://github.com/Tencent/ncnn/pull/5953
The same problem now shows up with yolov4 as well; the test model and code have never been changed.

```
vkWaitForFences failed -4
vkQueueSubmit failed -4
```

opt.zip
@1027663760 @xiaoxingbobo
Temporary workaround: try the following settings before loading the model and see whether it avoids the problem.
```cpp
#if defined _WIN32
// workaround for windows + nvidia gpu + driver > 566
const ncnn::GpuInfo& gpu_info = ncnn::get_gpu_info(gpu_device);
if (gpu_info.vendor_id() == 0x10de && gpu_info.driver_id() == 4)
{
    int driver_version_major = (int)atof(gpu_info.queryDriverProperties().driverInfo);
    if (driver_version_major > 565)
    {
        opt.use_shader_local_memory = false;
        opt.use_cooperative_matrix = false;
    }
}
#endif
```
Feedback: with these settings applied, the errors no longer occur.
Tracked it down: this conv implementation may write out of bounds.

```
Convolution [ 40, 32, 8 *16] -> [ 40, 32, 3 *8]  kernel: 1 x 1  stride: 1 x 1
```
Related layers:

```
Concat      16_153  3 1 15_145_bn_mish 14_137_bn_mish_split_1 13_129_bn_mish_split_1 16_153 -23330=4,3,40,28,192 31=16
Convolution 17_156  1 1 16_153 17_156_bn_mish -23330=4,3,40,28,128 0=128 1=1 5=1 6=24576 9=5
```
Could you also take a look at this issue: https://github.com/Tencent/ncnn/issues/5695
Another workaround, which requires modifying allocator.h in the ncnn source, keeps the slowdown as small as possible:

```cpp
explicit VkWeightAllocator(const VulkanDevice* vkdev, size_t preferred_block_size = 0); // 8M
```
https://github.com/Tencent/ncnn/pull/6102
Second round of feedback: with opt.use_shader_local_memory = false and opt.use_cooperative_matrix = false set, ncnn crashes when running in CPU mode. Do these two options have some strange compatibility problem?
These two switches should only affect the GPU code path. Also, please try the changes in https://github.com/Tencent/ncnn/pull/6102
Setting use_shader_local_memory and use_cooperative_matrix to false works; applying only the #6102 change does not.