No effect with AMD drivers released after July 2024; with 2025 NVIDIA drivers inference hangs; older GPU drivers work fine
detail
As described in the title.
Using a YOLOv8 model.
The latest NVIDIA driver reports:

```
vkWaitForFences failed -4
FATAL ERROR! reclaim_blob_allocator get wild allocator 000001515ADBB770
FATAL ERROR! reclaim_staging_allocator get wild allocator 000001515ADBB650
vkQueueSubmit failed -4
FATAL ERROR! reclaim_blob_allocator get wild allocator 000001515ADBB770
FATAL ERROR! reclaim_staging_allocator get wild allocator 000001515ADBB650
vkQueueSubmit failed -4
FATAL ERROR! reclaim_blob_allocator get wild allocator 000001515ADBB770
FATAL ERROR! reclaim_staging_allocator get wild allocator 000001515ADBB650
vkQueueSubmit failed -4
FATAL ERROR! reclaim_blob_allocator get wild allocator 000001515ADBB770
FATAL ERROR! reclaim_staging_allocator get wild allocator 000001515ADBB650
```
Can you reproduce it with the ncnn example?
Code: https://github.com/Tencent/ncnn/blob/master/examples/yolov8.cpp Model: https://github.com/nihui/ncnn-assets/tree/master/models
It was indeed a model problem. It was trained with ultralytics 8.3; which version should I use?
Switched to ultralytics 8.2.103. The official yolov8s and yolov8n models both work fine now. My retrained s model still fails with the error below, while the retrained n model works:

```
[0 NVIDIA GeForce RTX 3070 Ti Laptop GPU]  queueC=2[8]  queueG=0[16]  queueT=1[2]
[0 NVIDIA GeForce RTX 3070 Ti Laptop GPU]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0
[0 NVIDIA GeForce RTX 3070 Ti Laptop GPU]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1
[0 NVIDIA GeForce RTX 3070 Ti Laptop GPU]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1
[0 NVIDIA GeForce RTX 3070 Ti Laptop GPU]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/1/1/1
vkWaitForFences failed -4
vkQueueSubmit failed -4
```
Please try the code from this PR branch and see whether it fixes the hang: https://github.com/Tencent/ncnn/pull/5953
The same problem now shows up with yolov4 as well; the test model and code have never been changed.

```
vkWaitForFences failed -4
vkQueueSubmit failed -4
```

opt.zip
@1027663760 @xiaoxingbobo
Temporary workaround: try the following settings before loading the model and see whether it avoids the problem.
```cpp
#if defined _WIN32
// workaround for windows + nvidia gpu + driver > 566
const ncnn::GpuInfo& gpu_info = ncnn::get_gpu_info(gpu_device);
if (gpu_info.vendor_id() == 0x10de && gpu_info.driver_id() == 4)
{
    int driver_version_major = (int)atof(gpu_info.queryDriverProperties().driverInfo);
    if (driver_version_major > 565)
    {
        opt.use_shader_local_memory = false;
        opt.use_cooperative_matrix = false;
    }
}
#endif
```
Feedback: with these settings applied, the errors no longer occur.
Tracked it down: this conv implementation may write out of bounds.

```
Convolution [ 40, 32, 8 *16] -> [ 40, 32, 3 *8]  kernel: 1 x 1  stride: 1 x 1
```
Related layers:

```
Concat      16_153  3 1 15_145_bn_mish 14_137_bn_mish_split_1 13_129_bn_mish_split_1 16_153 -23330=4,3,40,28,192 31=16
Convolution 17_156  1 1 16_153 17_156_bn_mish -23330=4,3,40,28,128 0=128 1=1 5=1 6=24576 9=5
```
Could you also take a look at this issue: https://github.com/Tencent/ncnn/issues/5695
Another workaround, which requires modifying allocator.h in the ncnn source, keeps the slowdown as small as possible:

```cpp
explicit VkWeightAllocator(const VulkanDevice* vkdev, size_t preferred_block_size = 0); // 8M
```
https://github.com/Tencent/ncnn/pull/6102
Second round of feedback: with opt.use_shader_local_memory = false and opt.use_cooperative_matrix = false set, ncnn crashes when running in CPU mode. Do these two options have some strange compatibility problem?
These two switches should only affect the GPU code path. Also, please try the changes in https://github.com/Tencent/ncnn/pull/6102
Setting use_shader_local_memory and use_cooperative_matrix to false works; applying only the #6102 change does not.