
BUG ConnectX-4 Lx NIC intermittently drops out on boot or reboot

Open gasment opened this issue 1 year ago • 1 comments

Please fill in the following information.

Install ENV: (You can find it in the boot interface.)

  • DMI: qemu
  • CPU:
  • NIC: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]

RR version: (You can find it in the update menu.)

  • RR: 24.9.1
  • addons:
  • modules:
  • lkms:

DSM:

  • model: DS920+
  • version: 7.2.2

Issue:
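On boot or reboot, the ConnectX-4 Lx NIC intermittently fails to initialize. No eth interface is created, so the machine loses network connectivity. I also tried another model, SA6400, and the result is the same.
There is no problem during the RR stage; the NIC only drops out, intermittently, once boot reaches the DSM kernel.
logs: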

SynologyNAS> [ 126.411146] mlx5_core 0000:00:11.0: 0000:00:11.0:wait_func:790:(pid 3851): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
[ 126.414546] mlx5_core 0000:00:11.0: 0000:00:11.0:page_notify_fail:308:(pid 3851): page notify failed
[ 126.415060] mlx5_core 0000:00:11.0: 0000:00:11.0:wait_func:790:(pid 4999): ALLOC_UAR(0x802) timeout. Will cause a leak of a command resource
[ 126.415062] mlx5_core 0000:00:11.0: 0000:00:11.0:mlx5_alloc_map_uar:237:(pid 4999): mlx5_cmd_alloc_uar() failed, -110
[ 126.415064] mlx5_core 0000:00:11.0: 0000:00:11.0:mlx5e_create_netdev:2141:(pid 4999): alloc_map uar failed, -110
[ 126.415203] udevd[4999]: failed to send result of seq 1835 to main daemon: Connection refused
[ 126.426965] mlx5_core 0000:00:11.0: 0000:00:11.0:pages_work_handler:443:(pid 3851): give fail -110
^C
SynologyNAS> [ 186.428088] mlx5_core 0000:00:11.0: 0000:00:11.0:wait_func:790:(pid 3851): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
[ 186.431379] mlx5_core 0000:00:11.0: 0000:00:11.0:reclaim_pages:407:(pid 3851): failed reclaiming pages
[ 186.433866] mlx5_core 0000:00:11.0: 0000:00:11.0:pages_work_handler:443:(pid 3851): reclaim fail -110
[ 246.436094] mlx5_core 0000:00:11.0: 0000:00:11.0:wait_func:790:(pid 3851): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
[ 246.439622] mlx5_core 0000:00:11.0: 0000:00:11.0:reclaim_pages:407:(pid 3851): failed reclaiming pages
[ 246.442351] mlx5_core 0000:00:11.0: 0000:00:11.0:pages_work_handler:443:(pid 3851): reclaim fail -110
[ 306.445143] mlx5_core 0000:00:11.0: 0000:00:11.0:wait_func:790:(pid 3851): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
[ 306.447718] mlx5_core 0000:00:11.0: 0000:00:11.0:reclaim_pages:407:(pid 3851): failed reclaiming pages
[ 306.449502] mlx5_core 0000:00:11.0: 0000:00:11.0:pages_work_handler:443:(pid 3851): reclaim fail -110
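
For anyone triaging a similar log, here is a minimal sketch that counts the mlx5_core command timeouts per device and command. It assumes the kernel message format shown in the log above; the function name `summarize_timeouts` and the regular expression are illustrative only and are not part of RR or the mlx5 driver. The `-110` in these messages is the Linux errno for ETIMEDOUT.

```python
import re
from collections import Counter

# Assumed line shape (taken from the log in this issue):
# [ 126.411146] mlx5_core 0000:00:11.0: ...: MANAGE_PAGES(0x108) timeout. ...
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+(?P<dev>[0-9a-f:.]+):.*?"
    r"(?P<cmd>[A-Z_]+)\((?P<opcode>0x[0-9a-f]+)\) timeout"
)


def summarize_timeouts(log_text):
    """Count mlx5_core command timeouts per (PCI device, command name)."""
    counts = Counter()
    # Process one line at a time so a match never spans two log entries.
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("cmd"))] += 1
    return counts
```

Feeding it the saved `dmesg` text (one message per line) gives a quick view of which commands keep timing out, for example the repeated MANAGE_PAGES timeouts on 0000:00:11.0 above.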

00:11.0 Class 0200: Device 15b3:1015
    Subsystem: Device 15b3:0069
    Flags: bus master, fast devsel, latency 0, IRQ 10
    Memory at 7030000000 (64-bit, prefetchable) [size=32M]
    Expansion ROM at c1600000 [disabled] [size=1M]
    Capabilities: [60] Express Endpoint, IntMsgNum 0
    Capabilities: [48] Vital Product Data
    Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [40] Power Management version 3
    Kernel driver in use: mlx5_core

(## Because the log contains sensitive information such as SN/MAC, please redact it when providing the complete file. Of course, you can also send it to my email. ##)
...

(Please review the content of #173, #175, and #226 first)
...

(If you just say XXX doesn't work without providing any details, I can only say thank you for your feedback.)
...

Sep 30 '24 15:09 gasment

Please test with v24.12.1.

Dec 05 '24 13:12 wjz304