hwraid

megacli and megaclisas-status kill controller's FW

Open romeor opened this issue 3 years ago • 17 comments

Hello, I've installed megacli, megaclisas-status from your repository and ran into an issue with my hardware. First, my HW:

Linux pve2 5.19.17-1-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.17-1 (Mon, 14 Nov 2022 20:25:12  x86_64 GNU/Linux
18:00.0 RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx

The RAID controller is a 3916, to be precise, running the latest FW:

Firmware Package Build = 52.22.0-4571
Firmware Version = 5.220.02-3691
PSOC FW Version = 0x0017
PSOC Part Number = 15987-231-8GB
NVDATA Version = 5.2200.21-0585
CBB Version = 23.25.01.00
Bios Version = 7.22.00.0_0x07160300
HII Version = 07.22.03.00
HIIA Version = 07.22.03.00
Driver Name = megaraid_sas
Driver Version = 07.719.03.00-rc1


System Information
        Manufacturer: Supermicro
        Product Name: SYS-110P-WTR

The issue was: as soon as I run

megacli -AdpAllInfo -aALL or megaclisas-status (or periodic run of megaclisas-statusd)

My system froze for a while, I was not able to read from or write to disk, and dmesg was full of these errors:

[ 1661.722811] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 1661.722829] megaraid_sas 0000:18:00.0: FW in FAULT state Fault code:0x10000 subcode:0x0 func:megasas_wait_for_outstanding_fusion
[ 1661.722848] megaraid_sas 0000:18:00.0: resetting fusion adapter scsi0.
[ 1661.723202] megaraid_sas 0000:18:00.0: Outstanding fastpath IOs: 4
[ 1668.382749] megaraid_sas 0000:18:00.0: Waiting for FW to come to ready state
[ 1691.286479] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1691.286483] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1691.286684] megaraid_sas 0000:18:00.0: Current firmware supports maximum commands: 5101       LDIO threshold: 0
[ 1691.286687] megaraid_sas 0000:18:00.0: Performance mode :Balanced (latency index = 8)
[ 1691.286688] megaraid_sas 0000:18:00.0: FW supports sync cache        : Yes
[ 1691.286691] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 1691.398489] megaraid_sas 0000:18:00.0: FW supports atomic descriptor : Yes
[ 1693.890459] megaraid_sas 0000:18:00.0: FW provided supportMaxExtLDs: 1       max_lds: 240
[ 1693.890471] megaraid_sas 0000:18:00.0: controller type       : MR(8192MB)
[ 1693.890476] megaraid_sas 0000:18:00.0: Online Controller Reset(OCR)  : Enabled
[ 1693.890479] megaraid_sas 0000:18:00.0: Secure JBOD support   : Yes
[ 1693.890482] megaraid_sas 0000:18:00.0: NVMe passthru support : Yes
[ 1693.890484] megaraid_sas 0000:18:00.0: FW provided TM TaskAbort/Reset timeout        : 6 secs/60 secs
[ 1693.890485] megaraid_sas 0000:18:00.0: JBOD sequence map support     : Yes
[ 1693.890486] megaraid_sas 0000:18:00.0: PCI Lane Margining support    : Yes
[ 1701.562362] megaraid_sas 0000:18:00.0: megasas_get_ld_map_info DCMD timed out, RAID map is disabled
[ 1708.170289] megaraid_sas 0000:18:00.0: Waiting for FW to come to ready state
[ 1728.026073] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1728.026077] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1728.026300] megaraid_sas 0000:18:00.0: Current firmware supports maximum commands: 5101       LDIO threshold: 0
[ 1728.026303] megaraid_sas 0000:18:00.0: Performance mode :Balanced (latency index = 8)
[ 1728.026304] megaraid_sas 0000:18:00.0: FW supports sync cache        : Yes
[ 1728.026306] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 1728.402068] megaraid_sas 0000:18:00.0: FW supports atomic descriptor : Yes
[ 1728.550065] megaraid_sas 0000:18:00.0: FW provided supportMaxExtLDs: 1       max_lds: 240
[ 1728.550068] megaraid_sas 0000:18:00.0: controller type       : MR(8192MB)
[ 1728.550069] megaraid_sas 0000:18:00.0: Online Controller Reset(OCR)  : Enabled
[ 1728.550070] megaraid_sas 0000:18:00.0: Secure JBOD support   : Yes
[ 1728.550071] megaraid_sas 0000:18:00.0: NVMe passthru support : Yes
[ 1728.550072] megaraid_sas 0000:18:00.0: FW provided TM TaskAbort/Reset timeout        : 6 secs/60 secs
[ 1728.550074] megaraid_sas 0000:18:00.0: JBOD sequence map support     : Yes
[ 1728.550074] megaraid_sas 0000:18:00.0: PCI Lane Margining support    : Yes
[ 1736.149985] megaraid_sas 0000:18:00.0: megasas_get_ld_map_info DCMD timed out, RAID map is disabled
[ 1742.837909] megaraid_sas 0000:18:00.0: Waiting for FW to come to ready state
[ 1762.581695] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1762.581700] megaraid_sas 0000:18:00.0: FW now in Ready state
[ 1762.581901] megaraid_sas 0000:18:00.0: Current firmware supports maximum commands: 5101       LDIO threshold: 0
[ 1762.581904] megaraid_sas 0000:18:00.0: Performance mode :Balanced (latency index = 8)
[ 1762.581905] megaraid_sas 0000:18:00.0: FW supports sync cache        : Yes
[ 1762.581907] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 1762.985689] megaraid_sas 0000:18:00.0: FW supports atomic descriptor : Yes
[ 1763.145688] megaraid_sas 0000:18:00.0: FW provided supportMaxExtLDs: 1       max_lds: 240
[ 1763.145690] megaraid_sas 0000:18:00.0: controller type       : MR(8192MB)
[ 1763.145692] megaraid_sas 0000:18:00.0: Online Controller Reset(OCR)  : Enabled
[ 1763.145693] megaraid_sas 0000:18:00.0: Secure JBOD support   : Yes
[ 1763.145694] megaraid_sas 0000:18:00.0: NVMe passthru support : Yes
[ 1763.145695] megaraid_sas 0000:18:00.0: FW provided TM TaskAbort/Reset timeout        : 6 secs/60 secs
[ 1763.145697] megaraid_sas 0000:18:00.0: JBOD sequence map support     : Yes
[ 1763.145698] megaraid_sas 0000:18:00.0: PCI Lane Margining support    : Yes
[ 1763.145699] megaraid_sas 0000:18:00.0: return -EBUSY from megasas_refire_mgmt_cmd 4362 cmd 0x5 opcode 0x10b0100
[ 1763.145732] megaraid_sas 0000:18:00.0: return -EBUSY from megasas_mgmt_fw_ioctl 8408 cmd 0x5 opcode 0x10b0100 cmd->cmd_status_drv 0x3
[ 1763.145782] megaraid_sas 0000:18:00.0: waiting for controller reset to finish
[ 1763.205697] megaraid_sas 0000:18:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[ 1763.205984] megaraid_sas 0000:18:00.0: Adapter is OPERATIONAL for scsi:0
[ 1763.206131] megaraid_sas 0000:18:00.0: Snap dump wait time   : 15
[ 1763.206132] megaraid_sas 0000:18:00.0: Reset successful for scsi0.
[ 1763.206295] megaraid_sas 0000:18:00.0: 10672 (722633074s/0x0020/DEAD) - Fatal firmware error: Line 188 in fw\raid\utils.c

[ 1763.206572] megaraid_sas 0000:18:00.0: 10675 (722633081s/0x0020/CRIT) - Controller encountered an error and was reset
[ 1763.211401] megaraid_sas 0000:18:00.0: scanning for scsi0...
[ 1763.211666] megaraid_sas 0000:18:00.0: 10719 (722633106s/0x0020/DEAD) - Fatal firmware error: Line 188 in fw\raid\utils.c

[ 1763.211963] megaraid_sas 0000:18:00.0: 10722 (722633113s/0x0020/CRIT) - Controller encountered an error and was reset
[ 1763.218960] megaraid_sas 0000:18:00.0: scanning for scsi0...
[ 1763.221603] megaraid_sas 0000:18:00.0: 10765 (722633133s/0x0020/DEAD) - Fatal firmware error: Line 188 in fw\raid\utils.c

[ 1763.221742] megaraid_sas 0000:18:00.0: 10768 (722633140s/0x0020/CRIT) - Controller encountered an error and was reset
[ 1763.226380] megaraid_sas 0000:18:00.0: scanning for scsi0...

Nothing happens with megaclisas-status and the latest storcli, which I got from the Broadcom site.

Could you please fix this or add storcli? (An Ubuntu package is available from the Broadcom site: https://www.broadcom.com/products/storage/raid-controllers/megaraid-9560-16i)

romeor avatar Nov 24 '22 20:11 romeor

Hi, On such a recent kernel and controller, perhaps megacli no longer works (that binary has not been updated in years). Could you try using storcli? megaclisas-status supports both. Thanks, Vincent

ElCoyote27 avatar Nov 25 '22 02:11 ElCoyote27

Hi, it pulls in megacli as a dependency, and as soon as megaclisas-statusd starts, the server hangs and the FW crashes. When I delete the megacli string from the megaclisas-status script and execute it, it says no controller was found:

# megaclisas-status
No MegaRAID or PERC adapter detected on your system!

while running storcli against the controller shows it is OK:

storcli /c0 /vall show
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.19.17-1-pve
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

---------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC       Size Name
---------------------------------------------------------------
1/238 RAID5 Optl  RW     Yes     RAWBD -   ON   12.221 TB DATA
0/239 RAID1 Optl  RW     Yes     RAWBD -   ON  223.062 GB OS
--------------------------------------------------------------

romeor avatar Dec 03 '22 19:12 romeor

Here are my recommendations:

  1. Uninstall 'megacli' from your system (so that the script can no longer find it). megacli is installed on your system and it crashes your system, so you should uninstall it. megaclisas-status merely calls it when it is the first CLI found in the PATH.
  2. Type 'which storcli' to check where in the PATH that CLI is.
  3. Run megaclisas-status with '--debug' and paste the output here.

ElCoyote27 avatar Dec 03 '22 20:12 ElCoyote27
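
Steps 1 and 2 above can also be checked from Python before the wrapper is run. This is only a minimal sketch: the list of candidate binary names is an assumption made for illustration (it is not necessarily the list megaclisas-status uses), and `find_raid_clis` is a hypothetical helper, not part of the script.

```python
import shutil

# Candidate CLI names; this list is an assumption for illustration,
# not necessarily the one used by megaclisas-status.
CANDIDATES = ("megacli", "MegaCli", "MegaCli64",
              "perccli64", "perccli", "storcli64", "storcli")


def find_raid_clis(candidates=CANDIDATES, path=None):
    """Return {name: full_path} for every candidate found on PATH."""
    found = {}
    for name in candidates:
        location = shutil.which(name, path=path)
        if location is not None:
            found[name] = location
    return found


if __name__ == "__main__":
    clis = find_raid_clis()
    if not clis:
        print("no RAID CLI found on PATH")
    for name, location in clis.items():
        print(f"{name}: {location}")
    # Flag any legacy MegaCli binary that the wrapper could still pick up.
    if any(name.lower().startswith("megacli") for name in clis):
        print("warning: a legacy MegaCli binary is still on PATH")
```

If the warning is printed, the legacy binary is still reachable and could be picked up first, which is the situation step 1 is meant to rule out.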

Hello,

I am unable to install megaclisas-status without megacli, and I can't remove megacli without removing megaclisas-status. They depend on each other. If I install megaclisas-status right away, my system will crash again, as it also installs megaclisas-statusd, which runs right after installation and calls the megacli software...

romeor avatar Dec 06 '22 23:12 romeor

This must be because you're using your package manager and it has dependencies that bundle the two together. megaclisas-status is just a self-contained script that uses either megacli or storcli. In your situation, I would remove megacli since it crashes your system and just use the plain megaclisas-status script with storcli. You could install and distribute megaclisas-status outside of your package manager as it is only a script.

ElCoyote27 avatar Dec 07 '22 00:12 ElCoyote27

Hello again.

It seems like your wrapper is not working with the newer storcli binary.

I've installed storcli from the server manufacturer's site (Supermicro) and modified your script:

os.environ["PATH"] += os.pathsep + "/usr/bin/storcli"
# Find MegaCli
for megabin in "perccli64", "perccli", "storcli64", "storcli":

to exclude megacli from the process.

# megaclisas-status
No MegaRAID or PERC adapter detected on your system!

please update

romeor avatar Jan 31 '23 10:01 romeor
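
One detail in the modification above may be worth double-checking: entries in `PATH` are directories, so appending the full path of the binary (`/usr/bin/storcli`) would not by itself make a lookup find it; the directory (`/usr/bin`) is what gets scanned. The sketch below illustrates the difference with `shutil.which`. `add_cli_dir_to_path` is a hypothetical helper, and how megaclisas-status resolves its binaries internally is not shown in this thread, so treat this as an illustration rather than a fix.

```python
import os
import shutil


def add_cli_dir_to_path(binary_path, env_path):
    """Return env_path with the directory of binary_path appended.

    PATH entries are directories, so the file name is stripped first.
    """
    directory = os.path.dirname(binary_path)
    entries = env_path.split(os.pathsep) if env_path else []
    if directory and directory not in entries:
        entries.append(directory)
    return os.pathsep.join(entries)


if __name__ == "__main__":
    binary = "/usr/bin/storcli"  # path taken from the snippet above
    current = os.environ.get("PATH", "")
    # Appending the file itself adds an entry that is not a directory.
    file_on_path = current + os.pathsep + binary
    dir_on_path = add_cli_dir_to_path(binary, current)
    print("file appended:     ", shutil.which("storcli", path=file_on_path))
    print("directory appended:", shutil.which("storcli", path=dir_on_path))
```

Since `/usr/bin` is normally on `PATH` already, the binary was probably being found either way, which would point at the command syntax rather than the lookup as the reason for the "No MegaRAID or PERC adapter detected" message.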

Hi, I just got an H750P and I've noticed the following behaviour:

  • The old MegaCLI binary hangs the system (on RHEL8).
  • The old perccli (1.11 from 2014) which supports the Legacy MegaCLI syntax -also- hangs the system.
  • Only the new perccli (perccli-007.1623.0000.0000-1.noarch from 2020) does not hang the system

Unfortunately, the latest perccli/storcli no longer supports the old Legacy MegaCLI syntax, so I guess we'll have to rewrite many parts of megaclisas-status. (Maybe create a percclisas-status?)

:(

ElCoyote27 avatar Mar 05 '23 16:03 ElCoyote27

@romeor Hello, I encountered the exact same issue, while I only have storcli on my machine. Did you solve the problem?

Leox0717 avatar Mar 18 '24 09:03 Leox0717

Hello, @Leox0717

Uninstall megacli and install storcli.

romeor avatar Mar 18 '24 14:03 romeor

Sorry to bump this, I was just Googling for Line 188 in fw\raid\utils.c from dmesg and came across this - am using a Dell H750 card and made a few changes to the script to suit: https://gist.github.com/andrewladlow/9f4d03aab8ef0e957343b65ee6638c3a

Tested using perccli 007.0127, example output:

megaclisas-status
-- Controller information --
-- ID | H/W Model         | RAM    | Temp | BBU    | Firmware
c0    | PERC H750 Adapter | 8192MB | 42C  | Good   | FW: 52.21.0-4606

-- Array information --
-- ID  | Type   |    Size |  Strpsz | Flags | DskCache |   Status |  OS Path | CacheCade |InProgress
c0u239 | RAID-6 |  87313G |  256 KB | RA,WB |  Enabled |  Optimal |      239 | None      |None

-- Disk information --
-- ID     | Type | Drive Model                        | Size     | Status          | Speed    | Temp | Slot ID  | LSI ID
c0u239p0  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P9R9B | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:0]   | 23
c0u239p1  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P97HF | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:1]   | 21
c0u239p2  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P9QHE | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:2]   | 25
c0u239p3  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P9XY9 | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 27C  | [64:3]   | 24
c0u239p4  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P9RRF | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:4]   | 22
c0u239p5  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P8GSL | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:5]   | 20
c0u239p6  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P9Z4Q | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 27C  | [64:6]   | 18
c0u239p7  | HDD  | ST16000NM005G-2KH133 EAL6 ZL2P82Z8 | 14.551 TB | Online, Spun Up | 6.0Gb/s  | 27C  | [64:7]   | 19

Not sure what the text would actually be for the BBU if it were to fail, just used [A-Za-z].* as a bit of a guess but this could end up not matching

andrewladlow avatar Apr 16 '24 19:04 andrewladlow

@andrewladlow Wow, that's great! I have an H750P too, let me try your version.

ElCoyote27 avatar Apr 16 '24 20:04 ElCoyote27

Unfortunately, later versions of perccli removed the 'megacli' compatibility mode:

# rpm -q perccli
perccli-007.0127.0000.0000-1.noarch
# ./megaclisas-status 
-- Controller information --
-- ID | H/W Model         | RAM    | Temp | BBU    | Firmware     
c0    | PERC H750 Adapter | 8192MB | 49C  | Good   | FW: 52.26.0-5179 

-- Array information --
-- ID  | Type   |    Size |  Strpsz |   Flags | DskCache |   Status |  OS Path | CacheCade |InProgress   
c0u239 | RAID-0 |   1818G |  512 KB | ADRA,WB |  Enabled |  Optimal |      239 | None      |None         

-- Disk information --
-- ID     | Type | Drive Model                                      | Size     | Status          | Speed    | Temp | Slot ID  | LSI ID  
c0u239p0  | SSD  | S620NG0R208075X Samsung SSD 870 EVO 2TB SVT02B6Q | 1.818 TB | Online, Spun Up | 6.0Gb/s  | 26C  | [64:0]   | 8       

but if I upgrade perccli:

# rpm -q perccli
perccli-007.1910.0000.0000-1.noarch
# ./megaclisas-status 
No MegaRAID or PERC adapter detected on your system!

ElCoyote27 avatar Apr 16 '24 20:04 ElCoyote27

@andrewladlow There's an updated version here, btw: https://github.com/ElCoyote27/hwraid/blob/master/wrapper-scripts/megaclisas-status You seem to be using 1.78 and I have 1.87 in my fork.

ElCoyote27 avatar Apr 16 '24 20:04 ElCoyote27

Ah yeah I see what you mean, the script is trying to do -adpCount -NoLog but with the more recent version you just get:

CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 6.1.0-20-amd64
Status = Failure
Description = Deprecated command. Please use the new syntax.

The equivalent command seems to be show ctrlcount, but if you change that in the script you'll hit a similar syntax error when it tries to run -PDGetNum -a0 -NoLog for returnTotalDriveNumber (and so on), shame that it doesn't just accept the older syntax 😅

Thanks for mentioning the version by the way, didn't realise! Mine's from the Debian repo so must be a tad outdated by now

andrewladlow avatar Apr 16 '24 20:04 andrewladlow
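
The failure output quoted above is regular enough that a wrapper could detect it and report a clearer error than "No MegaRAID or PERC adapter detected". A minimal sketch, assuming the `Key = Value` header layout seen in the outputs pasted in this thread; `parse_cli_header` and `is_deprecated_syntax` are hypothetical helpers, not functions from megaclisas-status.

```python
def parse_cli_header(output):
    """Parse 'Key = Value' lines from storcli/perccli output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_deprecated_syntax(output):
    """True when the CLI rejected a legacy MegaCli-style command."""
    fields = parse_cli_header(output)
    return (fields.get("Status") == "Failure"
            and "Deprecated command" in fields.get("Description", ""))


# Output copied from the comment above.
SAMPLE = """\
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 6.1.0-20-amd64
Status = Failure
Description = Deprecated command. Please use the new syntax.
"""

if __name__ == "__main__":
    if is_deprecated_syntax(SAMPLE):
        print("CLI no longer accepts the legacy MegaCli syntax")
```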

What other changes did you add? I could only identify a %-5s vs %-6s on line 769. If you create a PR against my branch I'll review it. I know @eLvErDe has been super busy these past years so I have no idea if he'd be able to review a PR against the upstream.

ElCoyote27 avatar Apr 16 '24 20:04 ElCoyote27

If you run the script with --debug, you'll see the commands it executes:

e.g:

# megaclisas-status --debug 2>&1|grep perccli64|sort -u
# DEBUG (130) : Will use this executable: /opt/MegaRAID/perccli/perccli64
# DEBUG (165) : Got Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -l239 -a0 -NoLog
# DEBUG (165) : Got Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -lall -a0 -NoLog
# DEBUG (165) : Got Cached value: /opt/MegaRAID/perccli/perccli64 -LdPdInfo -a0 -NoLog
# DEBUG (165) : Got Cached value: /opt/MegaRAID/perccli/perccli64 -PDGetNum -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -AdpAllInfo -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -AdpBbuCmd -GetBbuStatus -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -adpCount -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -AdpGetPciInfo -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -l0 -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -l100 -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -l101 -a0 -NoLog
# DEBUG (168) : Not a Cached value: /opt/MegaRAID/perccli/perccli64 -LDInfo -l102 -a0 -NoLog

All of these would have to be rewritten for the newest perccli and the patterns/logic would need to be adjusted too.

ElCoyote27 avatar Apr 16 '24 21:04 ElCoyote27