
Check health values on different chassis than System.Embedded.1

Open manfredw opened this issue 4 years ago • 21 comments

I'm looking for a way to monitor components that are not accessible via the main board of the server.

The hardware is a DELL R540 server with a directly SAS-attached MD1400 HDD shelf. Redfish reports 3 different chassis:

"Members": [
        {
            "@odata.id": "/redfish/v1/Chassis/System.Embedded.1"
        },
        {
            "@odata.id": "/redfish/v1/Chassis/Enclosure.Internal.0-1:RAID.Integrated.1-1"
        },
        {
            "@odata.id": "/redfish/v1/Chassis/Enclosure.External.0-0:RAID.Slot.4-1"
        }
    ],

There are power supplies, fans and other sensors on the external enclosure, but they are not included in the query results (disks from the internal enclosure are reported). Should the script iterate over all members?
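Iterating over all members could look roughly like this (a sketch based on the collection above; the helper name is hypothetical, not from the plugin):

```python
import json

# Hypothetical helper: extract every chassis ID from a /redfish/v1/Chassis
# collection response, so a check could iterate over all members instead
# of only System.Embedded.1.
def chassis_ids(collection_json: str) -> list:
    members = json.loads(collection_json).get("Members", [])
    return [m["@odata.id"].split("/")[-1] for m in members]

sample = """{"Members": [
    {"@odata.id": "/redfish/v1/Chassis/System.Embedded.1"},
    {"@odata.id": "/redfish/v1/Chassis/Enclosure.External.0-0:RAID.Slot.4-1"}
]}"""
print(chassis_ids(sample))
# ['System.Embedded.1', 'Enclosure.External.0-0:RAID.Slot.4-1']
```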

Best Regards, Manfred

manfredw avatar Sep 29 '21 15:09 manfredw

Hi,

Well, this is a bit tricky.

The script would iterate over all chassis, but the DELL iDRAC has (or had) a bug where requesting these resources makes the iDRAC crash and reboot. 🤷‍♂️

That's why chassis with RAID in their name are skipped on DELL systems.

You could check this here: https://github.com/bb-Ricardo/check_redfish/blob/ffef06b85c7877d96e4f09ae4b652b2d9ba8ca81/cr_module/classes/redfish.py#L600

Comment out lines 600 and 601 and see what happens.

Let me know about your findings.

You might need to delete the session file!

bb-Ricardo avatar Sep 29 '21 15:09 bb-Ricardo

Thx for the quick response, it helped :-)

The good news: the server with the attached RAID is a newer one, running firmware 5.00.00.00 on iDRAC9, and there was no crash after commenting out the skip condition.

An older server with iDRAC8, firmware 2.75.75.75, still crashes; the crash is not shown in MEL/SEL. Tomorrow I will test the newest firmware 2.81.81.81. I will also try to make the if statement more specific, i.e. skip only RAID.Integrated or Enclosure.Internal as a workaround. Status data are also provided via SNMP, but that's only plan B.
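The narrowed condition could be sketched like this (hypothetical, not the actual redfish.py code): skip only internal RAID enclosures so an external shelf is still queried.

```python
# Hypothetical narrowed variant of the skip condition: instead of skipping
# every chassis whose ID contains "RAID", skip only internal enclosures so
# an external shelf like Enclosure.External.0-0:RAID.Slot.4-1 is queried.
def skip_chassis(chassis_id: str) -> bool:
    return ("RAID.Integrated" in chassis_id
            or chassis_id.startswith("Enclosure.Internal"))

print(skip_chassis("Enclosure.Internal.0-1:RAID.Integrated.1-1"))  # True
print(skip_chassis("Enclosure.External.0-0:RAID.Slot.4-1"))        # False
```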

Did anybody open a case with DELL Support for that issue? Is there a specific redfish query part that causes the crash?

manfredw avatar Sep 29 '21 19:09 manfredw

Hi,

This sounds somewhat promising.

A last resort would be to check the iDRAC firmware version and change the behavior depending on it. But usually we have to traverse into the manager resource and get the firmware version from there, which would add another layer of DELL-specific checks.
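Such a version gate might look like this (a sketch under the assumption, based on the reports in this thread, that iDRAC9 5.x survives the query and iDRAC8 2.x does not; the firmware string format is the dotted numeric form DELL reports, e.g. "5.00.00.00"):

```python
# Parse a dotted firmware version string into a comparable tuple.
def parse_fw(version: str) -> tuple:
    return tuple(int(p) for p in version.split("."))

# Hypothetical gate: only query RAID chassis on firmware versions that
# were reported not to crash (iDRAC9 5.x and later).
def raid_chassis_safe(firmware_version: str) -> bool:
    return parse_fw(firmware_version) >= (5, 0, 0, 0)

print(raid_chassis_safe("5.00.00.00"))  # True
print(raid_chassis_safe("2.81.81.81"))  # False
```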

I don't think anyone (I know) opened a case at DELL to address this topic.

Let me know about the findings with newer versions.

I would also highly appreciate a 5.00 Redfish mockup. Have a look at the Redfish-Mockup-Creator. Then I could include it in my test suite.

Cheers

bb-Ricardo avatar Sep 29 '21 19:09 bb-Ricardo

Hi, Any updates on this issue?

bb-Ricardo avatar Nov 02 '21 13:11 bb-Ricardo

Hi, I've deployed monitoring based on this check script for our standard hosts (power/fan/storage/...) with the newest iDRAC firmware - everything is running stable :-).

I've kept the changed filter condition in classes/redfish.py as mentioned above to avoid stability issues in the production environment, but commented out the condition in the lab environment. iDRAC9 with firmware 5.x in the lab is OK, but iDRAC8 with firmware 2.81.81.81 still crashes. I will try to open a case with DELL support during the next weeks, but this will take some time due to some business trips. Maybe I will need your support if DELL asks for details about the specific Redfish queries.

A strange behavior: power status checks on servers with an external storage shelf (with its own PSUs, fans, ...) only report the internal power supplies on the first run; subsequent runs show all power supplies.

First run:

```
[OK]: All power supplies (2) are in good condition and Power redundancy 1 status is: Enabled and 41 Voltages are OK|'ps_1'=114 'ps_2'=0 'voltage_CPU1_VCORE_VR'=1.79 'voltage_CPU2_VCORE_VR'=1.8 'voltage_CPU1_MEM012_VR'=1.22 'voltage_CPU1_MEM345_VR'=1.22 'voltage_CPU2_MEM012_VR'=1.22 'voltage_CPU2_MEM345_VR'=1.22 'voltage_PS1_Voltage_1'=230.0 'voltage_PS2_Voltage_2'=226.0
```

Subsequent runs:

```
[OK]: Chassi System.Embedded.1 : All power supplies (2) are in good condition and Power redundancy 1 status is: Enabled and 41 Voltages are OK
[OK]: Chassi Enclosure.External.0-0:RAID.Slot.4-1 : All power supplies (2) are in good condition|'ps_System.Embedded.1.1'=114 'ps_System.Embedded.1.2'=0 'voltage_System.Embedded.1.CPU1_VCORE_VR'=1.79 'voltage_System.Embedded.1.CPU2_VCORE_VR'=1.8 'voltage_System.Embedded.1.CPU1_MEM012_VR'=1.22 'voltage_System.Embedded.1.CPU1_MEM345_VR'=1.22 'voltage_System.Embedded.1.CPU2_MEM012_VR'=1.22 'voltage_System.Embedded.1.CPU2_MEM345_VR'=1.22 'voltage_System.Embedded.1.PS1_Voltage_1'=230.0 'voltage_System.Embedded.1.PS2_Voltage_2'=226.0
```

PS: feature request: make the performance data name prefix configurable or optional, to avoid doubled names like voltage_...Voltage or temp_...Temperature.

manfredw avatar Nov 02 '21 15:11 manfredw

Hi, any updates from Dell?

> PS: feature request for changing/disabling the performance parameter name prefix to avoid double voltage_...voltage or temp_...Temperature values.

Rather not. It is a generic prefix to distinguish the different metric types, and every vendor uses a different naming scheme. I know it looks a bit silly, but changing this would make it really hard to separate the metrics in your graphing tool.
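A toy illustration of the collision the prefix avoids (the helper is hypothetical, not the plugin's code): two chassis can each expose a sensor literally named "Voltage 1", so without a type and chassis prefix the perfdata labels would clash.

```python
# Hypothetical label builder: prefix the metric type and chassis ID so
# identically named sensors from different chassis stay distinguishable.
def perf_label(metric_type: str, chassis: str, sensor: str) -> str:
    return f"{metric_type}_{chassis}.{sensor}".replace(" ", "_")

print(perf_label("voltage", "System.Embedded.1", "PS1 Voltage 1"))
# voltage_System.Embedded.1.PS1_Voltage_1
```

This matches the label shape visible in the perfdata quoted above.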

bb-Ricardo avatar Dec 03 '21 21:12 bb-Ricardo

Hi, as written above, we've set up several Icinga checks to monitor cpu/ram/storage/power/fans/temp on 100+ systems. These systems are built from standard components with a RAID controller, and all are equipped with the same latest iDRAC firmware. On closer inspection this setup is not really stable.

It seems that the Redfish API crashes multiple times a day, but there are no logs within the iDRAC to identify the queries that could cause the issue. Checks usually run every 5 minutes; we can only presume there was a crash because some checks time out. The API seems to recover automatically after 2-3 minutes, but we were not able to pin the issue on a specific Redfish query.

The trigger could be a (successful?) query made shortly before a following check ends up in a timeout or with broken JSON data. It could also be a system overload when queries are made with insufficient time gaps between them. I still have not opened a case with DELL support because we cannot give them hints about what is going wrong. We have to change the Icinga event logging to store all check results instead of only state changes; then we should be able to identify the perpetrator.

I'll keep you informed...

manfredw avatar Dec 07 '21 16:12 manfredw

Thank you very much for this detailed update.

I can imagine this is a race condition between different requests in the Redfish implementation; in that event the Redfish daemon dies, gets restarted, and the check works again. Just a guess.

bb-Ricardo avatar Dec 08 '21 14:12 bb-Ricardo

Hi, Was wondering if you got any updates on this issue.

bb-Ricardo avatar Jan 23 '22 16:01 bb-Ricardo

ping

bb-Ricardo avatar Mar 03 '22 22:03 bb-Ricardo

Currently have to work on other projects, so there is nearly no progress.

Still crashes with newer firmware 5.10.0.0.

I also tried the Redfish-Protocol-Validator to find possible issues. It reports some failures with an unprivileged user (maybe due to missing rights, mainly in the context of SSE); with an admin user the tool loses the connection to the iDRAC during the tests (it seems the API crashes there as well).

manfredw avatar Mar 15 '22 17:03 manfredw

Thank you for reporting back. It seems like Dell needs a few more 5.x releases to get the iDRAC Redfish implementation stable.

bb-Ricardo avatar Mar 15 '22 20:03 bb-Ricardo

Hi. Was wondering if there are any updates on this issue.

bb-Ricardo avatar May 08 '22 15:05 bb-Ricardo

Currently no good news: the latest Dell updates (iDRAC8: 2.83.83.83, iDRAC9: 6.00.00.00) seem to have more Redfish issues than the older versions. We try to use the latest versions to address security vulnerabilities, but when there are issues with Redfish we have to stay on older versions.

manfredw avatar Jun 30 '22 07:06 manfredw

Hi, this doesn't sound promising.

Does the plugin work with 6.00.00.00?

bb-Ricardo avatar Jun 30 '22 08:06 bb-Ricardo

Not really; a short test with 6.0.0.0 caused some errors, so we decided to downgrade to 5.10.30.00. The 6.0.0.0 changelog says some new Redfish features (2020.3 and 2020.4) were implemented: https://www.dell.com/support/home/de-at/drivers/driversdetails?driverid=r0h6y&oscode=wst14&productcode=poweredge-r440

manfredw avatar Jun 30 '22 10:06 manfredw

Would you be able to create a mockup of a server with 6.00.00.00 iDRAC firmware and send it to me? Then I could try to add support for it and see where it fails.

bb-Ricardo avatar Jun 30 '22 11:06 bb-Ricardo

Currently not, because I have no matching lab servers available until Sept/Oct.

manfredw avatar Jun 30 '22 11:06 manfredw

Hey, any updates on this topic?

bb-Ricardo avatar Nov 10 '22 21:11 bb-Ricardo

Still no real change in the situation. Firmware 2.82.82.82 works on the older Gen8 iDRACs; 2.83.83.83 makes trouble. This hardware has less CPU power than the newer generations, and requests take a long time in general.

I'm awaiting delivery of some new PowerEdge Rx50 servers with Gen9 iDRAC in December to do more testing. Currently firmware 5.10.50.0 works on Rx40 servers; version 6.x completely fails. There was an announcement that the new Redfish 2019.3, 2019.4, 2020.1 and 2021.2 specifications are supported starting with firmware 5.10.x.x, but I don't know the consequences of this change. CPU power on Gen9 is better, but it seems to be overloaded by more complex requests, or Redfish requests are throttled in the background.

I can imagine that reducing the number of Redfish requests would lead to a better result, e.g. by querying more/all data in one request instead of iterating over each object. There is a query parameter named $expand that seems to support this, but I don't know if it is already in use or whether all manufacturers support it.
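For illustration, the $expand query parameter looks like this (a sketch; the Redfish specification defines `?$expand=.($levels=1)` to inline subordinate resources one level deep, so a collection and its members come back in a single response, but vendor and firmware support varies):

```python
# Build a Redfish collection URL with the $expand query parameter, per the
# Redfish spec's syntax: "." expands subordinate (hyperlinked) resources,
# "$levels" controls the expansion depth.
def expand_url(base: str, path: str, levels: int = 1) -> str:
    return f"{base}{path}?$expand=.($levels={levels})"

print(expand_url("https://idrac.example", "/redfish/v1/Chassis"))
# https://idrac.example/redfish/v1/Chassis?$expand=.($levels=1)
```

The host name here is a placeholder; whether a given iDRAC honors the parameter has to be tested per firmware version.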

manfredw avatar Nov 12 '22 11:11 manfredw

Thank you for the update. The $expand parameter is used for some vendors but not all. It really depends on the BMC version and whether it is supported properly. Some vendors implement it correctly and some, well, failed; you run into all sorts of invalid responses.

Dell and its 6.00 firmware is currently a lost cause. I hope they get their bits together and fix this.

We also tried running 6.X on some of our machines but had to revert to the latest 5.x due to multiple issues.

bb-Ricardo avatar Nov 12 '22 14:11 bb-Ricardo

I will close this issue for now. If there are any news we can reopen it.

bb-Ricardo avatar Jun 06 '24 10:06 bb-Ricardo