iDRAC Version 7.00.00.171 Traceback Errors

Open downtownle opened this issue 1 year ago • 14 comments

Hello Tex,

Since updating to iDRAC version 7.00.00.171 we are seeing more traceback errors; it seems as if the iDRAC no longer responds properly. Here are two examples of how Icinga reacted to this:

  File "/usr/lib64/nagios/plugins/check_redfish.py", line 178, in <module>
    plugin.do_exit()
  File "/usr/lib64/nagios/plugins/dtag/check_redfish/cr_module/classes/plugin.py", line 427, in do_exit
    print(self.return_output_data())
  File "/usr/lib64/nagios/plugins/dtag/check_redfish/cr_module/classes/plugin.py", line 303, in return_output_data
    for command in sorted(self.__output_data.get_commands(), key=lambda x: output_order.index(x)):
  File "/usr/lib64/nagios/plugins/dtag/check_redfish/cr_module/classes/plugin.py", line 303, in <lambda>
    for command in sorted(self.__output_data.get_commands(), key=lambda x: output_order.index(x)):
ValueError: 'global' is not in list
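The `ValueError` at the end of this traceback is what `list.index` raises when the value it is asked for is missing from the list. The sketch below reproduces that pattern and shows one tolerant alternative. The contents of `output_order` and `commands` are made-up stand-ins (the plugin's real values are not shown in this thread), and `tolerant_sort` is a hypothetical helper, not a patch taken from check_redfish.

```python
# Hypothetical stand-ins for the plugin's data; the real lists differ.
output_order = ["power", "temp", "fan", "mel"]
commands = ["mel", "global", "fan"]


def strict_sort(cmds, order):
    # Same pattern as in the traceback: list.index raises ValueError
    # for any command that is not present in the ordering list.
    return sorted(cmds, key=lambda x: order.index(x))


def tolerant_sort(cmds, order):
    # Unknown commands are ranked after all known ones instead of raising.
    rank = {name: position for position, name in enumerate(order)}
    return sorted(cmds, key=lambda x: rank.get(x, len(order)))


try:
    strict_sort(commands, output_order)
except ValueError as exc:
    print(f"strict sort failed: {exc}")

print(tolerant_sort(commands, output_order))
```

Whether the right fix is to tolerate unknown commands or to find out why a `global` entry appears at all is a separate question; the sketch only illustrates the mechanism behind the error message.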

And from another server:

{
  "error": {
    "@Message.ExtendedInfo": [
      {
        "Message": "The requested operation cannot be completed because of an internal error.",
        "MessageArgs": [],
        "MessageArgs@odata.count": 0,
        "MessageId": "IDRAC.2.8.SYS446",
        "RelatedProperties": [],
        "RelatedProperties@odata.count": 0,
        "Resolution": "Retry the operation after a few minutes. If the issue persists, contact your service provider.",
        "Severity": "Critical"
      },
      {
        "Message": "The request failed due to an internal service error. The service is still operational.",
        "MessageArgs": [],
        "MessageArgs@odata.count": 0,
        "MessageId": "Base.1.12.InternalError",
        "RelatedProperties": [],
        "RelatedProperties@odata.count": 0,
        "Resolution": "Resubmit the request. If the problem persists, consider resetting the service.",
        "Severity": "Critical"
      }
    ],
    "code": "Base.1.12.GeneralError",
    "message": "A general error has occurred. See ExtendedInfo for more information"
  }
}

I don't think it is a login problem (we don't get a 400 Bad Request). It looks more like a problem with keeping the sessions or cleanly ending the Redfish call: when I run closessn -a, i.e. manually end all logins, the check works again. To me this suggests that the iDRAC keeps the session ID but does not pass the request data to that session ("The request failed due to an internal service error. The service is still operational."). Other checks with different session IDs continue to run smoothly. It may also have something to do with the size of the response ("The requested operation cannot be completed because of an internal error."), because the failing checks are usually the ones that return performance data or log entries (fan rotation, memory utilization, MEL/SEL).
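For reference, this is roughly what an explicit Redfish session lifecycle looks like: a POST to the session collection returns an `X-Auth-Token` and a `Location` header, and a DELETE on that location ends the session. The sketch below assumes the standard Redfish path `/redfish/v1/SessionService/Sessions`; the function names and the injectable `opener` parameter are hypothetical, and nothing here shows how check_redfish itself manages sessions or confirms the cause of the errors.

```python
import json
import urllib.request
from contextlib import contextmanager

# Standard Redfish session collection path (assumed to apply to this iDRAC).
SESSIONS_PATH = "/redfish/v1/SessionService/Sessions"


def build_login_request(host, username, password):
    # POST the credentials as JSON to create a session.
    body = json.dumps({"UserName": username, "Password": password}).encode()
    return urllib.request.Request(
        f"https://{host}{SESSIONS_PATH}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_logout_request(host, session_location, token):
    # session_location is the Location header of the login response;
    # it may be an absolute URL or a path on the same host.
    if session_location.startswith("http"):
        url = session_location
    else:
        url = f"https://{host}{session_location}"
    return urllib.request.Request(
        url, headers={"X-Auth-Token": token}, method="DELETE"
    )


@contextmanager
def redfish_session(host, username, password, opener=urllib.request.urlopen):
    # opener is injectable so that a caller can supply its own TLS settings
    # (iDRACs often use self-signed certificates) or a stub for testing.
    response = opener(build_login_request(host, username, password))
    token = response.headers["X-Auth-Token"]
    location = response.headers["Location"]
    try:
        yield token
    finally:
        # Always end the session, even if the caller's request failed.
        opener(build_logout_request(host, location, token))
```

Used as `with redfish_session(host, user, password) as token:`, the DELETE runs even when a request inside the block raises, which is the behavior one would want if leftover sessions turn out to be part of the problem.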

downtownle avatar Mar 18 '24 13:03 downtownle

Hi @downtownle

Can you share which iDRAC version you were using before updating to 7.00.00.171, and whether that version also had traceback errors?

For the internal error response do you know what URI(s) were being called for GET requests?

Thanks Tex

texroemer avatar Mar 18 '24 14:03 texroemer

Hello Tex,

the previous version was 6.10.80.00, A00

/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries?$skip=1350 (for logs, for example, it fetches the first 50 entries, then the next 50, etc.)

Otherwise, the plugin uses auto-discovery to find the specific paths (e.g. for fans) and then requests the corresponding data.

We use the Icinga plugin CheckRedfish (https://github.com/bb-Ricardo/check_redfish)
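The paging pattern described above (50 entries per request, with `$skip` advancing each time) can be sketched as follows. The page size of 50 comes from the description in this comment; the total of 5389 is only an example figure, and whether the plugin sends `$skip=0` for the first page or omits the parameter is an assumption. `paged_uris` is a hypothetical helper, not a function from check_redfish.

```python
ENTRIES_URI = "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries"


def paged_uris(total_entries, page_size=50, base=ENTRIES_URI):
    # One URI per page; $skip is the number of entries to step over.
    return [
        f"{base}?$skip={offset}"
        for offset in range(0, total_entries, page_size)
    ]


uris = paged_uris(5389)
print(len(uris))   # number of GET requests needed for a full walk
print(uris[27])    # the page that starts after 1350 entries
```

A log with several thousand entries therefore means on the order of a hundred GET requests per check, which is worth keeping in mind when looking at response size and session behavior.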

downtownle avatar Mar 19 '24 11:03 downtownle

Thanks for the details. Can you confirm whether you also saw traceback errors with 6.10.80, or whether you only see the issue with 7.00.00.171?

Thanks Tex

texroemer avatar Mar 19 '24 12:03 texroemer

Hello Tex,

I can't rule out the possibility that it happened, but if it did, it was very, very rare and never noticed. With the new version it affects 1 to 2 servers per day.

downtownle avatar Mar 19 '24 13:03 downtownle

Hello Tex,

Four more today, all log checks. It really does seem to have something to do with the size of the data to be received. Do you have a chance to check this in your lab?

downtownle avatar Mar 20 '24 12:03 downtownle

Hi @downtownle

Last night I looped check_redfish.py (1000 loops) with the --all argument and was unable to reproduce the issue.

On the server that reproduces the issue, can you send me the value for "redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/[email protected]"?

Thanks Tex

texroemer avatar Mar 20 '24 21:03 texroemer

Hello Tex,

I added the command and parameters below. It is a 14G system (R740xd) with iDRAC9 and firmware version 7.00.00.171.

Command which is executed: '/usr/lib64/nagios/plugins/check_redfish.py' '--host' '192.168.0.120' '--mel' '--password' 'geblurrt' '--retries' '5' '--timeout' '120' '--username' 'icinga'

__name "check_redfish_mel"
active true
arguments { --authfile: { description: "Autentication file content: username= password=", value: "$redfish_authfile$" }, --critical: { description: "Critical threshold for certain checks. See documentation", value: "$redfish_critical$" }, --detailed: { description: "always print detailed result instead of a condensed one line result", set_if: "$redfish_detailed$" }, --host: { description: "hostname or address of the interface to query", required: true, value: "$host.vars.interfaces_ilo$" }, --max: { description: "maximum of returned event log entries", value: "$redfish_max$" }, --mel: { required: true }, --password: { description: "The login password", value: "$redfish_password$" }, --retries: { description: "set number of maximum retries", value: "$redfish_retries$" }, --sessionfile: { description: "Name of the session file. make sure it is unique for every host", value: "$redfish_sessionfile$" }, --sessionfiledir: { description: "Directory where the session files should be stored", value: "$redfish_sessionfiledir$" }, --timeout: { description: "set number of request timeout per try/retry", value: "$redfish_timeout$" }, --username: { description: "The login user name", value: "$redfish_username$" }, --warning: { description: "Warning threshold for certain checks. See documentation", value: "$redfish_warning$" } }
command [ "/usr/lib64/nagios/plugins/check_redfish.py" ]
env null
execute { arguments: [ "checkable", "cr", "resolvedMacros", "useResolvedMacros" ], deprecated: false, name: "Internal#PluginCheck", side_effect_free: false, type: "Function" }
ha_mode 0
name "check_redfish_mel"
original_attributes null
package "director"
paused false
timeout 60
type "CheckCommand"
vars { check_address: { arguments: [], deprecated: false, name: "", side_effect_free: false, type: "Function" }, check_ipv4: false, check_ipv6: false, redfish_bmc: true }
version 0
zone "director-global"

downtownle avatar Mar 21 '24 13:03 downtownle

Thanks, but can you share the members count for the LC logs on your server?

texroemer avatar Mar 21 '24 13:03 texroemer

Hello Tex,

what exactly do you mean by the members count for the LC logs on your server?

downtownle avatar Mar 21 '24 14:03 downtownle

Can you run GET on URI "redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/[email protected]" to get this count value?

Example:

[root@localhost ~]# curl -k -X GET -u root:calvin -H "Content-Type: application/json" 'https://192.168.0.120/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/[email protected]' --insecure | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   240  100   240    0     0    583      0 --:--:-- --:--:-- --:--:--   583
{
  "@odata.context": "/redfish/v1/$metadata#LogEntryCollection.LogEntryCollection",
  "@odata.id": "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Members@odata.count": 5389
}
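If it is easier to read the count from a script than from curl and jq, the value can be pulled out of the JSON response with a few lines of Python. This sketch only parses a response body like the one above; how the body is fetched is left to the caller, and `members_count` is a hypothetical helper name.

```python
import json

# Response body as returned for the LC log entries collection above.
sample = """
{
  "@odata.context": "/redfish/v1/$metadata#LogEntryCollection.LogEntryCollection",
  "@odata.id": "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Members@odata.count": 5389
}
"""


def members_count(payload):
    # The collection reports its size in the Members@odata.count annotation.
    return json.loads(payload)["Members@odata.count"]


print(members_count(sample))
```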

texroemer avatar Mar 21 '24 14:03 texroemer

Hello Tex,

I added the output (see the attached screenshots).

downtownle avatar Mar 22 '24 11:03 downtownle

Thanks for the details. On a server that reproduces the issue, can you loop the check_redfish.py script from a terminal using a simple bash script and see whether you can hit the issue that way? This is the workflow I used to try to reproduce it: my server has over 5000 LC log entries and I was unable to hit the issue.

Also can you let me know if only one Redfish session to this iDRAC is running to pull data or are you running multiple Redfish sessions at the same time to this iDRAC?

Thanks Tex

texroemer avatar Mar 22 '24 19:03 texroemer

Hello Tex,

Do you have an example of what you mean by "can you just loop check_redfish.py script from a terminal using a simple bash script and see if you can hit this issue"?

downtownle avatar Mar 28 '24 16:03 downtownle

Sure. The example below is a bash loop script I created which calls the Python script. I just append the output to a file and then grep the file for any warning or critical errors.

root@localhost:/opt/check_redfish# cat loop.sh
#!/bin/bash

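# Initialize counter and read positional arguments:
# iDRAC IP, username, password, check argument (e.g. --all), loop count
counter=1
idrac_ip=$1
idrac_username=$2
idrac_password=$3
arg_name=$4
loop_count=$5

# Create loop.txt if needed and reset it to a single blank line
touch loop.txt
echo > loop.txt

# Run the check loop_count times
while [ "$counter" -le "$loop_count" ]
do
    # $arg_name is left unquoted so several flags can be passed as one argument
    python3 check_redfish.py -H "$idrac_ip" -u "$idrac_username" -p "$idrac_password" $arg_name
    echo "- Current loop Count: $counter"
    ((counter++))
done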

echo "Loop script finished"

root@localhost:/opt/check_redfish# ./loop.sh 192.168.0.120 root calvin --all 2 >> loop.txt
root@localhost:/opt/check_redfish# cat loop.txt | grep -i warning
root@localhost:/opt/check_redfish# cat loop.txt | grep -i critical
root@localhost:/opt/check_redfish#
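The same loop can also be written in Python, which makes it easy to tally results instead of grepping a file. This is a sketch under two assumptions: that check_redfish.py follows the usual Nagios/Icinga plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and that an unhandled Python exception shows up as "Traceback" on stderr. `run_loop` is a hypothetical helper, and the demo command below is a stand-in for the real plugin call.

```python
import subprocess
import sys
from collections import Counter

# Conventional Nagios/Icinga plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_loop(cmd, loops):
    # Run cmd repeatedly and count outcomes by exit code.
    tally = Counter()
    for _ in range(loops):
        result = subprocess.run(cmd, capture_output=True, text=True)
        tally[STATES.get(result.returncode, "UNKNOWN")] += 1
        # An uncaught exception exits with code 1, so count it separately.
        if "Traceback" in result.stderr:
            tally["TRACEBACK"] += 1
    return tally


if __name__ == "__main__":
    # Stand-in command; replace with the real call, for example:
    # ["python3", "check_redfish.py", "-H", ip, "-u", user, "-p", password, "--mel"]
    demo = [sys.executable, "-c", "import sys; print('CRITICAL - demo'); sys.exit(2)"]
    print(run_loop(demo, 3))
```

Counting tracebacks separately matters here because a crash of the plugin would otherwise be indistinguishable from an ordinary WARNING result.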

texroemer avatar Mar 28 '24 17:03 texroemer