
Smart plugin: request new metric "smart_device_power_status"

Open EcceGratum opened this issue 3 years ago • 4 comments

Use Case

This is an extension of #9306.

I would like a new metric "smart_device_power_status" that returns the power state of the drives. Currently, basic on/off information is available in the "power" label of the metric "smart_device_exit_status", but I have two issues with that:

  1. It's difficult to build a reliable Grafana time series from it
  2. It only reports two states, "Active" or "Standby", but there are many in-between power states that "smartctl" can report (depending on the drive)

As far as I know, smartctl reports "ACTIVE or IDLE", "IDLE_A", "IDLE_B", "IDLE_C", and "STANDBY". I know there are also "Standby_Y" and "Standby_Z", but I don't know whether smartctl reports them or just uses "STANDBY" instead. Some are more useful than others: in "IDLE_B", for example, the disk parks its heads.

Expected behavior

smart_device_power_status{device="sda",enabled="",host="92889644d4c0",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 5
smart_device_power_status{device="sdb",enabled="",host="92889644d4c0",model="",power="IDLE_B",serial_no="",user="$USER",wwn=""} 2
smart_device_power_status{device="sdc",enabled="",host="92889644d4c0",model="",power="UNKNOWN",serial_no="",user="$USER",wwn=""} -1
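To make the requested metric concrete, here is a minimal Python sketch of a string-to-number encoding. Only STANDBY = 5, IDLE_B = 2 and UNKNOWN = -1 come from the example above; the remaining codes, and the names `POWER_STATE_CODES` and `power_state_code`, are assumptions made up for illustration and are not part of Telegraf.

```python
# Hypothetical numeric encoding of smartctl power-mode strings.
# Only STANDBY=5, IDLE_B=2 and UNKNOWN=-1 are taken from the example metrics;
# every other value is an assumption chosen to keep the ordering monotonic.
POWER_STATE_CODES = {
    "ACTIVE or IDLE": 0,
    "IDLE_A": 1,
    "IDLE_B": 2,
    "IDLE_C": 3,
    "STANDBY_Y": 4,
    "STANDBY": 5,
}
UNKNOWN_CODE = -1


def power_state_code(state: str) -> int:
    """Return the numeric code for a power-mode string, or -1 if unrecognized."""
    return POWER_STATE_CODES.get(state.strip(), UNKNOWN_CODE)
```

A numeric value like this can be plotted directly or mapped back to text with Grafana value mappings, which is the main reason for asking for a number rather than a label.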

Actual behavior

smart_device_exit_status{capacity="",device="sde",enabled="",host="92889644d4c0",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="2000398934016",device="sdd",enabled="Enabled",host="92889644d4c0",model="SAMSUNG HD203WI",power="ACTIVE",serial_no="",user="$USER",wwn=""} 0

Additional info

[[inputs.smart]]
  use_sudo = true
  nocheck = "standby"
  devices = [
    "hostfs/dev/sda -d ata",
    "hostfs/dev/sdb -d ata",
    "hostfs/dev/sdc -d ata",
    "hostfs/dev/sdd -d ata",
    "hostfs/dev/sde -d ata",
    "hostfs/dev/sdf -d ata"]

EcceGratum avatar Dec 18 '22 16:12 EcceGratum

Can you show the actual and expected behaviour metrics in influx line format please? That will make it clearer what exactly you are requesting.

Hipska avatar Dec 21 '22 17:12 Hipska

I am not familiar with that format as I don't use InfluxDB, but it should be something like this:
smart_device_power_status,harddrive=/dev/sda status=5 1465839830100400200
or
smart_device_power_status,harddrive=/dev/sda status="STANDBY" 1465839830100400200

The S.M.A.R.T. plugin already exports the data to prometheus as: smart_device_exit_status{capacity="",device="sde",enabled="",host="92889644d4c0",model="",power="STANDBY"

power needs to be its own metric and support the intermediate states, not just ACTIVE or STANDBY. Something like this: smart_device_power_status{device="sda",host="92889644d4c0"} 5

EcceGratum avatar Dec 22 '22 02:12 EcceGratum

The example output from the smart plugin is like this (according to the docs)

smart_device,enabled=Enabled,host=mbpro.local,device=rdisk0,model=APPLE\ SSD\ SM0512F,serial_no=S1K5NYCD964433,wwn=5002538655584d30,capacity=500277790720 udma_crc_errors=0i,exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=40i 1502536854000000000

I can see a field exit_status and I assume you also want a field power_status? If I can read your prometheus metric correctly, there should also already be a tag power? I can't find that in this current example, so it would help if you paste your current output in influx line format (by using file output for example) and also the output of the corresponding smartctl tool as given in the docs.

Hipska avatar Dec 22 '22 08:12 Hipska

"I can see a field exit_status and I assume you also want a field power_status?" Yes, "exit_status" is just the returned value when executing the smartctl command.

"If I can read your prometheus metric correctly, there should also already be a tag power?" Yes, it seems to have been added in #9306, but I think it's more of a workaround to know whether a drive is spun down. It's probably based on the value of "exit_status". If your smartctl command starts with "smartctl --nocheck=standby" and the "exit_status" is 2, the drive is spun down; if it returns 0, it's not. That is better than nothing, but we don't see the intermediate power states.

With the file output plugin, all the smart related info in influx line format:
smart_device,device=sdd,host=745557e0062c,power=STANDBY,user=$USER exit_status=2i 1671988565000000000
smart_device,device=sda,host=745557e0062c,power=STANDBY,user=$USER exit_status=2i 1671988565000000000
smart_device,capacity=500107862016,device=sdf,enabled=Enabled,host=745557e0062c,model=Samsung\ SSD\ 860\ EVO\ 500GB,power=ACTIVE,serial_no=***************,user=$USER,wwn=5002538e497dfcc4 uncorrectable_errors=0i,temp_c=24i,udma_crc_errors=0i,exit_status=0i,health_ok=true,reallocated_sectors_count=0i,wear_leveling_count=96i 1671988565000000000
smart_device,device=sdb,host=745557e0062c,power=STANDBY,user=$USER exit_status=2i 1671988565000000000
smart_device,device=sde,host=745557e0062c,power=STANDBY,user=$USER exit_status=2i 1671988565000000000
smart_device,capacity=4000787030016,device=sdc,enabled=Enabled,host=745557e0062c,model=ST4000VN008-2DR166,power=ACTIVE,serial_no=********,user=$USER,wwn=5000c5009de12d0b reallocated_sectors_count=0i,spin_retry_count=0i,command_timeout=7i,pending_sector_count=0i,uncorrectable_sector_count=0i,health_ok=true,read_error_rate=6300578i,seek_error_rate=5037987897i,end_to_end_error=0i,uncorrectable_errors=0i,temp_c=18i,udma_crc_errors=12i,exit_status=0i 1671988565000000000

EcceGratum avatar Dec 25 '22 17:12 EcceGratum

It only reports 2 states "Active" or "Standby" but there is a lot of inbetween power states that "smartctl" can report (depends on drive)

Based on this whitepaper the different states are primarily associated with spinning disks.

We currently grab the existing power and standby mode by looking at the output here and then use those discovered values here to set the power tag.

power needs to be its own metric

Can you share why you think this and why it cannot be another field that parses the power state in more detail?

I am hesitant to modify the smart plugin any further given how fragile the regular expression parsing is.

powersj avatar Jan 05 '23 18:01 powersj

"Based on this whitepaper the different states are primarily associated with spinning disks." Yes, these are intermediate states between fully active and fully spun down. Some states indicate head parking and/or a slower drive RPM.

In my custom exporter, I get the results from the "Device is in" line, not from "Power mode". I am not familiar with the output values of "Power mode". I could look into it if needed.

The parsing of "Device is in" and "Power mode" already seems correct, except that the code in "smart.go" uses "Power mode" (which may or may not return the intermediate states) and ignores what was parsed, checking only that something was parsed. I would guess that if the device is in standby, parsing "Power mode" returns an empty string.

"power needs to be its own metric" That comment only applies to what is exposed from Telegraf to Prometheus. I am guessing that there is some kind of translation layer. Inside Telegraf, the "smart_device,device=sdd,host=745557e0062c,power=STANDBY,user=$USER exit_status=2i 1671988565000000000" line can be reused to describe the other power states; I have no opinion on the matter.

The reason I want power status to be its own metric in what is exposed to Prometheus is that currently, if I use smart_device_exit_status{capacity="",device="sde",enabled="",host="92889644d4c0",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2 in a time series in Grafana 9, I get duplicate lines when a transition between power states happens. It seems to happen only on that time series.

In this Grafana dashboard, we can see that the transitioning drives are considered both "Active" and "Standby" during the transition period, which is odd. After that it's fine, but I get duplicates with no status for the drives that transitioned.

EcceGratum avatar Jan 09 '23 02:01 EcceGratum

Oh, that last part is just a matter of modifying your query in Grafana, or of changing the current power tag to a field with the converter processor.

Hipska avatar Jan 09 '23 11:01 Hipska

I tried a few things (including a "labels to fields" transform); none worked, though that may be due to my inexperience with Grafana.

When the issue happens, this is what the scraped data from telegraf to prometheus looks like:

smart_device_exit_status{capacity="",device="sda",enabled="",host="ed101bf7913e",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="",device="sdc",enabled="",host="ed101bf7913e",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="",device="sdd",enabled="",host="ed101bf7913e",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="",device="sde",enabled="",host="ed101bf7913e",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="",device="sdf",enabled="",host="ed101bf7913e",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="2000398934016",device="sdf",enabled="Enabled",host="ed101bf7913e",model="SAMSUNG HD203WI",power="ACTIVE",serial_no="S1UYJ1CZ700536",user="$USER",wwn="50024e9003bdf2f2"} 0
smart_device_exit_status{capacity="500107862016",device="sdb",enabled="Enabled",host="ed101bf7913e",model="Samsung SSD 860 EVO 500GB",power="ACTIVE",serial_no="S4XBNF0M714935D",user="$USER",wwn="5002538e497dfcc4"} 0

The device "sdf" appears twice for a few seconds when the device transitions (this happens with any device). That is not the behaviour I would expect, or that I see with other fields.
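For context, Prometheus identifies a time series by the metric name plus the complete label set, so a sample for "sdf" with power="STANDBY" and empty model/capacity labels is a different series from the one with power="ACTIVE" and filled-in labels. The Python sketch below counts distinct label sets per device in scraped text like the lines above. The function names are hypothetical, the parser only handles the simple `name{labels} value` form shown here, and why the older series lingers for a few seconds is not established in this thread.

```python
import re
from collections import defaultdict

# Matches the simple exposition form "name{label="value",...} value".
SAMPLE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one exposition line into (metric name, label dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def series_per_device(lines):
    """Count distinct label sets (that is, distinct series) for each device."""
    series = defaultdict(set)
    for line in lines:
        _, labels, _ = parse_sample(line)
        series[labels["device"]].add(frozenset(labels.items()))
    return {device: len(label_sets) for device, label_sets in series.items()}
```

A device that reports a count of 2 is exposed as two series at once, which is what shows up as a duplicate line in the graph. Aggregating in the query over only the stable labels (for example `device` and `host`) is one way to collapse them.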

I will try the converter processor, but I don't think it will fix that.

But this issue is about getting the other power states.

EcceGratum avatar Jan 11 '23 03:01 EcceGratum

About the other power states, please provide us the smart command and the output of such a different state, so someone can implement this.

About those ‘duplicates’, please provide them in influx line format as the Prometheus format also doesn’t give a timestamp. I’m still convinced this is a matter of doing a correct query in Grafana.

Hipska avatar Jan 11 '23 05:01 Hipska

The command is "smartctl --nocheck=standby /dev/sda". The output can be, for example, "Device is in ACTIVE or IDLE mode" or "Device is in IDLE_A mode".

This other command can also be used: "smartctl -i --nocheck=standby /dev/sda". One of the lines of the output is "Power mode is: ACTIVE or IDLE", "Power mode was: IDLE_B", "Power mode was: IDLE_A", etc. This is information I found with Google, as I currently can't use smartmontools 7.3.

I recently moved from Windows to Linux and noticed that on Linux, smartmontools never returns the intermediate power states (IDLE_B, etc.), but on Windows it does (on the same drives).

On Linux, I use smartmontools 7.2 (release 2020-12-30). On Windows, it was probably 7.3 (release 2022-02-28). I suspect the brand or model of the drive may also affect this (Seagate works, WD unknown).

I will try to find a way to use the latest version and give you the full output of the commands.

EcceGratum avatar Jan 16 '23 02:01 EcceGratum

I compiled smartmontools 7.3, used my script, and also called smartctl manually, but for some reason I don't get the intermediate power states, only "active or idle" or "standby".

Just to make sure that the code was not specific to Windows, I checked the smartmontools 7.3 sources.

For ATA devices, the power mode is requested in the file "ataprint.cpp" at line 3337. The returned int value goes into a switch that selects the proper string for the power mode. Here is the list supported for ATA devices (there is a separate file for SCSI devices): "SLEEP", "STANDBY", "STANDBY_Y", "IDLE", "IDLE_A", "IDLE_B", "IDLE_C", "ACTIVE_NV_DOWN", "ACTIVE_NV_UP", "ACTIVE or IDLE"

This string is printed to the console at line 3383 (drive is in standby), 3466 (drive is active and "-n" alone was used) or 3707 (I suspect this line is for "-i -n") in the same file:
l3383: jinf("Device is in %s mode, exit(%d)\n", powername, options.powerexit);
l3466: pout("Device is in %s mode\n", powername);
l3707: pout("Power mode %s %s\n", (powerchg?"was:":"is: "), powername);
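A minimal Python sketch of how a script could extract the power state from either of these output lines follows. The regular expressions are modelled only on the three format strings quoted above; the names `DEVICE_IS_IN_RE`, `POWER_MODE_RE` and `parse_power_state` are made up for illustration, and this is not the parsing code used in Telegraf's smart.go.

```python
import re

# Patterns modelled on the smartctl format strings quoted above:
#   "Device is in %s mode, exit(%d)", "Device is in %s mode", "Power mode %s %s"
DEVICE_IS_IN_RE = re.compile(
    r"^Device is in (?P<state>.+?) mode(?:, exit\(\d+\))?\s*$", re.MULTILINE
)
POWER_MODE_RE = re.compile(
    r"^Power mode\s+(?:is|was):\s*(?P<state>.+?)\s*$", re.MULTILINE
)


def parse_power_state(output):
    """Return the power state named in smartctl output, or None if absent."""
    for pattern in (DEVICE_IS_IN_RE, POWER_MODE_RE):
        match = pattern.search(output)
        if match:
            return match.group("state")
    return None
```

Returning the captured string, rather than only checking that something matched, is what keeps states such as IDLE_B distinguishable from STANDBY.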

With "-n standby" you can get only "STANDBY" and "STANDBY_Y". With "-n idle" you can additionally get "IDLE", "IDLE_A", "IDLE_B", and "IDLE_C".

By using the Seagate CLI I can force my drives into "idle_b", and it appears as such in Grafana with my Python scripts, which rely on "smartctl" on Linux. So the intermediate states are also reported on Linux; for some reason my drives just never go into IDLE_A or IDLE_B on Linux by themselves.

Edit 1: OK, my drives finally go into the intermediate states. I use "-n idle" in Telegraf and in my custom Python script.

"sudo smartctl -n idle /dev/sdc" returns:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-58-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Device is in IDLE_B mode, exit(2)

"sudo smartctl -i -n idle /dev/sdc" returns:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-58-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Device is in IDLE_B mode, exit(2)


EcceGratum avatar Jan 18 '23 03:01 EcceGratum

sudo smartctl -i -n idle /dev/sdc

OK, but this is not what Telegraf runs. It should run something to the effect of the following (I can't recall offhand how it translates hostfs/dev/sde -d ata, or whether it uses it raw):

sudo smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sde

Can you get the full output and see if you find 'Idle_B' in that output?
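To reproduce that command for each configured device, a small Python helper can assemble the argument list. This is only an illustrative sketch: the flags are copied from the command above, but splitting the device entry on whitespace and appending its tokens at the end is an assumption about how an entry such as "hostfs/dev/sde -d ata" might be translated, and `build_smartctl_args` is a made-up name, not Telegraf code.

```python
import shlex


def build_smartctl_args(device_entry, nocheck="standby"):
    """Assemble a smartctl argument list for one configured device entry.

    Assumption: extra options in the entry (e.g. "-d ata") are passed through
    as separate arguments after the device path.
    """
    device_tokens = shlex.split(device_entry)
    return [
        "smartctl",
        "--info",
        "--health",
        "--attributes",
        "--tolerance=verypermissive",
        "-n",
        nocheck,
        "--format=brief",
        *device_tokens,
    ]
```

The resulting list can be passed to `subprocess.run` (prefixed with `sudo` if required) to capture the full output being asked for here.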

powersj avatar Jan 18 '23 14:01 powersj

If Telegraf uses "-n standby" and polls more often than every 10 minutes, the disks will never go into IDLE modes.

If Telegraf is disabled and you wait until the disk goes into IDLE mode and then run the command: "sudo smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sda"

you get this : image

The drive WAS in IDLE mode and is immediately sent into ACTIVE mode.

In order to get the IDLE modes without forcing the drives into active mode, "-n idle" is a requirement. When running the command when the drive is in IDLE mode : "sudo smartctl --info --health --attributes --tolerance=verypermissive -n idle --format=brief /dev/sda"

image

Most of the info is unavailable even with the extra parameters, and it falls back to the shortest output. The drives remain in IDLE mode.

With "-n idle", very little information is available. The power mode is most likely the only info available. Since there was a "nocheck" option in the config file, I supposed Telegraf already did some reading with it set to "idle". With the current implementation, adding this seems like a chore.

EcceGratum avatar Jan 18 '23 23:01 EcceGratum

If Telegraf uses "-n standby" and polls every < 10mins, the disks will never go into IDLE modes.

Is that because Telegraf's calls to smartctl cause the disks to spin, so they never enter an idle state?

In order to get the IDLE modes without forcing the drives into active mode, "-n idle" is a requirement.

The purpose of the -n/--nocheck flag is to set what power states smartctl will use in order to prevent smartctl from spinning up the disks.

  • sleep - check the device, which will cause the disks to spin, but skip if the device is in sleep state
  • standby - the same, but skip if the device is in sleep or standby states
  • idle - the same, but skip if the device is in sleep, standby, or idle states
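The three settings above can be modelled as a simple threshold on how deep the drive's current power state is. The Python sketch below follows only the list above; grouping IDLE_A/IDLE_B/IDLE_C with "idle" and STANDBY_Y with "standby" is an assumption, and the names `POWER_DEPTH`, `NOCHECK_THRESHOLD` and `smartctl_skips` are invented for illustration.

```python
# Depth of each power mode, from most awake (0) to least awake (3).
# Assumption: the IDLE_x variants count as "idle", STANDBY_Y as "standby".
POWER_DEPTH = {
    "ACTIVE or IDLE": 0,
    "IDLE": 1,
    "IDLE_A": 1,
    "IDLE_B": 1,
    "IDLE_C": 1,
    "STANDBY": 2,
    "STANDBY_Y": 2,
    "SLEEP": 3,
}

# Shallowest depth at which each -n/--nocheck value makes smartctl skip the check.
NOCHECK_THRESHOLD = {"sleep": 3, "standby": 2, "idle": 1}


def smartctl_skips(nocheck, power_mode):
    """Return True if smartctl would skip the device instead of querying it."""
    return POWER_DEPTH[power_mode] >= NOCHECK_THRESHOLD[nocheck]
```

With this model, `smartctl_skips("standby", "IDLE_B")` is False (the drive is queried, and so woken from idle), while `smartctl_skips("idle", "IDLE_B")` is True (the drive is left alone).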

My conclusion reading your last post is that it is not possible to get these states from the devices you have since telegraf will always cause the disks to spin and as such not let the device go idle? Is that correct?

powersj avatar Jan 19 '23 00:01 powersj

"Is that because when telegraf calls smartctl will cause the disks to spin and never enter an idle state ?" Yes with "-n standby". No with "-n idle".

"My conclusion reading your last post is that it is not possible to get these states from the devices you have since telegraf will always cause the disks to spin and as such not let the device go idle? Is that correct?" If Telegraf uses "-n standby", yes the drives will never go into the intermediary idle states.

Does that mean that the "nocheck" option in "telegraf.conf" is not used in the command ?

EcceGratum avatar Jan 19 '23 00:01 EcceGratum

Your original issue shows you used a nocheck of standby. I assume you have tried with idle? What output do you get from that?

Does that mean that the "nocheck" option in "telegraf.conf" is not used in the command ?

The value of nocheck provided by the user is set and used here.

powersj avatar Jan 19 '23 14:01 powersj

I can confirm Telegraf correctly parses the power state with option nocheck set to idle.

But if the drive is in the active state and Telegraf is running, the drive will not enter Idle_b or Idle_c. If the drive is already in Idle_b or Idle_c when Telegraf is started, the drive will remain in its idle state and the power state will be correctly parsed.

When the drive is in active mode, something in Telegraf is preventing it from going into idle mode.

Also, the time series for the query "smart_device_exit_status" in Grafana is a bit of a mess for sdc, the device used for the test, but we can see that the power mode is correctly parsed.

image

NB : I use a docker container. devices = [ "hostfs/dev/sda -d ata", "hostfs/dev/sdb -d ata", "hostfs/dev/sdc -d ata", "hostfs/dev/sdd -d ata", "hostfs/dev/sde -d ata", "hostfs/dev/sdf -d ata", "hostfs/dev/sdg -d ata"]

EcceGratum avatar Jan 19 '23 23:01 EcceGratum

I can confirm Telegraf correctly parses the power state with option nocheck set to idle.

Awesome

But, if the drive is in active state and Telegraf is running, the drive will not enter Idle_b or Idle_c.

That is what I would expect. Recall my comment above about smartctl's nocheck option. It sets the power states in which smartctl will not spin up the drives. If nocheck is set to 'idle', then smartctl will only query the drive when it is in the active state. In all other states it will not spin up the disk.

You have a disk in an active state; smartctl looks and says "OK, I can spin the drives to get stats", and so it will. Unless your interval in Telegraf is set to something greater than the time it takes for the device to go back into idle, your disk will never go idle.
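The interaction between polling interval and idle timer can be illustrated with a toy simulation in Python. It assumes that every poll of an active drive counts as an access that resets the drive's idle timer, which is a simplification of real drive firmware; `first_idle_time` and its parameters are hypothetical names, and the 600-second timeout in the usage note is only an example value.

```python
def first_idle_time(poll_interval_s, idle_timeout_s, horizon_s):
    """Return the second at which the drive first goes idle, or None.

    Assumption: each poll of an active drive resets its idle timer, and the
    drive goes idle once idle_timeout_s seconds pass without any access.
    """
    last_access = 0  # the drive was last touched at t = 0
    for t in range(1, horizon_s + 1):
        if t - last_access >= idle_timeout_s:
            return t  # no access for a full timeout: the drive goes idle
        if t % poll_interval_s == 0:
            last_access = t  # polling an active drive counts as an access
    return None
```

For example, `first_idle_time(10, 600, 3600)` returns None because the drive is touched every 10 seconds and never reaches its timeout, whereas `first_idle_time(900, 600, 3600)` returns 600 because the first poll only arrives after the drive has already gone idle.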

At this point I think we have shown that telegraf can in fact report those idle values and hopefully this explains what is going on with smartctl.

powersj avatar Jan 20 '23 14:01 powersj

I suspect Telegraf may prevent the drives from going into idle modes with "-n idle" because of the extra options in the query ("--health --attributes --tolerance=verypermissive") when the drive is active. My custom script doesn't prevent active drives from going into idle modes, but its queries are much simpler: "smartctl --nocheck=idle" for the power mode and "smartctl --nocheck=idle -l scttempsts" for the temperatures. Perhaps it's something else, though.

Anyway, it seems it would take more than a few tweaks in the code to make this work in Telegraf. I will close this issue if you are OK with it.

Thank you for your time.

EcceGratum avatar Jan 20 '23 23:01 EcceGratum