smartctl_exporter

Incorrect Temperature_Celsius

Open DiTsi opened this issue 5 years ago • 3 comments

Temperature value: 240518299684

Prometheus:

smartctl_device_attribute{attribute_flags_long="updated_online",attribute_flags_short="-O----",attribute_id="194",attribute_name="Temperature_Celsius",attribute_value_type="raw",device="/dev/sda",instance="10.99.2.2:9633",job="smartctl",model_family="Hitachi/HGST  Travelstar Z5K500",model_name="Hitachi  HTS545050A7E380",serial_number="TE95123QJTSM6V"} 240518299684

smartctl --json --xall /dev/sda:

      {
        "id": 194,
        "name": "Temperature_Celsius",
        "value": 166,
        "worst": 166,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 2,
          "string": "-O---- ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": false,
          "auto_keep": false
        },
        "raw": {
          "value": 240518299684,
          "string": "36 (Min/Max 2/56)"
        }
      },

Full smartctl output here

DiTsi, Apr 15 '21 08:04

@DiTsi ... looking at a particular drive via smartctl --json --xall /dev/sdh myself, I see that the value indeed does not make much sense as a temperature reading. But it is simply the raw value that smartctl (and the drive firmware, for that matter) returns:

[...]
      {
        "id": 194,
        "name": "Temperature_Celsius",
        "value": 181,
        "worst": 181,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 2,
          "string": "-O---- ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": false,
          "auto_keep": false
        },
        "raw": {
          "value": 176095166497,
          "string": "33 (Min/Max 23/41)"
        }
      },
[...]

or, if you would rather look at the table output:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   137   137   054    -    104
  3 Spin_Up_Time            POS---   133   133   024    -    495 (Average 495)
  4 Start_Stop_Count        -O--C-   100   100   000    -    19
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   140   140   020    -    15
  9 Power_On_Hours          -O--C-   095   095   000    -    39270
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    19
192 Power-Off_Retract_Count -O--CK   099   099   000    -    1235
193 Load_Cycle_Count        -O--C-   099   099   000    -    1235
194 Temperature_Celsius     -O----   181   181   000    -    33 (Min/Max 23/41)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0

However, there is also the metric smartctl_device_temperature, which reads from the temperature block of the JSON output:

[...]
"temperature": {
    "current": 33,
    "power_cycle_min": 25,
    "power_cycle_max": 34,
    "lifetime_min": 23,
    "lifetime_max": 41,
    "op_limit_min": 0,
    "op_limit_max": 60,
    "limit_min": -40,
    "limit_max": 70,
    "lifetime_over_limit_minutes": 0,
    "lifetime_under_limit_minutes": 0
  },
[...]

(see https://github.com/prometheus-community/smartctl_exporter/blob/75c76b363f6fb8454655cba5ebc4ad8089910670/smartctl.go#L211)
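
As a rough illustration of what reading that block amounts to, here is a minimal Python sketch. It is not the exporter's Go code, and the function name `current_temperature` is made up for this example; it only assumes that the `smartctl --json` output carries a `temperature` object with a `current` field, as in the excerpt above.

```python
import json


def current_temperature(smartctl_json: str):
    """Return temperature.current from smartctl --json output, or None if absent."""
    data = json.loads(smartctl_json)
    # Hypothetical helper: drives without a temperature block yield None.
    return data.get("temperature", {}).get("current")


# Trimmed sample based on the temperature block quoted above.
sample = '{"temperature": {"current": 33, "lifetime_min": 23, "lifetime_max": 41}}'
print(current_temperature(sample))
```

Because this value is already in degrees Celsius, no vendor-specific decoding of attribute 194 is needed.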

If you look at the smartmontools manpage (https://github.com/smartmontools/smartmontools/blob/20d4f102744d0d8978bcad3e1c21773ef0520553/smartmontools/smartctl.8.in#L1225), it clearly states that a conversion is required and that some vendors even do weird things. Please also see https://www.smartmontools.org/wiki/FAQ#Whyismydisktemperaturereportedbysmartdas150Celsius about the drive temperature.

I suppose in the end the exporter just converts what smartctl reports into metrics. Any issues should rather be reported as bugs with smartmontools at https://github.com/smartmontools/smartmontools/issues
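
For the raw values quoted in this thread, the large number appears to be three 16-bit words packed into one integer: current temperature in the lowest word, then minimum, then maximum. The Python sketch below unpacks them under that assumption. The layout is inferred only from the samples here and is vendor-specific, so it should not be read as a general rule; `unpack_attr_194` is a hypothetical name.

```python
def unpack_attr_194(raw_value: int):
    """Split a raw value into (current, minimum, maximum) 16-bit words.

    Assumed layout (inferred from the samples in this thread only):
    bits 0-15 = current, bits 16-31 = minimum, bits 32-47 = maximum.
    """
    current = raw_value & 0xFFFF
    minimum = (raw_value >> 16) & 0xFFFF
    maximum = (raw_value >> 32) & 0xFFFF
    return current, minimum, maximum


for raw in (240518299684, 176095166497):
    print(raw, unpack_attr_194(raw))
```

With this layout, 240518299684 unpacks to (36, 2, 56) and 176095166497 to (33, 23, 41), which matches the raw strings "36 (Min/Max 2/56)" and "33 (Min/Max 23/41)" that smartctl prints for the same attributes.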

frittentheke, Aug 18 '23 08:08

I can confirm that it's still an issue. The output of smartctl --json --xall /dev/sdX is

{
        "id": 194,
        "name": "Temperature_Celsius",
        "value": 34,
        "worst": 34,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 34,
          "string": "-O---K ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": false,
          "auto_keep": true
        },
        "raw": {
          "value": 201864052770,
          "string": "34 (Min/Max 9/47)"
        }
      },

for SSDs, and

 {
        "id": 194,
        "name": "Temperature_Celsius",
        "value": 108,
        "worst": 102,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 34,
          "string": "-O---K ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": false,
          "auto_keep": true
        },
        "raw": {
          "value": 35,
          "string": "35"
        }
 } 

for HDDs.
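
The two samples differ in the numeric raw value but both carry a readable `raw.string`. A minimal Python sketch for extracting the temperature from that string is shown below. It assumes only the two string shapes seen in this thread ("34 (Min/Max 9/47)" and "35"); other drives may format it differently, and `parse_raw_string` is a hypothetical helper, not part of smartctl or the exporter.

```python
import re

# Leading integer, optionally followed by "(Min/Max a/b)".
_RAW_STRING = re.compile(r"^\s*(-?\d+)(?:\s*\(Min/Max\s+(-?\d+)/(-?\d+)\))?")


def parse_raw_string(raw_string: str):
    """Return (current, minimum, maximum); min/max are None when not present."""
    match = _RAW_STRING.match(raw_string)
    if match is None:
        return None  # unrecognised format
    current, minimum, maximum = match.groups()
    return (
        int(current),
        int(minimum) if minimum is not None else None,
        int(maximum) if maximum is not None else None,
    )


print(parse_raw_string("34 (Min/Max 9/47)"))
print(parse_raw_string("35"))
```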

easymoney322, Feb 14 '24 10:02

Don't use smartctl_device_attribute; that query is handled by smart.mineDeviceAttribute(). Use smartctl_device_temperature instead, which is handled by smart.mineTemperatures(). It is even supposed to support non-SATA drives: https://github.com/smartmontools/smartmontools/issues/243#issuecomment-1943871227

easymoney322, Feb 15 '24 10:02