RFE: The systemd collector should have a timeout

Open • siebenmann opened this issue 3 years ago • 1 comment

We run our node_exporters with the systemd collector enabled. We've had a number of incidents where systemd's D-Bus interface was either extremely slow or entirely unresponsive, so node_exporter's attempt to collect systemd metrics hung. At the moment, this stops all metrics collection from node_exporter. It would be nice if the systemd collector had an (optional?) timeout, so that a failure to talk to systemd over D-Bus would only make that collector fail instead of making all host metrics unavailable.

The current code in collector/systemd_linux.go does pass a context to the relevant operations, but that context is context.TODO(). It might be a relatively simple change to add an optional timeout command-line argument and use it to create a context with a deadline, although I'm not sure which calls specifically need the new context. It might be sufficient to set it up in newSystemdDbusConn() and perhaps leave the other contexts alone.

Jan 26 '23 18:01 siebenmann
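
For illustration, the idea above (replacing context.TODO() with a context that carries a deadline) could look roughly like the following. This is only a sketch, not node_exporter code: collectWithTimeout, fakeDbusCall and demo are hypothetical names, the slow D-Bus request is simulated, and it assumes the underlying calls actually honor context cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// collectWithTimeout is a hypothetical helper, not part of node_exporter.
// It derives a context with the given timeout and hands it to collect.
// A timeout of zero or less means "no timeout", mirroring an optional flag.
func collectWithTimeout(timeout time.Duration, collect func(ctx context.Context) error) error {
	ctx := context.Background()
	if timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, timeout)
		defer cancel()
	}
	return collect(ctx)
}

// fakeDbusCall stands in for a D-Bus request that takes delay to answer.
// It returns ctx.Err() early if the context is done before the delay elapses.
func fakeDbusCall(delay time.Duration) func(ctx context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(delay):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demo runs one fast and one slow simulated call under a 50ms timeout.
func demo() (fastErr, slowErr error) {
	fastErr = collectWithTimeout(50*time.Millisecond, fakeDbusCall(time.Millisecond))
	slowErr = collectWithTimeout(50*time.Millisecond, fakeDbusCall(2*time.Second))
	return fastErr, slowErr
}

func main() {
	fastErr, slowErr := demo()
	fmt.Println("fast call error:", fastErr)
	fmt.Println("slow call hit deadline:", errors.Is(slowErr, context.DeadlineExceeded))
}
```

With this shape, a hung call surfaces as context.DeadlineExceeded, which a collector could report as its own failure while the rest of the scrape continues.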

I'd be open to that if we could do it in a way that we can also use in places where we already have timeouts, like the mountpoint stat timeout: https://github.com/prometheus/node_exporter/blob/c914f0052629e3c99449bdfb4fa7189ce09e77b5/collector/filesystem_linux.go#L40

@SuperQ wdyt?

Mar 07 '23 12:03 discordianfish
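
As a rough illustration of what a timeout mechanism shared between collectors might look like, here is a small sketch for blocking calls that do not accept a context (a stat on a stuck mount being one example). It is an assumption-laden example, not the existing filesystem collector code: runWithTimeout, errTimedOut and demoShared are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimedOut is a hypothetical sentinel error for this sketch.
var errTimedOut = errors.New("operation timed out")

// runWithTimeout is a hypothetical shared helper. It runs fn in a goroutine
// and returns errTimedOut if fn has not finished within timeout.
// Note that fn itself is not interrupted: its goroutine keeps running until
// fn returns, so a truly stuck call still occupies a goroutine.
func runWithTimeout(timeout time.Duration, fn func() error) error {
	// Buffered so the goroutine can deliver its result and exit
	// even when nobody is waiting any more.
	done := make(chan error, 1)
	go func() { done <- fn() }()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errTimedOut
	}
}

// demoShared runs one fast and one slow simulated operation under a 50ms timeout.
func demoShared() (fastErr, slowErr error) {
	fastErr = runWithTimeout(50*time.Millisecond, func() error { return nil })
	slowErr = runWithTimeout(50*time.Millisecond, func() error {
		time.Sleep(2 * time.Second) // simulated hang
		return nil
	})
	return fastErr, slowErr
}

func main() {
	fastErr, slowErr := demoShared()
	fmt.Println("fast operation error:", fastErr)
	fmt.Println("slow operation timed out:", errors.Is(slowErr, errTimedOut))
}
```

The trade-off in this design is that the caller stops waiting but the slow operation is not cancelled, so it suits calls that cannot take a context; where a context is accepted, passing one with a deadline is the cleaner option.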