
nut-2.8.2 does not seem to honor DEADTIME

Open avg-I opened this issue 1 year ago • 5 comments

I have a single UPS and several computers powered from it. One is a master that runs upsd and upsmon; the others are slaves that run just upsmon. Here is a snippet from upsmon.conf on one of the slaves:

MONITOR [email protected] 1  xxx xxx slave
POLLFREQ 15
POLLFREQALERT 5
DEADTIME 120

Recently, during a blackout, I had a glitch where the network interface on that slave went down for about 4 seconds. Its upsmon started powering off the machine in about 5 seconds, which was not ideal in the situation.

Here are some logs:

May 23 05:55:50 super kernel: lagg0: link state changed to DOWN
May 23 05:55:54 super kernel: lagg0: link state changed to UP
May 23 05:55:55 super upsmon[1078]: Poll UPS [[email protected]] failed - Server disconnected
May 23 05:55:55 super upsmon[1078]: Communications with UPS [email protected] lost
May 23 05:55:55 super upsmon[1078]: UPS [[email protected]] was last known to be not fully online and currently is not communicating, assuming dead
May 23 05:55:55 super upsmon[1078]: Executing automatic power-fail shutdown
May 23 05:55:55 super upsmon[1078]: Auto logout and shutdown proceeding

I expected that upsmon would wait for DEADTIME before doing that.
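
To make the timing concrete, here is a small sketch that works out the gaps between the log entries above and compares them with the configured DEADTIME. The year is an assumption (syslog timestamps do not carry one), and the variable names are only illustrative.

```python
from datetime import datetime

DEADTIME = 120  # seconds, from the upsmon.conf snippet above


def parse(ts: str) -> datetime:
    # Syslog timestamps have no year; 2024 is assumed from the issue date.
    return datetime.strptime(f"2024 {ts}", "%Y %b %d %H:%M:%S")


link_down = parse("May 23 05:55:50")  # lagg0 link state changed to DOWN
link_up = parse("May 23 05:55:54")    # lagg0 link state changed to UP
shutdown = parse("May 23 05:55:55")   # "Executing automatic power-fail shutdown"

outage_seconds = (link_up - link_down).total_seconds()
reaction_seconds = (shutdown - link_down).total_seconds()

print(f"link outage: {outage_seconds:.0f}s")
print(f"shutdown started {reaction_seconds:.0f}s after link down")
print(f"within DEADTIME ({DEADTIME}s): {reaction_seconds < DEADTIME}")
```

The shutdown decision came well inside the 120-second window, which is why the behavior looked like DEADTIME was being ignored.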

What additional information should I provide?

avg-I avatar May 23 '24 06:05 avg-I

I suppose "link down/up" transitions broke the TCP session, so the upsmon client was forcefully disconnected from the upsd data server while in a critical state, and behaved by design.

jimklimov avatar May 23 '24 07:05 jimklimov

So upsmon is supposed to start powering off immediately on any communication problem between upsd and upsmon while on battery?

Then what is DEADTIME for?

# DEADTIME - Interval to wait before declaring a stale ups "dead"
# 
# upsmon requires a UPS to provide status information every few seconds
# (see POLLFREQ and POLLFREQALERT) to keep things updated.  If the status
# fetch fails, the UPS is marked stale.  If it stays stale for more than
# DEADTIME seconds, the UPS is marked dead.
# 
# A dead UPS that was last known to be on battery is assumed to have gone
# to a low battery condition.  This may force a shutdown if it is providing
# a critical amount of power to your system.

Is that applicable only to a local (serial or USB connected) UPS? Is there any control like that for network communication?
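
For reference, the behavior that the quoted comment describes can be sketched roughly as follows. This is only a model of the documented wording (stale on a failed fetch, dead after more than DEADTIME seconds of staleness), not upsmon's actual code; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UpsTracker:
    """Hypothetical model of the stale/dead rules quoted from upsmon.conf."""

    deadtime: float             # DEADTIME, in seconds
    last_ok: float = 0.0        # time of the last successful status fetch
    last_status: str = "OL"     # last known status flags, e.g. "OL" or "OB"

    def poll_ok(self, now: float, status: str) -> None:
        # A successful fetch refreshes both the timestamp and the status.
        self.last_ok = now
        self.last_status = status

    def state_after_failed_poll(self, now: float) -> str:
        # A failed fetch marks the UPS stale; it only becomes dead once it
        # has stayed stale for more than DEADTIME seconds.
        return "dead" if now - self.last_ok > self.deadtime else "stale"

    def must_shutdown(self, now: float) -> bool:
        # A dead UPS last known to be on battery is treated as low battery.
        on_battery = "OB" in self.last_status.split()
        return on_battery and self.state_after_failed_poll(now) == "dead"


ups = UpsTracker(deadtime=120)
ups.poll_ok(now=0, status="OB")
print(ups.state_after_failed_poll(now=5), ups.must_shutdown(now=5))
print(ups.state_after_failed_poll(now=121), ups.must_shutdown(now=121))
```

Under this reading, a failure 5 seconds after the last good poll would leave the UPS merely stale, and a shutdown would only follow once the 120 seconds had elapsed.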

avg-I avatar May 23 '24 08:05 avg-I

The data server regularly updates the connected clients like upsmon with broadcasts about device information. For your corner case, "connected" is the critical word. Link flickered, IP address probably disappeared for a few seconds, TCP session got broken, server is assumed abruptly powered off (and/or its OS went down without waiting for clients to disconnect, so its upsd is off). And since the UPS was last known to be on battery, we haven't got much more time to reconnect or investigate either. To keep data safe, gotta run to stop services, flush filesystems ASAP.
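
The decision described here can be sketched as below. It is a simplified model of the explanation and of the "last known to be not fully online and currently is not communicating, assuming dead" log message, not the real upsmon source; the function names and the definition of "fully online" (OL flag present, OB absent) are assumptions made for illustration.

```python
def is_fully_online(status: str) -> bool:
    # Assumed definition: the OL flag is set and the OB flag is not.
    flags = status.split()
    return "OL" in flags and "OB" not in flags


def assume_dead_immediately(last_status: str, connected: bool) -> bool:
    """Strict policy as described: no grace period once the link is lost
    while the UPS was last seen in anything other than a fully online state."""
    return (not connected) and (not is_fully_online(last_status))


# The glitch from the logs: last status was on battery, then the TCP session broke.
print(assume_dead_immediately("OB", connected=False))  # shutdown path taken
# The same disconnect while on line power does not trigger it.
print(assume_dead_immediately("OL", connected=False))
```

Note that elapsed time does not appear in this check at all, which matches the observation that DEADTIME had no effect in this situation.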

This seems similar to the documented example where networking gear turns off because it is not on a UPS (or is on a weaker one), which is among the reasons for an emergency shutdown of a client. In your case the network was lost without any switch or router actually powering off.

jimklimov avatar May 24 '24 00:05 jimklimov

I see your point.

At the same time, the UPS was not really critical: it was on battery, but not at low battery.

It would be nice if users had some control over the behavior. Immediate shutdown on any glitch is not suitable for all. In some scenarios a UPS is used just to give enough time for an orderly shutdown. But in other scenarios people want to keep services running for as long as possible (e.g., with regularly scheduled blackouts).

We give the master server DEADTIME to restore communications with a UPS device. But we do not give slaves any time to restore communications with the master. Seems like an omission.

avg-I avatar May 24 '24 05:05 avg-I

Fair point, at least for the non-critical OB state. Would you care to post a PR for the new toggle?

For a bit more context about the current/default behavior, note, however, that as a UPS or its batteries age, the original assumptions about what constitutes an actual critical state can become obsolete (part of why some devices offer calibration functionality). So, based on invalid assumptions, we can think there's a lot of juice in the battery, while in fact the UPS is a glorified power strip or close to that.

jimklimov avatar May 24 '24 07:05 jimklimov

@jimklimov, I created #2462. Not sure if that matches your idea on how the issue should be resolved. In my opinion, going back to the traditional behavior is the best solution.

avg-I avatar Jun 05 '24 11:06 avg-I

Primary PR merged. Exploratory one (to return the log message) left out for now, per discussion. Maybe will come back to it for debug-only logging just in case, though.

jimklimov avatar Jun 10 '24 10:06 jimklimov

Just reading up on this some more after being out and about for a while. I agree a criticality should not be triggered by a minor communication staleness even when OB; respecting DEADTIME regardless of line state seems a wise decision here. The default value of 15 seconds should be short enough not to cause any major crises if the UPS is in fact dead, but also long enough not to bring down a server prematurely due to some minor, short-lived communication hiccups.

In any case, I'm always happy with user-configurability where it makes sense (and especially for shutdown criteria), and instant criticality might have indeed been a bit too strict here in retrospect (but with good intentions nonetheless).

desertwitch avatar Jun 10 '24 16:06 desertwitch