ClusterManagers.jl

Handling of busy LSF daemon

Open DrChainsaw opened this issue 4 years ago • 4 comments

I sometimes get the following failure when adding LSF workers:

ClusterManagers.LSFException("LSF daemon (LIM) not responding ... still trying")

I'm not sure if this message is the same on all systems, but if it is, then maybe it's worth adding to the set of expected responses here.

DrChainsaw, Oct 07 '21 09:10

I have not seen that error message. What version of LSF are you using? Mine is:

$ lsid
IBM Spectrum LSF Standard 10.1.0.9, Oct 16 2019
Copyright International Business Machines Corp. 1992, 2016.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

bjarthur, Oct 07 '21 11:10

This is mine:

$ lsid
IBM Spectrum LSF Standard 10.1.0.10, Apr 10 2020

Just to clarify: this message appears every now and then in response to any LSF command (e.g. bsub, bjobs, bhosts) and is reprinted every few seconds. After a while (a few minutes at most), the LSF command completes successfully.
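
To illustrate the behaviour described above, here is a minimal sketch of how a caller could skip the repeated notice while reading a command's output and only act on the first real response line. The helper `first_real_line` and the constant `LIM_BUSY` are hypothetical names, not part of ClusterManagers.jl, and the output is simulated with an `IOBuffer` rather than taken from a real `bsub` call.

```julia
# Hypothetical helper for illustration; not part of ClusterManagers.jl.
# The message text is the one quoted at the top of this issue.
const LIM_BUSY = "LSF daemon (LIM) not responding ... still trying"

# Return the first line that is neither blank nor the transient notice,
# or `nothing` if the stream ends before such a line appears.
function first_real_line(io::IO)
    for line in eachline(io)
        isempty(strip(line)) && continue      # ignore blank lines
        occursin(LIM_BUSY, line) && continue  # transient notice, keep reading
        return line
    end
    return nothing
end

# Simulated output: the notice is printed twice, then the command answers.
simulated = IOBuffer("""
LSF daemon (LIM) not responding ... still trying
LSF daemon (LIM) not responding ... still trying
simulated response line
""")

println(first_real_line(simulated))  # prints "simulated response line"
```

Because the LSF command itself keeps retrying, the sketch does not re-run anything; it only avoids treating the notice as the final answer.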

It might be some (potentially home-made) overload protection. I will ask my admins about it.

DrChainsaw, Oct 07 '21 12:10

Admins responded that the message most likely appears due to restarts in conjunction with reconfiguration of the server.

DrChainsaw, Oct 07 '21 13:10

Maybe one could use this list to sort out the non-fatal messages? I'm pretty sure I have seen the "LSF is processing your request. Please wait…" message as well.
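
A rough sketch of what such a sorting step could look like, assuming the non-fatal messages are kept as a plain list of prefixes. `NONFATAL_PREFIXES` and `classify` are hypothetical names, and the list holds only the two messages mentioned in this thread; a real list would have to come from the LSF documentation.

```julia
# Hypothetical list for illustration; only the two messages quoted in this
# thread are included. Prefixes are used so that trailing text such as
# "... still trying" or the exact form of the ellipsis does not matter.
const NONFATAL_PREFIXES = [
    "LSF daemon (LIM) not responding",
    "LSF is processing your request. Please wait",
]

# Return :retry for a known non-fatal notice and :fatal for anything else.
function classify(msg::AbstractString)
    text = strip(msg)
    return any(p -> startswith(text, p), NONFATAL_PREFIXES) ? :retry : :fatal
end

println(classify("LSF daemon (LIM) not responding ... still trying"))  # prints "retry"
println(classify("some unrecognized message"))                         # prints "fatal"
```

Treating unknown messages as fatal keeps the current behaviour for everything that is not explicitly listed.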

DrChainsaw, Oct 07 '21 13:10