
Luna fail after adding nodes

chiensh opened this issue 4 months ago • 8 comments

We have encountered an issue with Luna during the migration of over 100 nodes to Trinity.

When attempting to reset the power using the lpower reset or luna control power reset commands, Luna hangs and displays the following error:

Image

It takes 5 to 10 minutes for Luna to recover.

Additionally, the luna node list command does not accurately reflect the updated status for some nodes (e.g., power reset), even when the power reset command is successfully executed.

chiensh avatar Sep 28 '25 08:09 chiensh

It seems it was due to the load of prometheus-infiniband-exporter staying too high (consistently over 100%), which left the controller too busy to respond.

I had to turn off prometheus-infiniband-exporter on the controller to regain control of luna.

chiensh avatar Oct 03 '25 17:10 chiensh

Disabling the prometheus-infiniband-exporter service alone helps alleviate the issue, but it does not fully resolve the problem. The luna command continues to hang during large-scale system reboots (e.g., for operating system updates) involving hundreds of nodes. We need to do this quite often during system testing and during migration from the old manager to TrinityX, and luna's behaviour here has been disappointing.

For example, when a power reset command is issued for nodes [001-200] and a large number of nodes begin shutting down, the load of Prometheus running on the controller spikes dramatically, ranging from 500% to 1200% on a 32-virtual-core controller. This excessive load makes the luna command unresponsive, which in turn prevents some nodes from shutting down properly or being redeployed successfully.
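One hypothetical mitigation while Prometheus is untuned is to stagger the resets in small batches so the load spike is spread over time instead of hitting all 200 nodes at once. The node-name pattern and the `lpower <range> reset` form below are assumptions based on the commands mentioned in this thread; adjust both to your site before running.

```shell
# Hypothetical helper: issue power resets in small batches with a pause
# between batches, to flatten the load spike on the controller.
# The node-name pattern and lpower syntax are assumptions; adapt to your site.
reset_in_batches() {  # usage: reset_in_batches <first> <last> <batch-size> <pause-seconds>
  i=$1
  while [ "$i" -le "$2" ]; do
    j=$(( i + $3 - 1 ))
    [ "$j" -gt "$2" ] && j=$2
    # Printed rather than executed, so the commands can be reviewed first;
    # pipe the output to sh to actually run them.
    printf 'lpower node[%03d-%03d] reset\n' "$i" "$j"
    sleep "$4"
    i=$(( j + 1 ))
  done
}

# e.g.: reset_in_batches 1 200 20 60   # 20 nodes at a time, 60 s apart
```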

We believe this situation can be improved once Prometheus is fully and correctly configured, but this configuration can only be implemented after the system is fully tested and operational. In the meantime, to ensure successful shutdowns and re-provisioning, the following services need to be stopped to restore luna functionality:

prometheus-infiniband-exporter.service
prometheus-node-exporter.service
prometheus-lshw-exporter.service
prometheus-server.service
prometheus-alertmanager.service
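For convenience, the five services can be stopped before a mass power operation and restarted afterwards in one pass; a minimal sketch of a hypothetical helper, assuming the systemd unit names listed above:

```shell
# Prometheus services observed to overload the controller during
# large-scale power operations (unit names as listed above).
SERVICES="prometheus-infiniband-exporter prometheus-node-exporter \
prometheus-lshw-exporter prometheus-server prometheus-alertmanager"

# Hypothetical helper: pass "stop" before the power operation,
# "start" once the nodes are back up.
toggle_prometheus() {  # usage: toggle_prometheus stop|start
  for s in $SERVICES; do
    systemctl "$1" "$s.service"
  done
}
```

Usage would be `toggle_prometheus stop` before issuing the resets and `toggle_prometheus start` afterwards.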

chiensh avatar Oct 07 '25 04:10 chiensh

We shut down the whole system using the "luna control power off" and "lpower off" commands; luna node list does not update to the latest status of the servers even after the commands have been issued multiple times.

We have confirmed the nodes have been powered off, but node list does not show the latest status.

Image

chiensh avatar Oct 08 '25 04:10 chiensh

I am frustrated with the performance of Luna when we attempt a large-scale power-up or power-down.

I suspected the sluggish performance was due to Prometheus and K3s, so I capped their maximum CPU load and memory, but the problem recurs whenever I power the whole system on or off: Luna stops responding.

Image

From htop, we can see that the load is not particularly high, yet luna fails to respond.

Image

It seems that luna stops responding during system boot once gunicorn comes under a heavy workload:

Image

On the other hand, there are times when Luna becomes responsive as soon as we take down the Prometheus server.

I still don’t have a conclusive answer to determine what exactly triggers Luna to hang.

chiensh avatar Oct 10 '25 04:10 chiensh

Quite a thread... Normally I'd say: please reach out to us so we can log in remotely to see what's happening. It would reduce the time needed to solve problems, and we can learn a thing or two. Win/win. Is this something to consider? But for now, first things first.

Not sure what's happening with the IB exporter. We deploy large installations (1000+ nodes) where we do not see this effect. How many nodes do you have? In this case I'd be very curious to see the logs of the IB exporter on the head node, or on any other node where it generates this load. The logs of the prometheus server would also help.

Also, since luna seems affected, the logs of the luna daemon. You could additionally send the file produced by lsosreport, though I wouldn't publish it here. Is it possible to upload it somewhere for us to download?

You mention you're running a virtual controller, which in itself is fine, but make sure you provide enough disk I/O (IOPS). Maybe there's a bottleneck there?

In regards to scaling luna, though you might not have that problem here, we typically tune gunicorn.conf to serve more threads for large installs. 8 to 16 is not out of the norm, but it should not exceed the number of cores on the server; N-4 or so should be OK.
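The rule of thumb above (roughly cores minus four, never exceeding the core count) can be sketched as a quick calculation on the controller. The gunicorn.conf location and the restart step are left out here since they vary by install:

```shell
# Derive a gunicorn worker count per the "N-4, not exceeding core count"
# rule of thumb mentioned above.
CORES=$(nproc)
WORKERS=$(( CORES - 4 ))
[ "$WORKERS" -lt 1 ] && WORKERS=1   # floor for small controllers

echo "workers = $WORKERS"
# Then set this value in gunicorn.conf and restart the luna daemon.
```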

-- Antoine

aphmschonewille avatar Oct 10 '25 19:10 aphmschonewille

Thank you Antoine,

We will discuss how ClusterVision can help and arrange access to our system later this week.

Now, I believe the IB exporter may not be the root cause.

As a quick test, we increased the CPU cores for Gunicorn from 9 to 31. It seems to help somewhat (luna comes back from time to time during booting, but it still stalls for up to a few minutes).

chiensh avatar Oct 11 '25 01:10 chiensh

Does this mean the nodes are provisioned through HTTP instead of BT? That may explain why the controller is so busy when the nodes are booting.

Image

Something that may be related: AlertX on OOD is very slow and fails most of the time.

Image

At one point it showed an error like this:

Image

chiensh avatar Oct 13 '25 05:10 chiensh

Yes, it does seem like it. Not sure why BT wouldn't work, though.

Let's do one step back. What if you just boot 10 nodes (the others can be kept off or as is if they're already up), does that work as expected? This including BT and all other bells and whistles?

Happy to see, though, that AlertX does its job. Yes, the number of TCP connections exceeds what we deem healthy (this includes a grace period of 10 min?), so it's a good catch!
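To correlate luna hangs with that connection pressure, the controller's open-socket count can be sampled directly while nodes boot; a rough diagnostic sketch (Linux-only, counts IPv4 TCP sockets from /proc, so no extra tooling is needed):

```shell
# Rough diagnostic: count open IPv4 TCP sockets on the controller.
# /proc/net/tcp has one header line, then one line per socket.
count=$(( $(wc -l < /proc/net/tcp) - 1 ))
echo "open tcp sockets (ipv4): $count"
# Run this in a loop (e.g. with watch) during a mass boot to see whether
# the count climbs past the threshold AlertX is flagging.
```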

aphmschonewille avatar Oct 14 '25 20:10 aphmschonewille