Server Experiences Periodic High Resource Usage and Timeouts
Problem: The server consumes excessive resources (CPU/memory), and its performance degrades.
Manifestation:
When trying to connect, I get a timeout error.
However, existing connections continue to work correctly.
When it occurs: At varying intervals, some time after the server was last restarted.
Workaround: I have to manually restart the server.
Expected behavior: The complete absence of such issues.
Additional context: Is there any way to fix this? Charts attached.
Thanks for the detailed report. To help us investigate, could you provide some more details?
- How many users does the server typically serve?
- Do you notice any specific events or user actions that coincide with the periods of high resource consumption? For example, are there surges in user count?
Let me also involve our server expert, @fortuna. Could this be related to our Prometheus data collection?
@supermetrolog Can you take a look at the Prometheus Metrics? That should give us more insights into what's going on.
You can see resources per process, number of connections, errors, etc.
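If it helps to capture those numbers before they are lost, here is a minimal sketch that totals each metric from a scrape in the Prometheus text exposition format. The metric names in the usage example are made up for illustration; substitute whatever your server actually exports, and fetch the text from your own metrics endpoint.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 12.5 [optional timestamp]
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$'
)


def sum_metrics(exposition_text):
    """Return {metric_name: sum of its samples across all label sets}."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, _labels, value = match.groups()
        try:
            totals[name] += float(value)
        except ValueError:
            # Not a number we can parse; ignore this sample.
            continue
    return dict(totals)


# Hypothetical scrape output, for illustration only.
example = """\
# HELP example_connections Open connections per key
example_connections{key="1"} 3
example_connections{key="2"} 4
example_cpu_seconds_total 12.5 1700000000000
"""
print(sum_metrics(example))
```

Running this on a schedule and appending the result to a log would keep a small history even if the Prometheus data itself gets wiped.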
I have a similar issue: when I try to create a new key via the API, CPU and disk usage get very high and the API stops responding. I have about 140 keys on my server.
Sorry for the late reply.
Unfortunately, there are no metrics from the exact moment of the problems I described in the issue. That's because a while ago I set up a cron job that deletes all Prometheus metrics once a week, since the server was also degrading without that cleanup.
But the connection issues persist: requests to the server via the API time out.
The current metrics are a bit different; I just see very high CPU usage. Before, the I/O load was increasing, but now it's the "Other Process" category. Actually, it's Shadowsocks consuming the CPU – I checked it using the top utility.
It's very strange that there are so many active keys, because my connection wrapper is designed to delete the previous client key on every new connection. And I definitely don't have that many clients; there are 20 people at most.
I've noticed that even though the key deletion method is called, the key is not actually being removed. So the number of keys should remain the same after each reconnection, but it's increasing.
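One way to confirm this is to delete a key and then list the keys again straight away. The sketch below assumes the management API exposes `GET /access-keys` (returning an `accessKeys` array) and `DELETE /access-keys/{id}`, which is my understanding of the Outline API; please check it against your server. `http_request` and `delete_and_verify` are names made up for this example.

```python
import json
import ssl
import urllib.request


def http_request(api_url, method, path):
    """Send one request to the management API; return (status, parsed JSON or None)."""
    # Assumption: the server uses a self-signed certificate, so verification
    # is disabled here for brevity. Pin the certificate in real use.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = urllib.request.Request(api_url.rstrip("/") + path, method=method)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        body = resp.read()
        return resp.status, (json.loads(body) if body else None)


def delete_and_verify(request, key_id):
    """Delete a key, then list keys again to confirm it is really gone.

    `request` is any callable (method, path) -> (status, json_body).
    Returns True if the key no longer appears in the listing.
    """
    request("DELETE", f"/access-keys/{key_id}")
    _status, body = request("GET", "/access-keys")
    remaining = {str(key["id"]) for key in (body or {}).get("accessKeys", [])}
    return str(key_id) not in remaining
```

To run it against a real server, bind the URL first, for example `request = functools.partial(http_request, api_url)`, where `api_url` is your management API URL. A `False` result would mean the delete call was accepted but the key is still listed, which matches the growing key count.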
Maybe this is the cause of the degradation, but it's strange, because it's only 655 keys, not 655 connections. I don't know what the problem is with handling 655 keys...
These are my node's personal metrics:
And then here are the Shadowsocks server metrics:
Here, I stopped the shadowbox container and then started it again. While it was off, the state returned to normal and the load dropped. When I started it back up, there was a spike.
At that moment, though, at most a couple of clients could have connected using static keys. The number of keys did not increase.