[Bug]: High (Prometheus) CPU usage for new unified metrics server endpoint [v1.12.0]

Open mohammad051 opened this issue 1 year ago • 28 comments

Application

Outline Manager

Describe the bug

Hello. Today, Outline Manager was automatically updated to version 1.17.0. All server resources, such as RAM and CPU, went to 100%.

When I reboot the server, it recovers, but when I open Outline Manager and try to open the management key, the resources fill up again and the server crashes.

How can I disable the automatic update of Outline Manager and use the previous version to solve the problem of the new version?

Steps to reproduce

1. Open the Outline Manager

What did you expect to happen?

No response

What actually happened?

No response

Outline Version

1.17.0

What operating system are you using?

Windows

Operating System Version

No response

Screenshots and Videos

No response

mohammad051 avatar Feb 21 '25 11:02 mohammad051

Yeah, bro! The same problem. It drives me crazy. All of my Outline servers went to 100% CPU utilization. At first I thought it was a server problem.

I already submitted a ticket to Outline support.

Luckily I had a backup of Outline Manager on my flash drive and it works well.

Waiting for the solution.

BossyBigBoss avatar Feb 21 '25 12:02 BossyBigBoss

Brother, I installed version 1.14.0 on my system, but it updates automatically. Which version did you install?

mohammad051 avatar Feb 21 '25 14:02 mohammad051

Thanks for the report. We're looking into it. If you can share more details of what you're experiencing, please feel free to share them in here.

sbruens avatar Feb 21 '25 15:02 sbruens

Outline Manager was automatically updated today, and 30 of my servers are down. They are Amazon Lightsail servers.

I thought it was a server problem, but it wasn't.

I rebooted the server and saw that it came up.

After opening Outline Manager, I saw that the CPU was at 100%, and the server failed and did not come back up.

Please check and fix the problem. I really don't know what to do to find a solution.

thank you

mohammad051 avatar Feb 21 '25 15:02 mohammad051

Thanks @mohammad051. We introduced some new metrics in the Manager UI, the calculation of which I assume is the cause of this high CPU. For my understanding, how many access keys do your servers roughly have?

sbruens avatar Feb 21 '25 15:02 sbruens

Brother, each of my servers has between 30 and 60 active keys. I didn't have this problem on the previous version. How can I keep using the previous version without the Manager updating automatically?

mohammad051 avatar Feb 21 '25 15:02 mohammad051

Thanks @mohammad051. We introduced some new metrics in the Manager UI, the calculation of which I assume is the cause of this high CPU. For my understanding, how many access keys do your servers roughly have?

I have the same issue on Amazon servers. The servers have 40-50 active access keys. Even if I close Outline Manager, my Outline servers remain at 100% CPU utilization. Only a reboot helps.

BossyBigBoss avatar Feb 21 '25 16:02 BossyBigBoss

Yeah, bro! The same problem. It drives me crazy. All of my Outline servers went to 100% CPU utilization. At first I thought it was a server problem. I already submitted a ticket to Outline support. Luckily I had a backup of Outline Manager on my flash drive and it works well. Waiting for the solution.

Brother, I installed version 1.14.0 on my system, but it updates automatically. Which version did you install?

I have a backup of the previous version of Outline Manager for Windows, 1.15.2. I disabled Wi-Fi on my PC to prevent the update to the new version, ran the backup version, got "Server Unreachable", and then enabled Wi-Fi and clicked Retry to connect to the Outline server.

BossyBigBoss avatar Feb 21 '25 16:02 BossyBigBoss

An old Manager is a workaround, but we completed a rollback of the server back to v1.11.0. Your servers should pick up this change within the hour, when watchtower looks for a new image to pull.

The continued CPU usage is surprising; it implies something is still doing work even though the Manager is not asking for anything. Can I ask whether you are also experiencing memory issues?

sbruens avatar Feb 21 '25 17:02 sbruens

After opening Outline Manager, CPU usage climbs so quickly that we don't even have a chance to log in to the server, so we don't know whether memory is involved.

I went back to the old version of Outline Manager, but it immediately updates to the new version and the problems start again.

How can I disable automatic updates to fix the problem?

Please help me.

mohammad051 avatar Feb 21 '25 17:02 mohammad051

It's not a Manager issue; it's a server issue, which we rolled back earlier. Are you saying this is still happening for servers running the rolled back v1.11.0 version?

sbruens avatar Feb 21 '25 17:02 sbruens

Thank you, brother. The server is now on version 1.11.0 and all the problems are solved.

Thank you very much.

mohammad051 avatar Feb 21 '25 18:02 mohammad051

Thank you for confirming @mohammad051 and I'm glad to hear that resolved the immediate outage. I'm just going to move this over to the server repo so we can track the work to fix this over there.

sbruens avatar Feb 21 '25 18:02 sbruens

Thank you very much for your help.

You helped me a lot.

Thank you for your quick answers.

Thank you for solving this problem in the shortest possible time.

mohammad051 avatar Feb 21 '25 19:02 mohammad051

We have spent some more time on this and confirmed that Prometheus can cause increased CPU usage. Commit d262f5242f5385d41578af9a47de68e31b83d5ad mitigates the issue, though we still need to determine the full root cause.

If anyone who ran into this issue is able to do a test run with the new release candidate containing the hotfix, that would give us more confidence before releasing to a wider audience. New release candidate image:

quay.io/outline/shadowbox:v1.12.2-rc2

sbruens avatar Feb 23 '25 14:02 sbruens

I would like to say a big thank you. Unfortunately, I can't run an experiment on my servers because they are live Outline servers for my users. It's too risky.

BossyBigBoss avatar Feb 24 '25 12:02 BossyBigBoss

Dear @sbruens

Today the problem came back. As soon as I opened Outline Manager for Windows (the newest version), CPU utilization went up to 100%. There is no problem with Outline Manager for Windows 1.15.2.

I've attached a screenshot from Amazon AWS to illustrate it.

Image

I have 3 Outline servers.

The problem occurs on 2 of them, which run Ubuntu 20.04. The third server runs Ubuntu 24.04 and has no such problem.

BossyBigBoss avatar Feb 26 '25 09:02 BossyBigBoss

Thanks @BossyBigBoss for letting us know.

Very helpful to know it may be an Ubuntu 20.04 issue. Is there anything else that differs between the servers beyond the Ubuntu version? Are you able to check what Docker image each of them is running?

sbruens avatar Feb 26 '25 15:02 sbruens

Hello, and thank you for your quick reply.

I can provide you with an Amazon server to run all the tests on, if you tell me how to send it to you.

Thank you.

mohammad051 avatar Feb 26 '25 16:02 mohammad051

@sbruens

Image

Thanks @BossyBigBoss for letting us know.

Very helpful to know it may be an Ubuntu 20.04 issue. Is there anything else that differs between the servers beyond the Ubuntu version? Are you able to check what Docker image each of them is running?

Can you please tell me how to check that? I used ChatGPT and, following its answer, checked on Ubuntu 20.04.

On Ubuntu 24.04 the result is the same.

BossyBigBoss avatar Feb 27 '25 08:02 BossyBigBoss
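
For readers wondering how to check which image a server is running: the following is a minimal sketch, not the command from the screenshot above. It assumes the container is named shadowbox (the name used with docker logs elsewhere in this thread) and uses the standard `docker inspect --format` option. The helper names `split_image_ref` and `container_image` are hypothetical.

```python
import subprocess


def split_image_ref(image_ref: str) -> tuple[str, str]:
    """Split an image reference into (repository, tag or digest)."""
    # A digest reference looks like 'repo@sha256:...'.
    if "@" in image_ref:
        name, digest = image_ref.split("@", 1)
        return name, digest
    # Only a colon after the last '/' separates the tag; an earlier colon
    # belongs to a registry port such as 'localhost:5000/app'.
    last_slash = image_ref.rfind("/")
    colon = image_ref.rfind(":")
    if colon > last_slash:
        return image_ref[:colon], image_ref[colon + 1:]
    return image_ref, "latest"


def container_image(container: str = "shadowbox") -> str:
    """Return the image reference the container was created from.

    ASSUMPTION: the container is called 'shadowbox' and Docker is installed.
    """
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.Config.Image}}", container],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    repository, tag = split_image_ref(container_image())
    print(f"repository={repository} tag={tag}")
```

If the container was started from a moving tag, the tag alone will not reveal which release is actually running, so comparing the image ID or digest between servers may be more informative.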

Thanks @BossyBigBoss, ChatGPT was right: that was indeed the answer I was looking for. It confirms they are all on the same (latest) image. Are the specs of your Lightsail instances the same for all 3 servers? By that I mean CPU, RAM, etc. You should be able to find this information in the overview:

Image

I can provide you with an Amazon server to run all the tests on, if you tell me how to send it to you.

Thanks for the offer, but there's no need. However, I am having no luck reproducing the issue. I spun up some Lightsail instances yesterday with what I think is the smallest blueprint: 512 MB RAM, 2 vCPUs, 20 GB SSD. I used Ubuntu 20.04, but it's not spiking in CPU the way you're experiencing. Not even when I hit the endpoint directly, rapidly and repeatedly.

@BossyBigBoss and @mohammad051, can you confirm that you are both only seeing this on AWS Ubuntu 20.04 machines?

Some other debugging suggestions:

  • You could use htop to investigate the processes more directly. That will tell us (and perhaps confirm) whether it's Prometheus or some other service that's the culprit at this point. Here are some instructions: https://support.cloudways.com/en/articles/5120765-how-to-monitor-system-processes-using-htop-command

  • Look at the docker logs for the Shadowbox container: docker logs shadowbox. See also https://docs.docker.com/reference/cli/docker/container/logs/

I know it's not necessarily a fix, but are you able to upgrade the instances to Ubuntu 24.04, as recommended by Amazon? Ubuntu 20.04 is coming up to end of support in May.

sbruens avatar Feb 27 '25 19:02 sbruens
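
As a non-interactive alternative to htop, the process check suggested above could be sketched as follows. This is an illustration under assumptions, not a tool from the Outline project: it relies on the Linux procps `ps -eo pid,pcpu,comm` output format, and the function names `parse_ps` and `top_cpu` are hypothetical.

```python
import subprocess


def parse_ps(output: str, limit: int = 5) -> list[tuple[int, float, str]]:
    """Parse 'PID %CPU COMMAND' rows and return the busiest processes first."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) == 3:
            pid, cpu, command = parts
            rows.append((int(pid), float(cpu), command.strip()))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


def top_cpu(limit: int = 5) -> list[tuple[int, float, str]]:
    """Run ps once and return the processes using the most CPU."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout, limit)


if __name__ == "__main__":
    for pid, cpu, command in top_cpu():
        print(f"{pid:>7} {cpu:>5.1f}% {command}")
```

Because the script exits immediately, it could be run on a schedule with its output appended to a file, so that a record survives even if the server becomes unreachable during a spike.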

@sbruens Thank you for the assistance.

I am using Amazon servers with the following specs: 1 GB RAM, 2 vCPUs, 40 GB SSD. The CPU overload occurs on Ubuntu 20.04.

The htop command can't be used because the server is inaccessible during the CPU overload. docker logs shadowbox shows only the logs since the server was rebooted.

BossyBigBoss avatar Feb 28 '25 07:02 BossyBigBoss

@sbruens It seems that the issue is related to Metrics.

In the new version, Metrics analysis is performed within the server, which leads to high CPU load at this time. I think it would be better if the data were collected from the server by Outline Manager, and the analysis was performed on the system where Outline Manager is running, rather than on the server itself.

Currently, this bug still exists and causes the main server to crash. Even after the CPU returns to normal, the server's performance remains very poor.

Please make the server and Outline Manager updates optional so that when the system is in a stable state, it remains in that condition.

The same issue has occurred with Outline clients as well. The previous versions must be completely removed and reinstalled. With the update, users experience frequent disconnections.

SadeghSalman avatar Mar 04 '25 17:03 SadeghSalman

Guys, how can I save users' access keys and import them into a new installation? I am tired of this bug and am forced to use an old copy of Outline Manager.

BossyBigBoss avatar Mar 05 '25 12:03 BossyBigBoss

We're still struggling to reproduce this issue, especially with the added cache layer https://github.com/Jigsaw-Code/outline-server/pull/1643. https://github.com/Jigsaw-Code/outline-server/pull/1646 is also in-flight to further reduce load, but if anyone is able to help debug this, it could help us understand why it's happening on some servers but not others.
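
To illustrate the general idea of a cache layer in front of an expensive metrics query, here is a minimal time-based cache sketch. It is not the code from the pull requests linked above; the class name `TtlCache` and its parameters are hypothetical, and the real server is not written in Python.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TtlCache(Generic[T]):
    """Recompute a value at most once per ttl_seconds; otherwise reuse it."""

    def __init__(self, compute: Callable[[], T], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._expires_at = float("-inf")  # forces a computation on first use

    def get(self) -> T:
        now = self._clock()
        if now >= self._expires_at:
            # The cached value is missing or stale: run the expensive query.
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value
```

With a cache like this, repeated requests from an open Manager window within the time-to-live would reuse one result instead of triggering a new query each time, at the cost of slightly stale numbers.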

If other affected users could add details about the servers where they are seeing this, it would help us understand which server setups are affected and whether there is a specific pattern. Any logs or additional investigations on your server will also be valuable.

@BossyBigBoss you can use the management API to export and import keys. Someone also documented an alternative way to do it in https://github.com/Jigsaw-Code/outline-apps/issues/1905 which may be easier to do.

sbruens avatar Mar 05 '25 19:03 sbruens
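
For the management API route mentioned above, a rough sketch of exporting keys from one server and recreating them on another might look like this. Several things here are assumptions to verify against your server version: that GET /access-keys returns a JSON object with an accessKeys list, that POST /access-keys accepts name, method, password, and port in the request body, and that the API URL is the apiUrl value from the server's management key. The helper names are hypothetical. Certificate checks are disabled only to keep the sketch short; the management API typically uses a self-signed certificate, and a real script should pin it instead.

```python
import json
import ssl
import urllib.request


def to_create_payload(key: dict) -> dict:
    """Keep only the fields needed to recreate a key on another server."""
    wanted = ("name", "method", "password", "port")
    return {field: key[field] for field in wanted
            if key.get(field) not in (None, "")}


def _call(api_url: str, path: str, method: str = "GET", body=None):
    # ASSUMPTION: verification is switched off for brevity only. Do not use
    # this as-is on an untrusted network; pin the server certificate instead.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(
        api_url.rstrip("/") + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)


def export_keys(api_url: str) -> list:
    """Fetch all access keys from the source server (assumed response shape)."""
    return _call(api_url, "/access-keys")["accessKeys"]


def import_keys(api_url: str, keys: list) -> list:
    """Recreate each exported key on the destination server."""
    return [_call(api_url, "/access-keys", "POST", to_create_payload(key))
            for key in keys]
```

A typical use would be `import_keys(new_api_url, export_keys(old_api_url))`. Note that this sketch does not carry over key IDs or data limits, and access URLs that contain the old server's address would still need to be reissued to users.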

Guys, how can I save users' access keys and import them into a new installation? I am tired of this bug and am forced to use an old copy of Outline Manager.

Find this on the source server.

/opt/outline

Transfer the entire folder to the destination server.

scp -r /opt/outline user@destination-server:/opt/outline

Restart the destination server.

SadeghSalman avatar Mar 07 '25 04:03 SadeghSalman
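
Before copying the folder as described above, it may be worth taking an archive of it so the original state can be restored if the transfer goes wrong. The following is a minimal sketch using Python's standard tarfile module. It assumes, as the comment above states, that the server state lives in /opt/outline; the function name `archive_state` and the archive path are hypothetical.

```python
import tarfile
from pathlib import Path


def archive_state(state_dir: str, archive_path: str) -> list[str]:
    """Pack state_dir into a gzip-compressed tar file and list its entries."""
    source = Path(state_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        # Store entries under the directory's own name, e.g. 'outline/...'.
        archive.add(str(source), arcname=source.name)
    # Re-open the archive to confirm it is readable and see what it contains.
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())


if __name__ == "__main__":
    # ASSUMPTION: run with enough privileges to read /opt/outline.
    for name in archive_state("/opt/outline", "/tmp/outline-backup.tar.gz"):
        print(name)
```

The resulting single file can then be transferred with scp in place of the directory and unpacked on the destination.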

@sbruens

I have a suggestion. Since this high CPU usage is for metric calculations, I think we should write a script that deletes previous data except for the consumed volume and updates the shape of Prometheus data. This way, the issue will be resolved.

The components of this script can be flexible, but in my opinion, writing it in Python would be better since it can run on all operating systems.

I suspect that the new version of Prometheus is causing this issue, and updating the Outline server also updates Prometheus, which disrupts the data.

SadeghSalman avatar Mar 07 '25 10:03 SadeghSalman

@SadeghSalman

Thank you very much, I will try. It seems upgrading the server to the newest Ubuntu (in my case) is the only solution.

BossyBigBoss avatar Mar 07 '25 18:03 BossyBigBoss