Node hangs/crashes after a while and keeps burning CPU.
Bug description
After a couple of hours the bee-node seems to crash, but the threads are still running and all CPUs are at 100%.
Another possibly relevant observation: once the node has run for a couple of hours, it is impossible to kill it gracefully with -15 (SIGTERM). I have to kill it with -9 (SIGKILL) and delete the storage to get it running again. This is required regardless of whether the node has already hit the hang issue or is still running normally.
Rust version
cargo 1.52.0 (69767412a 2021-04-21)
release: 1.52.0
commit-hash: 69767412acbf7f64773427b1fb53e45296712c3c
commit-date: 2021-04-21
Bee version
The issue appeared after a day or two on a2931c6.
It appears far more frequently (after a couple of hours) on f708093 (most recent chrysalis-pt-2 branch).
Hardware specification
- Operating system: Debian GNU/Linux 10 (buster)
- RAM: 16GB
- Cores: 4
- Device: root server
Steps to reproduce the bug
Keep the node built with the dashboard running for a while. It may be relevant that I currently run it with nohup.
Expected behaviour
Not dying
Actual behaviour
Dying
Errors
This is the information I was able to get with level_filter = error.
The log is cut since it goes on for a while with just different indexes.
The stdout log reports constant reconnecting to peers and 0 messages in/out.
2021-05-23 12:55:56 bee_protocol::workers::solidifier ERROR Requested milestone 0 message id not present in the tangle.
2021-05-23 12:55:56 bee_node::plugins::mqtt::manager ERROR Disconnecting mqtt broker failed: PahoDescr(-3, "Client disconnected").
2021-05-23 12:55:56 bee_node::plugins::mqtt ERROR Creating mqtt manager failed Mqtt(PahoDescr(-1, "TCP/TLS connect failure")).
2021-05-23 12:58:17 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message b4487bc2a53d8094a3c1939ff99e731dca5497067f2694124755a984b2096cfc: expected indexation payload.
2021-05-23 12:59:26 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message 4479f006fcf80498000aac1108d5d28269275864d97bf5857217892391508a61: expected indexation payload.
2021-05-23 13:00:28 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message cc0825419b8ccb1135839c23fd8ab6d913547877d2882b0a0a61ea001e65aee0: expected indexation payload.
2021-05-23 13:03:52 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message 9904cb5eaeb50bbdba8e84487830ceeada279833903bb70712d0e22696c79527: expected indexation payload.
2021-05-23 13:12:51 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message d180beeb9c7008b0bb78bb2cafa078b737f219a51f8e32503162a2ede02a2e12: expected indexation payload.
2021-05-23 13:15:22 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message 89a6f6a54d80540578b81a3ba28e8e84c75bdfcdcea57834c08cb87deafb8d32: expected indexation payload.
2021-05-23 14:03:02 warp::reject ERROR unhandled custom rejection, returning 500 response: Forbidden
2021-05-23 14:03:03 warp::reject ERROR unhandled custom rejection, returning 500 response: Forbidden
2021-05-23 14:03:07 warp::reject ERROR unhandled custom rejection, returning 500 response: Forbidden
2021-05-23 14:31:08 warp::reject ERROR unhandled custom rejection, returning 500 response: Forbidden
Hi, thanks for the report. I actually encountered the same issue; I'm investigating.
Is there any news on this issue yet? I've also come across this.
@DuncanConroy Is there any information you can add, by any chance? Logs, machine specs, anything that can help with the investigation.
debug.log I've attached the debug-level log from today. I "believe" the CPU issues appear after the MQTT manager failed (not sure what it's used for, though). It only took a few minutes until the issues appeared. I do have another log at trace level, but it's a couple hundred MB in size.
Hardware specification
- Operating system: Windows 11
- RAM: 64GB
- Cores: 8 (16 logical)
- Device: Desktop PC
Please let me know if there's anything I can do to help. I'm familiar with Rust, but not with the project :)
I just wanted to add that I've had a similar issue with a private tokio-based application, where the receiver of an mpsc channel was closed prematurely while the sending half still tried to send messages. This also burnt CPU and led to unresponsiveness at some point. Maybe it's related and a hint for further investigation. A minimal sketch of the pattern I mean is below.
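To make that concrete, here is a minimal, hypothetical sketch using tokio's mpsc (not code from bee): the receiver is dropped early, every subsequent `send` fails immediately, and a sender loop that ignores the error degenerates into a busy loop pegging one core.

```rust
// Minimal sketch, assuming the `tokio` crate with the "full" feature.
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<u32>(16);

    // Receiver is dropped prematurely, e.g. because its worker shut down.
    drop(rx);

    // The sending half keeps trying: with the receiver gone, every `send`
    // returns Err(SendError) instantly, so this loop spins at 100% CPU.
    let mut i = 0u32;
    loop {
        if tx.send(i).await.is_err() {
            // A correct worker would break (or log and shut down) here;
            // swallowing the error is what burns the CPU.
            // break;
        }
        i = i.wrapping_add(1);
    }
}
```

Running this pins one core without doing any useful work, which roughly matches the "100% CPU but unresponsive" symptom; whether bee's workers actually follow this pattern is only a guess on my part.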
The MQTT manager should be fixed with #828.