Node hangs/crashes after a while and keeps burning CPU.
Bug description
After a couple of hours the bee-node seems to crash, but the threads are still running and all CPUs are at 100%.
Another possibly relevant observation: once the node has run for a couple of hours, it is impossible to kill it gracefully with -15 (SIGTERM). I have to kill it with -9 (SIGKILL) and delete the storage to get it running again. This is required regardless of whether the node has already hit the hang issue or is still running normally.
Rust version
cargo 1.52.0 (69767412a 2021-04-21)
release: 1.52.0
commit-hash: 69767412acbf7f64773427b1fb53e45296712c3c
commit-date: 2021-04-21
Bee version
The issue appeared after a day or two on a2931c6.
It appears far more frequently (after a couple of hours) on f708093 (most recent chrysalis-pt-2 branch).
Hardware specification
- Operating system: Debian GNU/Linux 10 (buster)
- RAM: 16GB
- Cores: 4
- Device: root server
Steps to reproduce the bug
Keep the node built with the dashboard running for a while. It may be relevant that I currently run it with nohup.
Expected behaviour
Not dying
Actual behaviour
Dying
Errors
This is the information I was able to get with level_filter = error.
The log is cut since it goes on for a while with just different indexes.
The stdout log reports constant reconnecting to peers and 0 messages in/out.
2021-05-23 12:55:56 bee_protocol::workers::solidifier ERROR Requested milestone 0 message id not present in the tangle.
2021-05-23 12:55:56 bee_node::plugins::mqtt::manager ERROR Disconnecting mqtt broker failed: PahoDescr(-3, "Client disconnected").
2021-05-23 12:55:56 bee_node::plugins::mqtt ERROR Creating mqtt manager failed Mqtt(PahoDescr(-1, "TCP/TLS connect failure")).
2021-05-23 12:58:17 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message b4487bc2a53d8094a3c1939ff99e731dca5497067f2694124755a984b2096cfc: expected indexation payload.
2021-05-23 12:59:26 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message 4479f006fcf80498000aac1108d5d28269275864d97bf5857217892391508a61: expected indexation payload.
2021-05-23 13:00:28 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message cc0825419b8ccb1135839c23fd8ab6d913547877d2882b0a0a61ea001e65aee0: expected indexation payload.
2021-05-23 13:03:52 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message 9904cb5eaeb50bbdba8e84487830ceeada279833903bb70712d0e22696c79527: expected indexation payload.
2021-05-23 13:12:51 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message d180beeb9c7008b0bb78bb2cafa078b737f219a51f8e32503162a2ede02a2e12: expected indexation payload.
2021-05-23 13:15:22 bee_protocol::workers::message::payload::transaction ERROR Missing or invalid payload for message 89a6f6a54d80540578b81a3ba28e8e84c75bdfcdcea57834c08cb87deafb8d32: expected indexation payload.
2021-05-23 14:03:02 warp::reject ERROR unhandled custom rejection, returning 500 response: Forbidden
2021-05-23 14:03:03 warp::reject ERROR unhandled custom rejection, returning 500 response: Forbidden
2021-05-23 14:03:07 warp::reject ERROR unhandled custom rejection, returning 500 response: Forbidden
2021-05-23 14:31:08 warp::reject ERROR unhandled custom rejection, returning 500 response: Forbidden
Hi, thanks for the report. I actually encountered the same issue; I'm investigating.
Is there any news on this issue yet? I've also come across this.
@DuncanConroy Is there any information you can add, by any chance? Logs, machine specs, anything that can help with the investigation.
debug.log I've attached the debug-level log from today. I "believe" the CPU issues appear after the MQTT manager failed (not sure what it's used for, though). It only took a few minutes until the issues appeared. I do have another log at trace level, but it's a couple hundred MB in size.
Hardware specification
- Operating system: Windows 11
- RAM: 64GB
- Cores: 8 (16 logical)
- Device: Desktop PC
Please let me know if there's anything I can do to help. I'm familiar with Rust, but not with the project :)
I just wanted to add that I've had a similar issue with a private tokio-based application, where the receiver of an mpsc channel was closed prematurely while the sending half still tried to send messages. This also burnt CPU and led to unresponsiveness at some point. Maybe it's related and a hint for further investigation. A minimal sketch of the pattern I mean is below.
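To make that concrete, here is a minimal, hypothetical sketch using tokio's mpsc (not code from bee): the receiver is dropped early, every subsequent `send` fails immediately, and a sender loop that ignores the error degenerates into a busy loop pegging one core.

```rust
// Minimal sketch, assuming the `tokio` crate with the "full" feature.
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel::<u32>(16);

    // Receiver is dropped prematurely, e.g. because its worker shut down.
    drop(rx);

    // The sending half keeps trying: with the receiver gone, every `send`
    // returns Err(SendError) instantly, so this loop spins at 100% CPU.
    let mut i = 0u32;
    loop {
        if tx.send(i).await.is_err() {
            // A correct worker would break (or log and shut down) here;
            // swallowing the error is what burns the CPU.
            // break;
        }
        i = i.wrapping_add(1);
    }
}
```

Running this pins one core without doing any useful work, which roughly matches the "100% CPU but unresponsive" symptom; whether bee's workers actually follow this pattern is only a guess on my part.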
The MQTT manager should be fixed with #828.