HUGE MEM SPIKES / ejabberd_system_monitor:handle_overload : The system is overloaded with
Environment
- ejabberd version: ejabberd 23.04
- Erlang version: Erlang/OTP 22 [erts-10.4.4] [source] [64-bit]
- OS: Linux (Almalinux8) Intel Xeon x64
- Installed from: source
Configuration (only if needed): grep -Ev '^$|^\s*#' ejabberd.yml
loglevel: 4
shaper:
  normal:
    rate: 3000
    burst_size: 6000
  fast: 100000
shaper_rules:
  max_user_sessions: 10
  max_user_offline_messages:
    5000: xxxx
    100: all
  c2s_shaper:
    none: **xxxx**
    normal: all
  s2s_shaper: fast
auth_method: sql
host_config:
  "xxxxxx":
    extauth_program: "xxxxxxxxxx"
    auth_method: [external]
    auth_use_cache: false
sql_pool_size: 6
sql_keepalive_interval: 150
listen:
  -
    port: xxxx
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
  -
    port: xxxx
    module: ejabberd_http
    tls: true
  -
modules:
  mod_avatar:
  mod_bosh:
    max_inactivity: 150
  mod_caps: {}
  mod_carboncopy: {}
  mod_client_state:
  mod_mam:
  mod_offline:
  mod_ping: {}
  mod_pubsub:
  mod_push: {}
  mod_push_keepalive: {}
  mod_roster:
  mod_shared_roster:
  mod_stats: {}
  mod_stream_mgmt:
P.S. Some values are deliberately omitted from the config for security purposes.
Errors from error.log/crash.log
Nine hours after a dawn restart, with a relatively small internal userbase that mainly uses pubsub as a "logged in or not" status indicator:
2024-02-21 15:41:45.536850+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with **1072045** messages queued by 7 process(es) (1%) from the following applications: ejabberd (**1072045** msgs); the top processes are:
** <0.22333.0>: msgs = 404701, memory = 1592112560, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22332.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22392.0>: msgs = 383314, memory = 2501936152, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22391.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22343.0>: msgs = 257521, memory = 1070150064, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22342.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22412.0>: msgs = 26502, memory = 424769504, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22332.0>: msgs = 3, memory = 2584558100, initial_call = ejabberd_bosh:init/1, current_function = lists:-filter/2-lc$^0/1-0-/2, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
2024-02-21 15:41:45.537348+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 153149 message(s) each
Bug description
After upgrading our server to an absurd amount of memory, we've recently been debugging server performance and noticed some HUGE spikes in memory usage (30 GB+ spikes in resident memory, 60 GB+ virtual) coming from beam.smp/ejabberd, whereas its average resident size remains steadily below ~400 MB before and after the spikes.
PID TID MINFLT MAJFLT VSTEXT VSLIBS VDATA VSTACK LOCKSZ VSIZE RSIZE PSIZE VGROW RGROW SWAPSZ RUID EUID MEM CMD 1/72
1862 - 7672e4 0 3.6M 9.6M 58.0G 132.0K 0.0K 62.9G 36.7G 0B 50.4G 31.1G 0B ejabberd ejabberd 39% beam.smp
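While a spike is in progress we can attach ejabberdctl debug and check where that memory actually lives; the snippet below is just a sketch using standard Erlang introspection calls (nothing ejabberd-specific):

```erlang
%% Overall VM memory breakdown (processes vs. binaries vs. ETS, etc.):
erlang:memory().

%% The 10 processes currently holding the most memory:
lists:sublist(
  lists:reverse(
    lists:keysort(2,
      [{P, M} || P <- erlang:processes(),
                 {memory, M} <- [erlang:process_info(P, memory)]])),
  10).
```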
So, after looking at some logs we came upon these:
2024-02-20 22:55:00.738664+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with **1967163** messages queued by 10 process(es) (2%) from the following applications: ejabberd (**1967163** msgs); the top processes are:
** <0.5005.3>: msgs = 819617, memory = 6331179968, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.5004.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.4927.3>: msgs = 694653, memory = 4190023404, initial_call = xmpp_stream_in:init/1, current_function = p1_server:collect_messages/3, ancestors = [<0.4926.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.5029.3>: msgs = 452852, memory = 3192091880, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.5028.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.4989.3>: msgs = 30, memory = 690304, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.4988.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.4909.3>: msgs = 3, memory = 3101465796, initial_call = ejabberd_bosh:init/1, current_function = mnesia_lib:db_get/3, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
2024-02-20 22:55:00.739087+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 196716 message(s) each
2024-02-20 22:56:21.989115+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.5028.3> consumes more than 5% of OS memory (msgs = 0, memory = 6431197880, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)
The only other possibly relevant information from that log is that it was filled with: Closing c2s session for ... Stream reset by peer.
2024-02-20 22:47:52.086816+00:00 [info] <0.4826.3>@ejabberd_c2s:process_auth_result/3:272 (tcp|<0.4826.3>) Accepted c2s PLAIN authentication for xxxx@yyyyy by external backend from customip
2024-02-20 22:47:52.087641+00:00 [info] <0.4826.3>@ejabberd_c2s:bind/2:450 (tcp|<0.4826.3>) Opened c2s session for rxxxx@yyyyy/zzzz
2024-02-20 22:47:52.103140+00:00 [info] <0.4826.3>@ejabberd_c2s:process_terminated/2:295 (tcp|<0.4826.3>) Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer
(Could it be that those "stream reset by peer" events are actually accumulating as undelivered messages? If so, how do we bypass that, given it is definitely not something that can be handled on the client side since this is a known issue? Can we forgo those messages and have them purged somehow?)
To clean things up a bit, we cleared our logs and restarted the server yesterday at dawn to obtain a cleaner, shorter log. Here's a clean copy of the past 12 hours (as you can see, only a few lines are shown; it was dawn after all, yet the problem still shows up):
FULL LOG (not too big, I promise):
2024-02-21 04:13:03.210833+00:00 [info] <0.110.0>@ejabberd_config:load/1:82 Loading configuration from /......../ejabberd.yml
2024-02-21 04:13:03.235022+00:00 [notice] <0.110.0>@ejabberd_logger:set/1:374 Changing loglevel from 'info' to 'warning'
2024-02-21 05:58:55.592409+00:00 [warning] <0.1755.0>@ejabberd_c2s:process_auth_result/3:280 (http_bind|<0.1754.0>) Failed c2s HHHHHHHH authentication for r2c4e_8740f7e2a@customname from 79.116.194.11: Invalid username or password
2024-02-21 06:00:50.374004+00:00 [warning] <0.1775.0>@ejabberd_c2s:process_auth_result/3:280 (tcp|<0.1775.0>) Failed c2s HHHHHHHH authentication for r2c4e_8740f7e2a@customname from customip: Invalid username or password
--- MORE LINES WITH :: Failed c2s & Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer ---
2024-02-21 11:17:55.749593+00:00 [warning] <0.10425.0>@ejabberd_c2s:process_auth_result/3:280 (tcp|<0.10425.0>) Failed c2s HHHHHHHH authentication for r2c4e_06bb92905@customname from customip: Invalid username or password
2024-02-21 11:41:17.019494+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with **33137** messages queued by 3 process(es) (1%) from the following applications: ejabberd (**33137** msgs); the top processes are:
** <0.11261.0>: msgs = 33134, memory = 233377736, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.11260.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.11196.0>: msgs = 2, memory = 34592, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.11195.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.11195.0>: msgs = 1, memory = 1038675412, initial_call = ejabberd_bosh:init/1, current_function = lists:keysearch/3, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
2024-02-21 11:41:17.019824+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 1 process(es) consuming more than 11046 message(s) each
2024-02-21 12:36:39.426190+00:00 [warning] <0.14383.0>@ejabberd_c2s:process_auth_result/3:280 (http_bind|<0.14382.0>) Failed c2s HHHHHHHH authentication for r2c2a_8ddc25571@customname from 92.9.103.72: Invalid username or password
2024-02-21 12:59:25.546390+00:00 [warning] <0.15533.0> unexpected event from XML driver: {xmlstreamelement,
{xmlel,<<"presence">>,
[{<<"xmlns">>,<<"jabber:client">>}],
[]}}; xmlstreamstart was expected
2024-02-21 13:35:12.294509+00:00 [warning] <0.17141.0> unexpected event from XML driver: {xmlstreamelement,
{xmlel,<<"presence">>,
[{<<"xmlns">>,<<"jabber:client">>}],
[]}}; xmlstreamstart was expected
2024-02-21 14:18:45.365339+00:00 [warning] <0.18747.0>@ejabberd_c2s:process_auth_result/3:280 (http_bind|<0.18746.0>) Failed c2s HHHHHHHH authentication for r2ce4_20b991a74@customname from 178.139.173.139: Invalid username or password
--- MORE LINES WITH :: Failed c2s & Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer ---
2024-02-21 14:29:25.458126+00:00 [warning] <0.19138.0>@ejabberd_c2s:process_auth_result/3:280 (tcp|<0.19138.0>) Failed c2s HHHHHHHH authentication for r2c4e_48518f402@customname from customip: Invalid username or password
2024-02-21 15:41:45.536850+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with **1072045** messages queued by 7 process(es) (1%) from the following applications: ejabberd (**1072045** msgs); the top processes are:
** <0.22333.0>: msgs = 404701, memory = 1592112560, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22332.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22392.0>: msgs = 383314, memory = 2501936152, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22391.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22343.0>: msgs = 257521, memory = 1070150064, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22342.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22412.0>: msgs = 26502, memory = 424769504, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22332.0>: msgs = 3, memory = 2584558100, initial_call = ejabberd_bosh:init/1, current_function = lists:-filter/2-lc$^0/1-0-/2, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
2024-02-21 15:41:45.537348+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 153149 message(s) each
2024-02-21 15:42:07.958236+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.22412.0> consumes more than 5% of OS memory (msgs = 762077, memory = 5504763272, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)
2024-02-21 15:42:07.968733+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with **762293** messages queued by 3 process(es) (1%) from the following applications: ejabberd (**762293** msgs); the top processes are:
** <0.22412.0>: msgs = 762288, memory = 5505483600, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22332.0>: msgs = 3, memory = 2584555140, initial_call = ejabberd_bosh:init/1, current_function = mnesia:dirty_index_read/3, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22342.0>: msgs = 2, memory = 3721758692, initial_call = ejabberd_bosh:init/1, current_function = mnesia:val/1, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
2024-02-21 15:42:07.968996+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 1 process(es) consuming more than 254098 message(s) each
2024-02-21 15:45:08.022471+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.22411.0> consumes more than 5% of OS memory (msgs = 0, memory = 5359331744, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)
2024-02-21 15:50:10.641543+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.22411.0> consumes more than 5% of OS memory (msgs = 0, memory = 5359331744, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)
--- MORE LINES WITH :: Failed c2s & Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer ---
More info:
Please notice 👀 Where do 1+ MILLION messages come from? How can I see a sample of these "messages"? (Could it be that those "stream reset by peer" events and failed authentications are actually accumulating as undelivered messages? If so, can we forgo those messages and have them purged somehow?)
SQL shows nothing in "last" or "archive", nor any other relevant numbers.
Webadmin shows:
Statistics of ejabberd@localhost
Uptime: 47402.294 (since 2024-02-21 04:13:02)
CPU Time: 1520.227
Online Users: 63
Transactions Committed: 10
Transactions Aborted: 6
Transactions Restarted: 0
Transactions Logged: 4
Database Tables at ejabberd@localhost
Name Elements Memory
archive_msg 0 43,712
archive_prefs 0 43,712
bosh 56 10,952
bytestream 0 2,440
caps_features 0 43,712
ejabberd_commands 82 189,376
last_activity 0 43,712
mod_register_ip 0 2,440
motd 0 43,712
motd_users 0 43,712
mqtt_pub 0 43,712
mqtt_session 0 2,440
mqtt_sub 0 1,112
muc_online_room 0 1,112
muc_registered 0 2,440
muc_room 0 2,440
oauth_client 0 2,440
oauth_token 0 43,712
offline_msg 0 43,712
passwd 0 43,712
privacy 0 43,712
private_storage 0 43,712
pubsub_index 0 2,440
pubsub_item 0 43,712
pubsub_last_item 0 2,440
pubsub_node 0 2,440
pubsub_orphan 0 2,440
pubsub_state 0 1,112
push_session 0 43,712
reg_users_counter 0 2,440
roster 0 43,712
roster_version 0 43,712
route 6 3,512
route_multicast 0 2,440
s2s 0 2,440
schema 43 45,856
session 56 33,920
session_counter 0 2,440
sr_group 0 2,440
sr_user 0 2,440
temporarily_blocked 0 2,440
vcard 0 43,712
vcard_search 0 2,440
Any help would be greatly appreciated in debugging what is going on and why this is happening.
P.S. Also, there's no correlation with failed/closed sessions or any burst in network activity.
Where does mod_bosh_r2webusers come from?
I guess r2webusers is a host that they use.
First thing: is your client able to use WebSocket instead of BOSH? If so, please change it to use WebSocket; BOSH is slower and uses significantly more resources.
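For reference, a minimal sketch of what enabling a WebSocket endpoint could look like on the existing ejabberd_http listener (the port is a placeholder like the redacted one in your config, and the /ws and /bosh paths are just examples); the client would then connect to wss://yourhost:port/ws instead of the BOSH URL:

```yaml
listen:
  -
    port: xxxx            # placeholder, same redacted port as in your config
    module: ejabberd_http
    tls: true
    request_handlers:
      /bosh: mod_bosh         # keep BOSH available during the migration
      /ws: ejabberd_http_ws   # WebSocket endpoint (XMPP over WebSocket, RFC 7395)
```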
To diagnose this you could use the ejabberd debug console. Once you start it, you can see the processes that have big queues by executing p1_prof:q(). in it (the dot at the end is significant, don't skip it). This will list the 10 processes with the biggest queues. To see what messages are queued, copy the pid(..., ..., ...) from the output of that command (it will have three numbers), and execute erlang:process_info(pid(..., ..., ...), messages). with that pid.
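For example, a minimal console session could look like this (the pid numbers below are placeholders copied from the log above; use the ones printed by p1_prof:q().):

```erlang
%% Attach a shell to the running node first:
%%   ejabberdctl debug

%% List the 10 processes with the biggest message queues:
p1_prof:q().

%% Dump what is queued in one of them; note this copies the whole
%% mailbox, so with hundreds of thousands of messages it can be slow:
erlang:process_info(pid(0,22412,0), messages).

%% A few other fields worth checking on the same process:
erlang:process_info(pid(0,22412,0), [message_queue_len, memory, current_function]).
```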
I am guessing those messages will be XMPP packets that need to be delivered to the client, but the BOSH connection is just too slow to deliver them at the rate they arrive.