
HUGE MEM SPIKES / ejabberd_system_monitor:handle_overload : The system is overloaded with

Open nstiac opened this issue 1 year ago • 2 comments

Environment

  • ejabberd version: ejabberd 23.04
  • Erlang version: Erlang/OTP 22 [erts-10.4.4] [source] [64-bit]
  • OS: Linux (Almalinux8) Intel Xeon x64
  • Installed from: source

Configuration (only if needed): grep -Ev '^$|^\s*#' ejabberd.yml

loglevel: 4
shaper:
  normal:
    rate: 3000
    burst_size: 6000
  fast: 100000
shaper_rules:
  max_user_sessions: 10
  max_user_offline_messages:
    5000: xxxx
    100: all
  c2s_shaper:
    none: xxxx
    normal: all
  s2s_shaper: fast
auth_method: sql
host_config:
  "xxxxxx":
    extauth_program: "xxxxxxxxxx"
    auth_method: [external]
    auth_use_cache: false
sql_pool_size: 6
sql_keepalive_interval: 150
listen:
  -
    port: xxxx
    module: ejabberd_c2s
    max_stanza_size: 262144
    shaper: c2s_shaper
    access: c2s
  -
    port: xxxx
    module: ejabberd_http
    tls: true
  -
modules:
  mod_avatar:
  mod_bosh:
    max_inactivity: 150
  mod_caps: {}
  mod_carboncopy: {}
  mod_client_state:
  mod_mam:
  mod_offline:
  mod_ping: {}
  mod_pubsub:
  mod_push: {}
  mod_push_keepalive: {}
  mod_roster:
  mod_shared_roster:
  mod_stats: {}
  mod_stream_mgmt:

ps.- Obvious omissions in the config are for security purposes.

Errors from error.log/crash.log

Nine hours after the dawn restart, with a relatively small internal userbase that functions mainly with pubsub as a "logged in or not" status indicator:

2024-02-21 15:41:45.536850+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with **1072045** messages queued by 7 process(es) (1%) from the following applications: ejabberd (**1072045** msgs); the top processes are:
** <0.22333.0>: msgs = 404701, memory = 1592112560, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22332.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22392.0>: msgs = 383314, memory = 2501936152, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22391.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22343.0>: msgs = 257521, memory = 1070150064, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22342.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22412.0>: msgs = 26502, memory = 424769504, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22332.0>: msgs = 3, memory = 2584558100, initial_call = ejabberd_bosh:init/1, current_function = lists:-filter/2-lc$^0/1-0-/2, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd

2024-02-21 15:41:45.537348+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 153149 message(s) each

Bug description

After upgrading our server to an absurd amount of memory, we have recently been debugging its performance and noticed some HUGE spikes in memory usage (30 GB+ on resident memory, 60 GB+ virtual) coming from beam.smp/ejabberd, whereas its average resident size remains steadily below ~400 MB before and after the spikes.

PID          TID      MINFLT      MAJFLT      VSTEXT      VSLIBS       VDATA      VSTACK      LOCKSZ       VSIZE        RSIZE       PSIZE       VGROW       RGROW      SWAPSZ      RUID          EUID            MEM      CMD        1/72
1862            -      7672e4           0        3.6M        9.6M       58.0G      132.0K        0.0K       62.9G        36.7G          0B       50.4G       31.1G          0B      ejabberd      ejabberd        39%      beam.smp

So, after looking at some logs we came upon these:

2024-02-20 22:55:00.738664+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with **1967163** messages queued by 10 process(es) (2%) from the following applications: ejabberd (**1967163** msgs); the top processes are:
** <0.5005.3>: msgs = 819617, memory = 6331179968, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.5004.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.4927.3>: msgs = 694653, memory = 4190023404, initial_call = xmpp_stream_in:init/1, current_function = p1_server:collect_messages/3, ancestors = [<0.4926.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.5029.3>: msgs = 452852, memory = 3192091880, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.5028.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.4989.3>: msgs = 30, memory = 690304, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.4988.3>,mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.4909.3>: msgs = 3, memory = 3101465796, initial_call = ejabberd_bosh:init/1, current_function = mnesia_lib:db_get/3, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd

2024-02-20 22:55:00.739087+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 196716 message(s) each

2024-02-20 22:56:21.989115+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.5028.3> consumes more than 5% of OS memory (msgs = 0, memory = 6431197880, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_r2webusers,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)

The only other potentially relevant information from that log is that it was filled with: Closing c2s session for ... Stream reset by peer.

2024-02-20 22:47:52.086816+00:00 [info] <0.4826.3>@ejabberd_c2s:process_auth_result/3:272 (tcp|<0.4826.3>) Accepted c2s PLAIN authentication for xxxx@yyyyy by external backend from customip
2024-02-20 22:47:52.087641+00:00 [info] <0.4826.3>@ejabberd_c2s:bind/2:450 (tcp|<0.4826.3>) Opened c2s session for rxxxx@yyyyy/zzzz
2024-02-20 22:47:52.103140+00:00 [info] <0.4826.3>@ejabberd_c2s:process_terminated/2:295 (tcp|<0.4826.3>) Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer

(Could it be that those "Stream reset by peer" events are actually accumulating as undelivered messages? If so, how can we bypass that, given that it is definitely not something that can be handled on the client side, as this is a known issue: can we forgo those messages and have them purged somehow?)


To clean this up a bit, we cleared our logs and restarted the server yesterday at dawn to obtain a cleaner, shorter log. Here is a clean copy of the past 12 hours (as you can see, only a few lines are shown; it was dawn after all, yet the problem still shows):

FULL LOG (not too big, I promise):

2024-02-21 04:13:03.210833+00:00 [info] <0.110.0>@ejabberd_config:load/1:82 Loading configuration from /......../ejabberd.yml
2024-02-21 04:13:03.235022+00:00 [notice] <0.110.0>@ejabberd_logger:set/1:374 Changing loglevel from 'info' to 'warning'
2024-02-21 05:58:55.592409+00:00 [warning] <0.1755.0>@ejabberd_c2s:process_auth_result/3:280 (http_bind|<0.1754.0>) Failed c2s HHHHHHHH authentication for r2c4e_8740f7e2a@customname from 79.116.194.11: Invalid username or password
2024-02-21 06:00:50.374004+00:00 [warning] <0.1775.0>@ejabberd_c2s:process_auth_result/3:280 (tcp|<0.1775.0>) Failed c2s HHHHHHHH authentication for r2c4e_8740f7e2a@customname from customip: Invalid username or password

--- MORE LINES WITH :: Failed c2s & Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer ---

2024-02-21 11:17:55.749593+00:00 [warning] <0.10425.0>@ejabberd_c2s:process_auth_result/3:280 (tcp|<0.10425.0>) Failed c2s HHHHHHHH authentication for r2c4e_06bb92905@customname from customip: Invalid username or password
2024-02-21 11:41:17.019494+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with **33137** messages queued by 3 process(es) (1%) from the following applications: ejabberd (**33137** msgs); the top processes are:
** <0.11261.0>: msgs = 33134, memory = 233377736, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.11260.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.11196.0>: msgs = 2, memory = 34592, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.11195.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.11195.0>: msgs = 1, memory = 1038675412, initial_call = ejabberd_bosh:init/1, current_function = lists:keysearch/3, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd

2024-02-21 11:41:17.019824+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 1 process(es) consuming more than 11046 message(s) each

2024-02-21 12:36:39.426190+00:00 [warning] <0.14383.0>@ejabberd_c2s:process_auth_result/3:280 (http_bind|<0.14382.0>) Failed c2s HHHHHHHH authentication for r2c2a_8ddc25571@customname from 92.9.103.72: Invalid username or password
2024-02-21 12:59:25.546390+00:00 [warning] <0.15533.0> unexpected event from XML driver: {xmlstreamelement,
                                   {xmlel,<<"presence">>,
                                    [{<<"xmlns">>,<<"jabber:client">>}],
                                    []}}; xmlstreamstart was expected
2024-02-21 13:35:12.294509+00:00 [warning] <0.17141.0> unexpected event from XML driver: {xmlstreamelement,
                                   {xmlel,<<"presence">>,
                                    [{<<"xmlns">>,<<"jabber:client">>}],
                                    []}}; xmlstreamstart was expected
2024-02-21 14:18:45.365339+00:00 [warning] <0.18747.0>@ejabberd_c2s:process_auth_result/3:280 (http_bind|<0.18746.0>) Failed c2s HHHHHHHH authentication for r2ce4_20b991a74@customname from 178.139.173.139: Invalid username or password

--- MORE LINES WITH :: Failed c2s & Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer ---

2024-02-21 14:29:25.458126+00:00 [warning] <0.19138.0>@ejabberd_c2s:process_auth_result/3:280 (tcp|<0.19138.0>) Failed c2s HHHHHHHH authentication for r2c4e_48518f402@customname from customip: Invalid username or password
2024-02-21 15:41:45.536850+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with **1072045** messages queued by 7 process(es) (1%) from the following applications: ejabberd (**1072045** msgs); the top processes are:
** <0.22333.0>: msgs = 404701, memory = 1592112560, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22332.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22392.0>: msgs = 383314, memory = 2501936152, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22391.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22343.0>: msgs = 257521, memory = 1070150064, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22342.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22412.0>: msgs = 26502, memory = 424769504, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22332.0>: msgs = 3, memory = 2584558100, initial_call = ejabberd_bosh:init/1, current_function = lists:-filter/2-lc$^0/1-0-/2, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd

2024-02-21 15:41:45.537348+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 3 process(es) consuming more than 153149 message(s) each

2024-02-21 15:42:07.958236+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.22412.0> consumes more than 5% of OS memory (msgs = 762077, memory = 5504763272, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)

2024-02-21 15:42:07.968733+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_overload/2:143 The system is overloaded with **762293** messages queued by 3 process(es) (1%) from the following applications: ejabberd (**762293** msgs); the top processes are:
** <0.22412.0>: msgs = 762288, memory = 5505483600, initial_call = xmpp_stream_in:init/1, current_function = gen:do_call/4, ancestors = [<0.22411.0>,mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22332.0>: msgs = 3, memory = 2584555140, initial_call = ejabberd_bosh:init/1, current_function = mnesia:dirty_index_read/3, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd
** <0.22342.0>: msgs = 2, memory = 3721758692, initial_call = ejabberd_bosh:init/1, current_function = mnesia:val/1, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd

2024-02-21 15:42:07.968996+00:00 [error] <0.86.0>@ejabberd_system_monitor:do_kill/2:295 Killed 1 process(es) consuming more than 254098 message(s) each

2024-02-21 15:45:08.022471+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.22411.0> consumes more than 5% of OS memory (msgs = 0, memory = 5359331744, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)

2024-02-21 15:50:10.641543+00:00 [warning] <0.86.0>@ejabberd_system_monitor:handle_event/2:98 Process <0.22411.0> consumes more than 5% of OS memory (msgs = 0, memory = 5359331744, initial_call = ejabberd_bosh:init/1, current_function = gen:do_call/4, ancestors = [mod_bosh_customname,ejabberd_gen_mod_sup,ejabberd_sup,<0.110.0>], application = ejabberd)

--- MORE LINES WITH :: Failed c2s & Closing c2s session for xxxx@yyyyy/zzzz: Stream reset by peer ---

More info:

Please notice 👀 Where do 1+ MILLION messages come from? How can I see a sample of these "messages"? (Could it be that those "Stream reset by peer" events and failed authentications are actually accumulating as undelivered messages? If so, can we forgo those messages and have them purged somehow?)

SQL shows no "last" or "archive" rows, nor any other relevant numbers.

Webadmin shows:

Statistics of ejabberd@localhost

Uptime: 47402.294 (started 2024-02-21 04:13:02)
CPU Time: 1520.227
Online Users: 63
Transactions Committed: 10
Transactions Aborted: 6
Transactions Restarted: 0
Transactions Logged: 4

Database Tables at ejabberd@localhost

Name                 Elements    Memory
archive_msg          0           43,712
archive_prefs        0           43,712
bosh                 56          10,952
bytestream           0           2,440
caps_features        0           43,712
ejabberd_commands    82          189,376
last_activity        0           43,712
mod_register_ip      0           2,440
motd                 0           43,712
motd_users           0           43,712
mqtt_pub             0           43,712
mqtt_session         0           2,440
mqtt_sub             0           1,112
muc_online_room      0           1,112
muc_registered       0           2,440
muc_room             0           2,440
oauth_client         0           2,440
oauth_token          0           43,712
offline_msg          0           43,712
passwd               0           43,712
privacy              0           43,712
private_storage      0           43,712
pubsub_index         0           2,440
pubsub_item          0           43,712
pubsub_last_item     0           2,440
pubsub_node          0           2,440
pubsub_orphan        0           2,440
pubsub_state         0           1,112
push_session         0           43,712
reg_users_counter    0           2,440
roster               0           43,712
roster_version       0           43,712
route                6           3,512
route_multicast      0           2,440
s2s                  0           2,440
schema               43          45,856
session              56          33,920
session_counter      0           2,440
sr_group             0           2,440
sr_user              0           2,440
temporarily_blocked  0           2,440
vcard                0           43,712
vcard_search         0           2,440

Any help would be greatly appreciated in debugging what is going on and why this is happening.

ps.- Also, the spikes show no correlation with failed/closed sessions or any burst in network activity.

nstiac avatar Feb 21 '24 17:02 nstiac

Where does mod_bosh_r2webusers come from?

licaon-kter avatar Feb 22 '24 08:02 licaon-kter

I guess r2webusers is a host that they use.

First thing: is your client able to use WebSocket instead of BOSH? If so, please change it to use WebSocket; BOSH is slower and uses significantly more resources.
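
For reference, a minimal sketch of what a WebSocket listener could look like in ejabberd.yml (the port number and request paths here are illustrative placeholders, not taken from the configuration above):

listen:
  -
    port: 5443                  # placeholder port, adjust to your setup
    module: ejabberd_http
    tls: true
    request_handlers:
      /ws: ejabberd_http_ws     # WebSocket endpoint, e.g. wss://yourhost:5443/ws
      /bosh: mod_bosh           # BOSH can stay available while clients migrate

Clients would then connect to the /ws URL instead of the BOSH one.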

To diagnose this you could use the ejabberd debug console. When you start it, you can see processes that have big queues by executing p1_prof:q(). in it (the dot at the end is significant, don't skip it). This will list the 10 processes with the biggest queues. To see what messages are queued, copy pid(..., ..., ...) from the output of that previous command (it will have three numbers), and execute erlang:process_info(pid(..., ..., ...), messages). with that pid.
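
For example, a session in the debug console might look like this (a sketch; the pid numbers are hypothetical, borrowed from the log above just for illustration):

%% Start the console with: ejabberdctl debug
1> p1_prof:q().                                    %% lists the 10 processes with the biggest message queues
2> erlang:process_info(pid(0,22412,0), messages).  %% dumps the messages queued by that pid

The second call returns a {messages, [...]} tuple containing the actual queued terms, so you can see whether they are stanzas waiting to be delivered.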

I am guessing those messages will be XMPP packets that need to be delivered to the client, but the BOSH connection is just too slow to deliver them at the rate they arrive.

prefiks avatar Feb 22 '24 09:02 prefiks