Messages hang in pending state when database connection fails
Describe the bug
On my Postal 3.3.4 installation, the database sometimes loses its connection:
Jul 10 08:55:44 post mariadbd[626]: 2024-07-10 6:55:44+0000 482 [Warning] Aborted connection 482 to db: 'unconnected' user: 'postal' host: 'localhost' (Got timeout reading communication packets)
That needs to be fixed on my server... but Postal is not able to handle it either. The message the worker is trying to send hangs in the pending state after this:
2024-07-10 06:55:45 +0000 DEBUG acquired task role by renewing it component=worker thread=tasks
2024-07-10 06:55:45 +0000 INFO running task component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:55:45 +0000 INFO scheduling task to next run at 2024-07-10 06:56:45 UTC component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:56:45 +0000 DEBUG acquired task role by renewing it component=worker thread=tasks
2024-07-10 06:56:45 +0000 INFO running task component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:56:45 +0000 INFO scheduling task to next run at 2024-07-10 06:57:45 UTC component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:57:03 +0000 INFO starting message unqueue component=worker thread=work0 original_queued_message=1134326
2024-07-10 06:57:03 +0000 INFO queue latency is 2.470646619796753s component=worker thread=work0 original_queued_message=1134326
2024-07-10 06:57:03 +0000 ERROR Mysql2::Error::ConnectionError (MySQL server has gone away) component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `_query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `block in query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:150:in `handle_interrupt' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:150:in `query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/lib/postal/message_db/database.rb:326:in `query_on_connection' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/lib/postal/message_db/database.rb:318:in `block in query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/lib/postal/message_db/connection_pool.rb:20:in `use' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/lib/postal/message_db/database.rb:345:in `with_mysql' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/lib/postal/message_db/database.rb:317:in `query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/lib/postal/message_db/database.rb:177:in `select' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/lib/postal/message_db/message.rb:12:in `find_one' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/lib/postal/message_db/database.rb:47:in `message' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/app/models/concerns/has_message.rb:12:in `message' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:42:in `check_message_exists' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:16:in `block (2 levels) in process' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/app/lib/message_dequeuer/base.rb:38:in `catch_stops' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:14:in `block in process' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/group_set.rb:18:in `call_without_id' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/logger.rb:72:in `tagged' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:11:in `process' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/message_dequeuer.rb:9:in `process' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/jobs/process_queued_messages_job.rb:67:in `block in process_messages' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /usr/local/bundle/gems/activerecord-7.0.8.1/lib/active_record/relation/delegation.rb:88:in `each' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /usr/local/bundle/gems/activerecord-7.0.8.1/lib/active_record/relation/delegation.rb:88:in `each' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/jobs/process_queued_messages_job.rb:65:in `process_messages' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/jobs/process_queued_messages_job.rb:14:in `call' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:164:in `block (4 levels) in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /usr/local/lib/ruby/3.2.0/benchmark.rb:311:in `realtime' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:164:in `block (3 levels) in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:285:in `capture_errors' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:161:in `block (2 levels) in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:160:in `each' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:160:in `block in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /usr/local/bundle/gems/activerecord-7.0.8.1/lib/active_record/connection_adapters/abstract/connection_pool.rb:215:in `with_connection' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:159:in `work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:142:in `block (3 levels) in start_work_thread' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:141:in `loop' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:141:in `block (2 levels) in start_work_thread' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/group_set.rb:18:in `call_without_id' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/logger.rb:72:in `tagged' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR /opt/postal/app/app/lib/worker/process.rb:139:in `block in start_work_thread' component=worker thread=work0
2024-07-10 06:57:06 +0000 INFO starting message unqueue component=worker thread=work1 original_queued_message=1134327
2024-07-10 06:57:06 +0000 INFO queue latency is 4.297773838043213s component=worker thread=work1 original_queued_message=1134327
2024-07-10 06:57:06 +0000 INFO found 16 associated messages to process at the same time batch_key=outgoing-gmail.com component=worker thread=work1 original_queued_message=1134327
2024-07-10 06:57:06 +0000 INFO parsing message content as it hasn't been parsed before component=worker thread=work1 original_queued_message=1134327 queued_message=1134327
2024-07-10 06:57:06 +0000 INFO Connected to <CUT> log_id=Y75FMTZH component=worker thread=work1 original_queued_message=1134327 queued_message=1134327
2024-07-10 06:57:06 +0000 INFO Sending message 1::4766770 to <CUT> log_id=Y75FMTZH component=worker thread=work1 original_queued_message=1134327 queued_message=1134327
2024-07-10 06:57:06 +0000 INFO Accepted by <CUT> log_id=Y75FMTZH component=worker thread=work1 original_queued_message=1134327 queued_message=1134327
Now the message hangs in pending state because it is still locked by the failing worker process:
select id,message_id,locked_by,locked_at,retry_after,created_at,updated_at from queued_messages where message_id=4766769;
+---------+------------+--------------------------------------------+----------------------------+-------------+----------------------------+----------------------------+
| id | message_id | locked_by | locked_at | retry_after | created_at | updated_at |
+---------+------------+--------------------------------------------+----------------------------+-------------+----------------------------+----------------------------+
| 1134326 | 4766769 | host:mail pid:9 thread:12 22c1cddf97171ce2 | 2024-07-10 06:57:03.968458 | NULL | 2024-07-10 06:57:01.510895 | 2024-07-10 06:57:01.510895 |
+---------+------------+--------------------------------------------+----------------------------+-------------+----------------------------+----------------------------+
1 row in set (0.001 sec)
The only way to get it through now is to set the locked_at and locked_by fields to NULL.
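For reference, a minimal sketch of that manual fix against the queued_messages table shown above (untested beyond my own instance; the id is from my query, adjust it for your own stuck message):

-- Manually release the stale lock so another worker can pick the message up again.
-- The id comes from the queued_messages query above; adjust for your own row.
UPDATE queued_messages
   SET locked_by = NULL,
       locked_at = NULL
 WHERE id = 1134326;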
Expected behaviour
The worker should release the lock properly so the message can be picked up by another process once the database is back.
Environment details
- OS: Ubuntu 22.04.4 LTS
- Version 3.3.4
I think this has been a problem since forever, and I am not sure there is an easy solution. It is probably easier to solve now that the queue is in the database rather than RabbitMQ.
Many people find they need to run multiple workers to consume their queue in a timely manner, so process identification is tricky: the workers are not able to communicate with each other, and I do not think Docker would be able to tell each worker about every other worker.
One solution could be a configurable lock expiry, so that if locked_at is more than, say, an hour old, the message is picked up again.
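As a rough, hypothetical sketch of that idea (column names taken from the queued_messages table above; the one-hour threshold would be the configurable part), a periodic cleanup could simply release stale locks:

-- Hypothetical sketch: release any lock older than the configured threshold
-- (one hour here) so the message can be picked up and retried by another worker.
UPDATE queued_messages
   SET locked_by = NULL,
       locked_at = NULL
 WHERE locked_at IS NOT NULL
   AND locked_at < NOW() - INTERVAL 1 HOUR;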
Thank you again for the quick responses, really appreciate it. Might it be an idea to give the database layer some kind of retry mechanism?
https://github.com/postalserver/postal/blob/da90e75036c27482699921613d838f4058a100e7/lib/postal/message_db/database.rb
Disclaimer: untested code not written by a human ;p
def query(query)
  retries = 0
  begin
    with_mysql do |mysql|
      query_on_connection(mysql, query)
    end
  rescue Mysql2::Error::ConnectionError => e
    # Limit to a single retry on a lost connection; this assumes with_mysql
    # yields a usable connection on the next attempt (untested).
    if retries < 1
      retries += 1
      logger.warn "Connection error on query, retrying... (retry #{retries} of 1)"
      retry
    else
      logger.error "Repeated connection errors on query, giving up. Error: #{e.message}"
      raise
    end
  end
end
Seems to be a duplicate of #3023
+1 - this issue also happens to me
postal logs:
postal-web-1 | 2024-09-26 20:52:15 +0000 ERROR POST /api/v1/send/message (500) event=request transaction=[snip] controller=LegacyAPI::SendController action=message format=html method=POST path=/api/v1/send/message request_id=[snip] ip_address=[snip] status=500 db_runtime=6.752912521362305 exception_class=Mysql2::Error::ConnectionError exception_message=MySQL server has gone away exception_backtrace=/usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `_query'\n/usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `block in query'\n/usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:150:in `handle_interrupt'\n/usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:150:in `query'
mariadb logs:
[Warning] Aborted connection 34409 to db: 'unconnected' user: 'x' host: 'localhost' (Got timeout reading communication packets)
Can confirm that I am experiencing the same issue. Thanks to @Ronnnn for the workaround of setting the locked_at and locked_by fields to NULL.