Tasks restart when one of the RabbitMQ (broker) nodes goes down or loses its connection (even for a few seconds)
Checklist
- [ ] I have verified that the issue exists against the `master` branch of Celery.
- [ ] This has already been asked to the discussion group first.
- [x] I have read the relevant section in the contribution guide on reporting bugs.
- [x] I have checked the issues list for similar or identical bug reports.
- [x] I have checked the pull requests list for existing proposed fixes.
- [x] I have checked the commit log to find out if the bug was already fixed in the master branch.
- [x] I have included all related issues and possible duplicate issues in this issue (If there are none, check this box anyway).
Mandatory Debugging Information
- [x] I have included the output of `celery -A proj report` in the issue (if you are not able to do this, then at least specify the Celery version affected).
- [ ] I have verified that the issue exists against the `master` branch of Celery.
- [ ] I have included the contents of `pip freeze` in the issue.
- [x] I have included all the versions of all the external dependencies required to reproduce this bug.
Optional Debugging Information
- [ ] I have tried reproducing the issue on more than one Python version and/or implementation.
- [ ] I have tried reproducing the issue on more than one message broker and/or result backend.
- [ ] I have tried reproducing the issue on more than one version of the message broker and/or result backend.
- [x] I have tried reproducing the issue on more than one operating system.
- [x] I have tried reproducing the issue on more than one workers pool.
- [ ] I have tried reproducing the issue with autoscaling, retries, ETA/Countdown & rate limits disabled.
- [ ] I have tried reproducing the issue after downgrading and/or upgrading Celery and its dependencies.
Related Issues and Possible Duplicates
Related Issues
- https://github.com/celery/celery/issues/3921
Possible Duplicates
- None
Environment & Settings
Celery version: 5.1.2
`celery report` Output:
Steps to Reproduce
Required Dependencies
- Minimal Python Version: N/A or Unknown
- Minimal Celery Version: N/A or Unknown
- Minimal Kombu Version: N/A or Unknown
- Minimal Broker Version: N/A or Unknown
- Minimal Result Backend Version: N/A or Unknown
- Minimal OS and/or Kernel Version: N/A or Unknown
- Minimal Broker Client Version: N/A or Unknown
- Minimal Result Backend Client Version: N/A or Unknown
Python Packages
`pip freeze` Output:
Other Dependencies
- rabbitmq 3.8.22
- celery 5.1.2
Minimally Reproducible Test Case
celery -A celery_tasks worker -O fair -Q api -l info -c1 --without-gossip --without-mingle
My configs:
app.conf.broker_transport_options = {"visibility_timeout": 259200, "max_retries": 1}
app.conf.task_acks_late = True
app.conf.worker_cancel_long_running_tasks_on_connection_loss = True
app.conf.task_compression = 'gzip'
app.conf.result_compression = 'gzip'
app.conf.result_expires = 60 * 60 * 24 * 4
app.conf.broker_connection_timeout = 60
app.conf.broker_connection_max_retries = 2000
app.conf.broker_failover_strategy = 'round-robin'
app.conf.worker_prefetch_multiplier = 1
app.conf.worker_lost_wait = 120
- Start a long-running task (see the sketch below).
- Turn off one of the nodes in the RabbitMQ cluster (an HA cluster setup is assumed).
- The worker restarts the task without trying to connect to the other nodes or to reconnect to the current one.
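A minimal sketch of the long-running task used in step 1 (module name, broker URL and sleep duration are placeholders; our real tasks run for up to 24 hours):

```python
# celery_tasks.py -- minimal sketch used to reproduce the behaviour.
# Broker URL and sleep duration are placeholders.
import time

from celery import Celery

app = Celery("celery_tasks", broker="amqp://user:password@rabbit-node-1:5672//")

app.conf.task_default_queue = "api"
app.conf.task_acks_late = True
app.conf.worker_prefetch_multiplier = 1
app.conf.worker_cancel_long_running_tasks_on_connection_loss = True


@app.task
def long_running(hours=24):
    # Simulate a job that takes many hours; while it sleeps, stop one of the
    # RabbitMQ cluster nodes and the worker restarts the task immediately.
    time.sleep(hours * 3600)
    return "done"
```

Queue it with `long_running.delay(1)` and start the worker with the command above.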
Expected Behavior
Since some of our tasks take 24 hours to finish, we expect them to keep running despite problems with individual broker nodes in the cluster, and we expect the worker at least to try to reconnect before restarting the tasks.
Maybe provide a configuration option to wait, or to try to reconnect one more time, before worker_cancel_long_running_tasks_on_connection_loss kicks in?
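What we have in mind is something along these lines (a sketch built on the failover support that already exists in Kombu; host names are placeholders, and the last option is hypothetical):

```python
# Point the worker at every node in the RabbitMQ cluster so that failover is
# attempted before running tasks are cancelled. Host names are placeholders.
app.conf.broker_url = [
    "amqp://user:password@rabbit-node-1:5672//",
    "amqp://user:password@rabbit-node-2:5672//",
    "amqp://user:password@rabbit-node-3:5672//",
]
app.conf.broker_failover_strategy = "round-robin"

# Keep retrying the broker connection instead of giving up immediately.
app.conf.broker_connection_retry = True
app.conf.broker_connection_max_retries = 2000

# Hypothetical grace period before the cancel-on-connection-loss behaviour
# kicks in -- this setting does NOT exist today, it only illustrates the ask.
# app.conf.worker_cancel_long_running_tasks_on_connection_loss_timeout = 30
```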
Actual Behavior
The worker loses its connection to a RabbitMQ node for a few seconds and immediately restarts all running tasks, even though the other RabbitMQ nodes in the cluster are online.
Hey @vlad-outscraper :wave:, Thank you for opening an issue. We will get back to you as soon as we can. Also, check out our Open Collective and consider backing us - every little helps!
We also offer priority support for our sponsors. If you require immediate assistance please consider sponsoring us.
P.S. We love your project and we do support you guys. We just started migrating our Celery deployment to a failover setup and ran into these problems.
I just happened to come across this issue when searching for a different issue, and I'm afraid this behaviour is just a result of how RabbitMQ works. Since you are using late acks, Celery needs to acknowledge the task when it is done running. Acknowledging a task means sending a command back to RabbitMQ which refers to the original message that was delivered.
The problem is that the reference to the original message is scoped to the TCP connection. Once the connection is lost, it is impossible to acknowledge any messages that were received on that connection. There is nothing Celery can do to avoid this problem. This is why the configuration option worker_cancel_long_running_tasks_on_connection_loss was introduced in the first place.
As a result, late acknowledgement is incompatible with long-running tasks.
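To illustrate the point, here is a rough Kombu sketch (broker URL and queue name are placeholders): the acknowledgement refers to a delivery tag that only exists on the channel the message arrived on, so once that connection is gone the ack can never be sent.

```python
from kombu import Connection

# Rough illustration of why a lost connection makes a pending ack impossible.
# Broker URL and queue name are placeholders.
with Connection("amqp://user:password@rabbit-node-1:5672//") as conn:
    queue = conn.SimpleQueue("api")
    message = queue.get(block=True, timeout=10)

    # ... with acks_late, the task body runs here, possibly for hours ...

    # basic.ack carries a delivery tag scoped to the channel/connection the
    # message was received on. If that connection dies in the meantime, the
    # tag is meaningless on any new connection, RabbitMQ requeues the message,
    # and the task is redelivered to a worker.
    message.ack()
    queue.close()
```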
That said, we've been having issues even without late acks. The same task is queued multiple times: it is selected for running even though the connection was lost before it could start running (and therefore be acknowledged), so the message was delivered back to the queue and redelivered…
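For reference, one possible way to detect these redeliveries inside the task (a sketch only; whether skipping a redelivered message is safe depends on the task, and the `redelivered` flag in `delivery_info` comes from the AMQP transport):

```python
from celery import Celery

app = Celery("celery_tasks")


@app.task(bind=True)
def long_running(self):
    # If RabbitMQ marked this message as redelivered, it may already have run
    # (or still be running) on another worker after a connection loss.
    if (self.request.delivery_info or {}).get("redelivered"):
        return "skipped redelivered message"
    # ... actual work ...
```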
Also, with task_acks_late=False the Celery worker can't survive a lost RabbitMQ connection. It seems that kombu or billiard gets really confused when they try to reconnect, and in the end the tasks that started before the connection was lost remain active forever, even after they complete.