
Tasks restart when one of the RabbitMQ (broker) nodes goes down or the connection is lost (even for a few seconds)

vlad-stack opened this issue 4 years ago • 4 comments

Checklist

  • [ ] I have verified that the issue exists against the master branch of Celery.
  • [ ] This has already been asked to the discussion group first.
  • [x] I have read the relevant section in the contribution guide on reporting bugs.
  • [x] I have checked the issues list for similar or identical bug reports.
  • [x] I have checked the pull requests list for existing proposed fixes.
  • [x] I have checked the commit log to find out if the bug was already fixed in the master branch.
  • [x] I have included all related issues and possible duplicate issues in this issue (If there are none, check this box anyway).

Mandatory Debugging Information

  • [x] I have included the output of celery -A proj report in the issue. (if you are not able to do this, then at least specify the Celery version affected).
  • [ ] I have verified that the issue exists against the master branch of Celery.
  • [ ] I have included the contents of pip freeze in the issue.
  • [x] I have included all the versions of all the external dependencies required to reproduce this bug.

Optional Debugging Information

  • [ ] I have tried reproducing the issue on more than one Python version and/or implementation.
  • [ ] I have tried reproducing the issue on more than one message broker and/or result backend.
  • [ ] I have tried reproducing the issue on more than one version of the message broker and/or result backend.
  • [x] I have tried reproducing the issue on more than one operating system.
  • [x] I have tried reproducing the issue on more than one workers pool.
  • [ ] I have tried reproducing the issue with autoscaling, retries, ETA/Countdown & rate limits disabled.
  • [ ] I have tried reproducing the issue after downgrading and/or upgrading Celery and its dependencies.

Related Issues and Possible Duplicates

Related Issues

  • https://github.com/celery/celery/issues/3921

Possible Duplicates

  • None

Environment & Settings

Celery version: 5.1.2

celery report Output:


Steps to Reproduce

Required Dependencies

  • Minimal Python Version: N/A or Unknown
  • Minimal Celery Version: N/A or Unknown
  • Minimal Kombu Version: N/A or Unknown
  • Minimal Broker Version: N/A or Unknown
  • Minimal Result Backend Version: N/A or Unknown
  • Minimal OS and/or Kernel Version: N/A or Unknown
  • Minimal Broker Client Version: N/A or Unknown
  • Minimal Result Backend Client Version: N/A or Unknown

Python Packages

pip freeze Output:


Other Dependencies

  • rabbitmq 3.8.22
  • celery 5.1.2


Minimally Reproducible Test Case

celery -A celery_tasks worker -O fair -Q api -l info -c1 --without-gossip --without-mingle

My configs:

app.conf.broker_transport_options = {"visibility_timeout": 259200, "max_retries": 1}
app.conf.task_acks_late = True
app.conf.worker_cancel_long_running_tasks_on_connection_loss = True
app.conf.task_compression = 'gzip'
app.conf.result_compression = 'gzip'
app.conf.result_expires = 60 * 60 * 24 * 4
app.conf.broker_connection_timeout = 60
app.conf.broker_connection_max_retries = 2000
app.conf.broker_failover_strategy = 'round-robin'
app.conf.worker_prefetch_multiplier = 1
app.conf.worker_lost_wait = 120
  1. Start a long-running task (a minimal reproduction sketch is shown after these steps).
  2. Turn off one of the nodes in the RabbitMQ cluster (this assumes an HA, multi-node cluster).
  3. The worker restarts the task without trying to connect to the other nodes or to reconnect to the current one.
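
A minimal sketch of a task that reproduces the problem, assuming the setup above; the module name celery_tasks, the api queue and the settings shown come from this report, while the broker URL, task name and task body are hypothetical:

import time

from celery import Celery

app = Celery("celery_tasks", broker="amqp://guest:guest@rabbit-1:5672//")  # hypothetical broker URL
app.conf.task_acks_late = True
app.conf.worker_cancel_long_running_tasks_on_connection_loss = True

@app.task
def long_running(hours=24):
    # Simulates a day-long job. With late acks the message stays unacknowledged
    # while this sleeps, so a broker connection loss makes RabbitMQ requeue it
    # and the worker cancel and re-run the task.
    time.sleep(hours * 3600)
    return "done"

The task is queued against the same queue the worker consumes from, e.g. long_running.apply_async(queue="api").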

Expected Behavior

As some of our tasks take 24 hours to finish, we expect them to keep running regardless of problems with individual broker nodes in the cluster, and we expect the worker to at least try to reconnect before restarting the tasks.

Maybe provide a config option to wait, or to try reconnecting one more time, before worker_cancel_long_running_tasks_on_connection_loss kicks in?
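
For reference, Celery already accepts a list of broker URLs together with connection-retry settings; a hedged sketch along those lines (hostnames and credentials are made up, and this does not prevent the cancellation described in this issue, because the unacknowledged deliveries are still tied to the dropped connection):

# Hypothetical three-node cluster; hostnames and credentials are placeholders.
app.conf.broker_url = [
    "amqp://user:pass@rabbit-1:5672//",
    "amqp://user:pass@rabbit-2:5672//",
    "amqp://user:pass@rabbit-3:5672//",
]
app.conf.broker_failover_strategy = "round-robin"  # as already set in this report
app.conf.broker_connection_retry = True
app.conf.broker_connection_max_retries = None      # retry forever instead of a fixed limit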

Actual Behavior

The worker loses its connection to the RabbitMQ node for a few seconds and immediately restarts all tasks, even though the other RabbitMQ nodes are online.

vlad-stack avatar Sep 09 '21 11:09 vlad-stack

Hey @vlad-outscraper :wave:, Thank you for opening an issue. We will get back to you as soon as we can. Also, check out our Open Collective and consider backing us - every little helps!

We also offer priority support for our sponsors. If you require immediate assistance please consider sponsoring us.

P.S. We love your solution and we do support you guys. We just started migrating our Celery deployment to a failover setup and ran into these problems.

vlad-stack avatar Sep 09 '21 11:09 vlad-stack

I just happened to come across this issue when searching for a different issue, and I'm afraid this behaviour is just a result of how RabbitMQ works. Since you are using late acks, Celery needs to acknowledge the task when it is done running. Acknowledging a task means sending a command back to RabbitMQ which refers to the original message that was delivered.

The problem is that the reference to the original message is scoped to the TCP connection. Once the connection is lost, it is impossible to acknowledge any messages that were received on that connection. There is nothing Celery can do to avoid this problem. This is why the configuration option worker_cancel_long_running_tasks_on_connection_loss was introduced in the first place.

As a result, late acknowledgement is incompatible with long-running tasks.
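
To illustrate the point outside Celery, here is a minimal sketch using the pika client directly (the host, queue name and manual reconnect are assumptions, not Celery's actual code path), showing that a delivery tag from one connection cannot be acknowledged on another:

import pika

# Hypothetical broker host and queue; assumes a message is already waiting in "api".
params = pika.ConnectionParameters(host="rabbit-1")
conn = pika.BlockingConnection(params)
channel = conn.channel()
method, header, body = channel.basic_get(queue="api", auto_ack=False)

# ... the long-running work would happen here ...

conn.close()  # simulate the connection loss; RabbitMQ requeues the unacked message

# Reconnect and try to acknowledge with the old delivery tag: the tag was scoped
# to the old channel, so RabbitMQ refuses ("PRECONDITION_FAILED - unknown delivery
# tag") and closes this channel; the requeued message gets redelivered instead.
conn2 = pika.BlockingConnection(params)
channel2 = conn2.channel()
channel2.basic_ack(delivery_tag=method.delivery_tag)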

That said, we've been having issues even without late acks. The same task gets queued multiple times: it is selected for execution, but the connection is lost before it can start running (and therefore be acknowledged), so the message is returned to the queue and redelivered…

tobinus avatar Apr 24 '23 08:04 tobinus

Also, with late_ack=False the Celery worker can't survive a lost RabbitMQ connection. It seems that kombu or billiard gets really confused when it tries to reconnect, and in the end the tasks that were running before the connection was lost remain active forever, even after they complete.

lefterisnik avatar Sep 18 '23 20:09 lefterisnik