celery-singleton

Multiple tasks do not generate singleton locks

Open denisemauldin opened this issue 5 years ago • 12 comments

I'm running two EC2 instances connected to one Redis cache. Each EC2 instance runs a celery beat and 4 workers (celery 4.4.7, celery-singleton 0.3.1). I'm trying to use celery-singleton to queue/run only one copy of each task from whichever celery beat schedules it first. I have 14 tasks configured with base=Singleton and a lock_expiry longer than the task runtime, but I see only 10 singleton lock keys in Redis. Of the 14 tasks, 9 run twice. Three of the 14 take parameters; one of those runs once, the other two run twice.

One of the tasks is configured with @app.task(base=Singleton, lock_expiry=60, raise_on_duplicate=True), yet it still runs twice and does not raise an exception.

What could be going wrong so that my 14 configured CELERY_BEAT_SCHEDULE tasks are generating 23 tasks and 10 singleton locks? Why isn't raise_on_duplicate triggering?

denisemauldin avatar Mar 05 '21 07:03 denisemauldin

Did you find a solution? If still an issue, could you provide a minimal example to reproduce?

steinitzu avatar Mar 31 '21 23:03 steinitzu

I did not find a solution. Reproduction will probably be hard. If you don't have any thoughts on troubleshooting paths I can go down, feel free to close this issue.

denisemauldin avatar Apr 01 '21 01:04 denisemauldin

I'm also in a difficult position to repro, but I have something that at face value looks like a failure of singletons to block duplicate tasks...

jtkaufman737 avatar Jun 18 '21 14:06 jtkaufman737

We ended up generating an additional Redis cache key as a guard for tasks that must not run as duplicates.

denisemauldin avatar Jun 18 '21 14:06 denisemauldin
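(For reference, that kind of guard key usually boils down to Redis's atomic SET with NX/EX. A minimal sketch of the pattern, using an in-memory dict to stand in for Redis so it runs anywhere; in a real deployment the same logic is `redis.set(key, run_id, nx=True, ex=ttl)`. All names here are illustrative, not from denisemauldin's code.)

```python
import time

# In-memory stand-in for Redis: key -> (owner_run_id, expiry_timestamp).
# In production this would be a real Redis instance shared by all workers.
_store = {}

def acquire_guard(key, run_id, ttl):
    """Emulate `SET key run_id NX EX ttl`: succeed only if the key is absent or expired."""
    now = time.time()
    current = _store.get(key)
    if current is not None and current[1] > now:
        return False  # another run already holds the guard
    _store[key] = (run_id, now + ttl)
    return True

def release_guard(key, run_id):
    """Delete the guard, but only if we still own it (avoid clobbering a newer holder)."""
    current = _store.get(key)
    if current is not None and current[0] == run_id:
        del _store[key]

# First caller wins; a concurrent duplicate is refused until release or expiry.
assert acquire_guard("guard:important_task", "run-1", ttl=60) is True
assert acquire_guard("guard:important_task", "run-2", ttl=60) is False
release_guard("guard:important_task", "run-1")
assert acquire_guard("guard:important_task", "run-2", ttl=60) is True
```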

@steinitzu it appears that I can reproduce what should be a singleton function running more than once simultaneously, but before I go farther with a reproducible example or anything want to make sure this isn't the intended behavior - I see functions that should be singletons running simultaneously across separate celery pods. I can't really tell from the docs whether that would be intended behavior or not for separate pods...any guidance appreciated

jtkaufman737 avatar Jun 21 '21 15:06 jtkaufman737

Hey @jtkaufman737 What do you mean by pods exactly?

The locking is all coordinated centrally in Redis, so this shouldn't happen regardless of how many processes you have dispatching jobs or how many workers you have.

steinitzu avatar Jun 21 '21 16:06 steinitzu
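(For context on "coordinated centrally": celery-singleton derives its lock key from the task name and arguments, so identical invocations map to the same Redis key no matter which host dispatches them. A rough sketch of the idea; the library's exact serialization, hash, and key prefix may differ.)

```python
import hashlib
import json

def singleton_lock_key(task_name, args=(), kwargs=None, prefix="SINGLETONLOCK_"):
    """Deterministically derive a lock key from the task signature.

    Identical (task, args, kwargs) triples hash to the same key on every
    host, which is why a single Redis instance can coordinate many workers.
    """
    payload = json.dumps([task_name, list(args), kwargs or {}], sort_keys=True)
    return prefix + hashlib.sha256(payload.encode()).hexdigest()

# The same invocation from two different beats produces the same key, hence one lock.
k1 = singleton_lock_key("sde.tasks.slow_compose_v2")
k2 = singleton_lock_key("sde.tasks.slow_compose_v2")
assert k1 == k2
# Different arguments produce a different key, so those runs don't block each other.
k3 = singleton_lock_key("sde.tasks.slow_compose_v2", args=(42,))
assert k3 != k1
```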

Hi, thanks so much for following up, and sorry for the confusing term. We use several containers for various celery tasks to handle the volume; it's just Docker, but for some reason my team calls them pods. Not sure why lol. But I take it from your answer that it shouldn't matter, so long as they all communicate back to one Redis instance.

jtkaufman737 avatar Jun 23 '21 13:06 jtkaufman737

For whatever it's worth, I've been able to reproduce a few times now by adding logging at the beginning and end of a function to mark start/end, and getting the "FUNCTION STARTING" log multiple times consecutively without the "FUNCTION END" log in between. I know it's not that helpful without a minimal example, but it's a proprietary app; I'll work on getting something more demoable. At this point I've triggered this behavior a few times, at least three yesterday (although that was out of many, many runs, so it seems intermittent).

jtkaufman737 avatar Jun 23 '21 20:06 jtkaufman737

I would also log the task ID for this. Check if the tasks have the same ID or different IDs.

To rule out any issue with celery itself where 2+ workers might consume the same job in parallel.

steinitzu avatar Jun 23 '21 21:06 steinitzu

I think it shouldn't be that... it looks like we already have UUID logging enabled, so one of the things I captured looked like this:

celery-slow_2  | [2021-06-22 20:49:40,963: ERROR/ForkPoolWorker-1] Celery job "sde.tasks.slow_compose_v2" (a1abe82f-df12-4405-b18c-98ad4433f1ee) starting: args=[], kwargs={}
celery-slow_2  | [2021-06-22 20:49:40,963: WARNING/ForkPoolWorker-1] SLOW COMPOSE V2 STARTING AT TIME 20:49:40 > > > > > > >  > > > > > > > >  > >
celery-slow_1  | [2021-06-22 20:50:09,576: ERROR/ForkPoolWorker-1] Celery job "sde.tasks.slow_compose_v2_stream" (10a28cef-05ed-44b6-b1f6-9a1df4b51a6f) finished: retval=None
celery-slow_1  | [2021-06-22 20:50:09,580: ERROR/ForkPoolWorker-1] Celery job "sde.tasks.slow_compose_v2" (da2b2885-635d-4726-b14d-e217ed4de7d8) starting: args=[], kwargs={}
celery-slow_1  | [2021-06-22 20:50:09,581: WARNING/ForkPoolWorker-1] SLOW COMPOSE V2 STARTING AT TIME 20:50:09 > > > > > > >  > > > > > > > >  > >

In both places where it says 'Celery job "sde.tasks.slow_compose_v2" starting', there are different UUIDs and they are in different containers. I'll keep working on a minimal example where you could see this in action. Unless I'm misinterpreting what I'm seeing here, which is possible; figured it was worth sharing for that reason alone.

jtkaufman737 avatar Jun 24 '21 13:06 jtkaufman737

For anyone who has trouble troubleshooting this, I've added a request to add logging: #45

This would improve troubleshooting / debuggability and the info provided with questions and bug reports.

The reason: without logging it's not known for sure whether there's a genuine ID mismatch (logging would help you identify the disparity) or something else. There are too many possibilities.

tony avatar Mar 13 '22 14:03 tony

In our case the duplicate tasks had the same UUID. We finally fixed the issue by changing our Celery settings:

# Set a global 1-hour task timeout. You may override this on individual tasks by decorating it with:
# @app.task(soft_time_limit=<TIME_IN_SECONDS>)
CELERY_TASK_SOFT_TIME_LIMIT = 3600

# Some tasks override this limit as described above. Track the longest timeout value we use:
# CELERY_SINGLETON_LOCK_EXPIRY and the Redis visibility timeout must never be less than this value.
CELERY_LONGEST_SOFT_TIME_LIMIT = 10800

# Expire locks after three hours, which is the longest task time limit.
# https://github.com/steinitzu/celery-singleton#app-configuration
CELERY_SINGLETON_LOCK_EXPIRY = CELERY_LONGEST_SOFT_TIME_LIMIT

# Prevent duplicate task runs when a task takes more than 1 hour
# Needed in addition to above because we use late acknowledgments
# https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html#redis-caveats
CELERY_BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": CELERY_LONGEST_SOFT_TIME_LIMIT}

The above is needed if you set CELERY_TASK_ACKS_LATE = True, which causes long-running tasks to remain in the queue until they finish, instead of "until they start".

If a task takes more than CELERY_TASK_SOFT_TIME_LIMIT (default 1 hour?), either the CELERY_SINGLETON_LOCK_EXPIRY timeout may be reached (default 1 hour?), or the Redis visibility_timeout (default 1 hour) is reached and the task will be executed again by another worker.
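The invariant described above can be stated as a simple check: both the singleton lock expiry and the broker visibility timeout must be at least as long as the longest task runtime, otherwise a still-running task can be re-delivered or its lock can lapse. A hypothetical startup-time sanity check (the function name and messages are illustrative, not a celery-singleton API):

```python
def check_timeout_invariants(longest_soft_time_limit, singleton_lock_expiry, visibility_timeout):
    """Return a list of misconfigurations that can cause duplicate task runs.

    All arguments are in seconds and correspond to CELERY_LONGEST_SOFT_TIME_LIMIT,
    CELERY_SINGLETON_LOCK_EXPIRY, and the Redis broker's visibility_timeout.
    """
    problems = []
    if singleton_lock_expiry < longest_soft_time_limit:
        problems.append("singleton lock may expire while the task is still running")
    if visibility_timeout < longest_soft_time_limit:
        problems.append("broker may redeliver a still-running task (with acks_late)")
    return problems

# Settings from the comment above: everything set to 10800s, so no problems.
assert check_timeout_invariants(10800, 10800, 10800) == []
# A 3-hour task against 1-hour lock expiry and visibility timeout: both failure modes.
assert len(check_timeout_invariants(10800, 3600, 3600)) == 2
```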

So technically this issue can be closed, since it's a bug in Celery settings and not celery-singleton's logic. But I think it would be helpful to document the above since this took quite a while to uncover.

juspence avatar May 23 '22 15:05 juspence