python-task-queue

LocalTaskQueue Hanging

Open william-silversmith opened this issue 6 years ago • 4 comments

Some people are reporting that parallel operation hangs. We'll have to figure out how to reproduce it.

william-silversmith, Sep 20 '19 01:09

It seems one cause of this can be mixing threads with forked processes: a lock that is held at fork time gets copied into the child's memory in its locked state, and the thread that would release it does not exist in the child.
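To make the mechanism concrete, here is a minimal, POSIX-only sketch using only the standard library. It is not taken from python-task-queue; the function name `child_can_acquire_after_fork` is made up for illustration. A background thread holds a lock while the parent forks, and the child then tries to take its copy of that lock with a timeout so the demo itself cannot hang.

```python
import os
import threading


def child_can_acquire_after_fork(timeout=1.0):
    """Fork while another thread holds a lock; report whether the child can acquire it.

    Hypothetical helper for illustration only. Requires os.fork (POSIX).
    """
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        # Hold the lock until the parent tells us to let go.
        with lock:
            held.set()
            release.wait()

    t = threading.Thread(target=holder)
    t.start()
    held.wait()  # the background thread now owns the lock

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied, so nobody here
        # will ever release the inherited, still-locked lock.
        acquired = lock.acquire(timeout=timeout)
        os._exit(0 if acquired else 1)

    # Parent: collect the child's verdict, then clean up the holder thread.
    _, status = os.waitpid(pid, 0)
    release.set()
    t.join()
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print(child_can_acquire_after_fork(timeout=0.5))
```

The child is expected to time out rather than acquire the lock. Without the timeout it would block forever, which matches the kind of hang described here. Recent Python versions may also emit a DeprecationWarning when forking a multi-threaded process.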

william-silversmith, Jul 24 '20 05:07

I think this is perhaps related: on a run of ~1.1M tasks on a local fq, RAM gradually fills up over time. I don't think this is a practical issue (after over a million task completions, I was only seeing ~140GB of RAM used, so this is a VERY gradual leak, if it is indeed a leak). I don't think this has anything to do with the Igneous tasks that spawned the queue, since the memory per job there would have filled up RAM much more rapidly if it weren't being deallocated, so I suspect this is something queue-side. I wish I had more details for you, but the execution isn't running anymore. I still have the queue filesystem and am happy to do some digging there if helpful!

If nothing else, hopefully this gives you a feel for the rate of the memory growth. Anyhow, feel free to ignore this! Just wanted to give you an extra data point.
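For a rough sense of scale, the two figures above can be turned into a per-task growth rate. This is only back-of-the-envelope arithmetic on the approximate numbers reported here, assuming decimal gigabytes. The function name is made up for illustration.

```python
def growth_per_task(total_bytes, n_tasks):
    """Average memory growth per completed task, in bytes."""
    return total_bytes / n_tasks


ram_used_bytes = 140e9    # ~140 GB as reported (decimal GB assumed)
tasks_completed = 1.1e6   # ~1.1M task completions as reported

per_task = growth_per_task(ram_used_bytes, tasks_completed)
per_task_kb = per_task / 1000    # decimal kilobytes
per_task_kib = per_task / 1024   # binary kibibytes

print(f"{per_task_kb:.0f} kB (or {per_task_kib:.0f} KiB) per task")
```

Either way the estimate lands well above 100 kB per task, which is small per task but far more than a few stray bytes.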

j6k4m8, Feb 09 '22 17:02

Thanks Jordan! That's still roughly 125 kB per task (140 GB over 1.1M tasks), not just a few bytes. I can look into this myself at some point soon, but if you have time, would you mind running a taskqueue with an empty task under mprof (memory profiler) and posting the .dat file? That's usually where I would start.
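As a complementary, stdlib-only check (a swapped-in technique, not mprof itself), something like the sketch below can estimate how many bytes of Python allocations survive each call of an empty task. The names `noop_task` and `traced_growth_per_call` are hypothetical. `tracemalloc` only sees allocations made through Python's allocators in the current process, so it would not capture growth in child processes or in native libraries.

```python
import tracemalloc


def noop_task():
    """Hypothetical empty task: does no work and keeps no references."""
    return None


def traced_growth_per_call(task, n_calls=10_000):
    """Average net growth in traced Python allocations per call, in bytes."""
    tracemalloc.start()
    try:
        task()  # warm-up call so one-time allocations are not counted
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(n_calls):
            task()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return (after - before) / n_calls


if __name__ == "__main__":
    print(f"{traced_growth_per_call(noop_task):.1f} bytes per call")
```

A result near zero for the empty task would suggest that any growth comes from elsewhere, such as the queueing machinery or the worker processes. That is the kind of question the mprof run is meant to answer at the whole-process level.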

Someone once reported a similar but critical issue with Igneous that I wasn't able to reproduce. I suspected it had something to do with their system configuration. They did say that using a different filesystem didn't help though.

https://github.com/seung-lab/igneous/issues/79

william-silversmith, Feb 09 '22 18:02

Unfortunately, it doesn't look like the mprof output has much to say. Here's a profile of another set of workers I spawned on the same job where I saw the issue (i.e., a younger set of workers, but otherwise the exact same conditions). It's mostly just a monotonic increase in memory.

mprof.dat.txt
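To put a number on "monotonic increase", one option is to fit a straight line to the samples in the .dat file. The sketch below assumes the file contains sample lines of the form `MEM <MiB> <unix timestamp>` and ignores everything else. That format is an assumption about mprof's output, not something confirmed in this thread. The function name and the sample data are made up for illustration.

```python
def mprof_growth_rate(dat_text):
    """Least-squares slope of memory over time, in MiB per second.

    Assumes mprof-style sample lines: 'MEM <MiB> <unix timestamp>'.
    Any other lines are skipped.
    """
    samples = []
    for line in dat_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "MEM":
            samples.append((float(parts[2]), float(parts[1])))
    if len(samples) < 2:
        raise ValueError("need at least two MEM samples")

    t0 = samples[0][0]
    ts = [t - t0 for t, _ in samples]   # seconds since first sample
    ms = [m for _, m in samples]        # memory in MiB
    mean_t = sum(ts) / len(ts)
    mean_m = sum(ms) / len(ms)
    var = sum((t - mean_t) ** 2 for t in ts)
    if var == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in zip(ts, ms))
    return cov / var


# Synthetic example data, NOT taken from the attached file.
example = """CMDLINE python worker.py
MEM 100.0 1644500000.0
MEM 101.0 1644500001.0
MEM 102.0 1644500002.0
"""

if __name__ == "__main__":
    print(f"{mprof_growth_rate(example):.3f} MiB/s")
```

Dividing the slope by the task completion rate gives bytes per task, which can be compared against the whole-run estimate.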

I will run mprof on a completely empty task on the same machine, under the same conditions, the next time I'm on it (probably later this week!).

j6k4m8, Feb 10 '22 14:02