PyCall.jl icon indicating copy to clipboard operation
PyCall.jl copied to clipboard

Make GC more thread-safe

Open marius311 opened this issue 4 years ago • 16 comments

Here's my maybe naive way to fix #882.

Trying to aquire the GIL from the finalizer led to deadlocks. This is instead basically option 3 from here. Tests pass locally and this fixes the original issue, but I'm not an expert so maybe this has other consequences I'm not thinking of. Comments welcome.

marius311 avatar Feb 25 '21 21:02 marius311

Codecov Report

Merging #883 (bdc0edf) into master (5d227fc) will decrease coverage by 0.07%. The diff coverage is 88.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #883      +/-   ##
==========================================
- Coverage   67.61%   67.54%   -0.08%     
==========================================
  Files          20       20              
  Lines        1967     1981      +14     
==========================================
+ Hits         1330     1338       +8     
- Misses        637      643       +6     
Flag Coverage Δ
unittests 67.54% <88.23%> (-0.08%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/pyinit.jl 82.24% <77.77%> (-0.42%) :arrow_down:
src/PyCall.jl 70.30% <100.00%> (+0.54%) :arrow_up:
src/pybuffer.jl 61.44% <100.00%> (-0.46%) :arrow_down:
src/gc.jl 69.23% <0.00%> (-30.77%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 5d227fc...bdc0edf. Read the comment docs.

codecov-io avatar Feb 25 '21 21:02 codecov-io

See my comment here: https://github.com/JuliaPy/PyCall.jl/issues/882#issuecomment-786266352

I think we probably need to copy the FFTW.jl solution here, which was rather painfully developed with the help of @vtjnash.

(I wonder if we should refactor that solution out into a package so that both FFTW.jl and PyCall.jl can use the same code? The logic is pretty subtle to get right. Maybe put the package in https://github.com/JuliaParallel?)

stevengj avatar Feb 25 '21 22:02 stevengj

Hmm, I'm not sure we can factor the FFTW.jl solution that cleanly into a package, since trylock(fftwlock) probably needs to be replaced with the GIL lock (or we need to protect every call to Python with our own lock, which would be unfortunate). Unfortunately, there doesn't seem to be any non-blocking analogue of trylock for the GIL? So, we might still need to re-think the logic a bit.

stevengj avatar Feb 25 '21 23:02 stevengj

Reading the Python documentation some more, it seems like the solution may be simpler — we're not really allowed to call Python from anything other than the main thread unless we do a lot more work, so if Threads.threadid() != 1 (we aren't in the main thread) (or maybe check PyGILState_Check() == 1?) in the finalizer, just push the PyPtr to a queue of objects to decref (protected by a lock as in FFTW.jl), and otherwise both decref the objects and flush the queue.

stevengj avatar Feb 25 '21 23:02 stevengj

I'm not sure I like the thought of putting the CPU in a spinloop

Yea youre right, I'm showing some of my ignorance on this stuff here.

Re: the last comment, I think I see your point and maybe a simple way to do it, also basically amounts (I think) to option two from that Julia doc link instead of option 3. Will give it a go.

marius311 avatar Feb 26 '21 00:02 marius311

Ok, this is quite simple and seems to work.

marius311 avatar Feb 26 '21 01:02 marius311

Actually, sorry, I realize this doesn't actually solve the segfault I was seeing in https://github.com/JuliaPy/PyCall.jl/issues/881#issuecomment-785780082. I was getting confused since that only happens on this one system. So fwiw, I don't have an evironment where this PR fixes anything, either its not needed, or I get those segfaults regardless :shrug: granted it seems like a right thing to do.

marius311 avatar Feb 26 '21 03:02 marius311

Is it sufficient to guarantee that Thread.threadid() == 1? Perhaps we also need confirm that Base.current_task() === Base.roottask to ensure the stack is in a valid state for Python.

In JavaCall.jl, this is a necessary condition for some operations due to how Julia addresses the stack for each Task and how Java uses signalling. See the JULIA_COPY_STACKS environment variable and ALWAYS_COPY_STACKS option. https://julialang.org/blog/2019/07/multithreading/#allocating_and_switching_task_stacks

mkitti avatar Feb 26 '21 09:02 mkitti

Fixed the return value. Up to you to decide what to do with this PR. Personally I'm far from sure on what the right solution is, if one is needed at the moment at all .

marius311 avatar Feb 26 '21 20:02 marius311

What does PythonCall.jl do, @cjdoris?

stevengj avatar Aug 16 '23 23:08 stevengj

What does PythonCall.jl do, @cjdoris?

All the finalisers in PythonCall go via enqueue or enqueue_all here.

These functions either immediately decref the pointer, or add it to a list of pointers to decref later. However, this is currently purely controlled by explicit PythonCall.GC.disable()/enable(), and like most other PythonCall functions should only be called from thread 1. It could be changed to check the thread and only immediately decref on thread 1 - which I think is the subject of the current PR, so I'll wait to see how you do it safely!

cjdoris avatar Aug 19 '23 17:08 cjdoris

So fwiw, I don't have an environment where this PR fixes anything, either its not needed, or I get those segfaults regardless

@marius311, is there any update on whether this PR observably fixes anything?

I'm guessing that it may be more crucial now that Julia's GC is multi-threaded (https://github.com/JuliaLang/julia/pull/48600) in Julia 1.10.

stevengj avatar Aug 19 '23 18:08 stevengj

I've been seeing a lot more segfaults on recent Julia + Python versions (x-ref #1072) so I was wondering if I could help revive this PR – @marius311 are you up for rebasing this and getting it ready for some new reviews?

MilesCranmer avatar Jan 08 '24 18:01 MilesCranmer

I also think even though this PR might not be the optimal GC strategy, at least it's safe. We can always improve it later.

MilesCranmer avatar Jan 08 '24 18:01 MilesCranmer