Make GC more thread-safe
Here's my (possibly naive) way to fix #882.
Trying to acquire the GIL from the finalizer led to deadlocks, so this is instead basically option 3 from here. Tests pass locally and this fixes the original issue, but I'm not an expert, so maybe this has other consequences I'm not thinking of. Comments welcome.
Codecov Report
Merging #883 (bdc0edf) into master (5d227fc) will decrease coverage by 0.07%. The diff coverage is 88.23%.
```diff
@@            Coverage Diff             @@
##           master     #883      +/-   ##
==========================================
- Coverage   67.61%   67.54%   -0.08%
==========================================
  Files          20       20
  Lines        1967     1981      +14
==========================================
+ Hits         1330     1338       +8
- Misses        637      643       +6
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 67.54% <88.23%> (-0.08%) | :arrow_down: |
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/pyinit.jl | 82.24% <77.77%> (-0.42%) | :arrow_down: |
| src/PyCall.jl | 70.30% <100.00%> (+0.54%) | :arrow_up: |
| src/pybuffer.jl | 61.44% <100.00%> (-0.46%) | :arrow_down: |
| src/gc.jl | 69.23% <0.00%> (-30.77%) | :arrow_down: |
See my comment here: https://github.com/JuliaPy/PyCall.jl/issues/882#issuecomment-786266352
I think we probably need to copy the FFTW.jl solution here, which was rather painfully developed with the help of @vtjnash.
(I wonder if we should refactor that solution out into a package so that both FFTW.jl and PyCall.jl can use the same code? The logic is pretty subtle to get right. Maybe put the package in https://github.com/JuliaParallel?)
Hmm, I'm not sure we can factor the FFTW.jl solution that cleanly into a package, since trylock(fftwlock) probably needs to be replaced with the GIL lock (or we need to protect every call to Python with our own lock, which would be unfortunate). And there doesn't seem to be any non-blocking analogue of trylock for the GIL? So we might still need to re-think the logic a bit.
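For reference, the trylock-or-defer pattern being referred to here looks roughly like the following. This is a generic sketch (not FFTW.jl's actual code); `resource_lock`, `Handle`, and `destroy!` are illustrative names, and the open question above is what to substitute for `trylock` when the lock in question is the GIL.

```julia
# Generic sketch of the trylock-or-defer finalizer pattern: if the lock
# can't be taken, the finalizer re-registers itself and retries later.
const resource_lock = ReentrantLock()

mutable struct Handle
    ptr::Ptr{Cvoid}
end

destroy!(h::Handle) = (h.ptr = C_NULL)   # stand-in for the real C destroy call

function handle_finalizer(h::Handle)
    if trylock(resource_lock)
        try
            destroy!(h)
        finally
            unlock(resource_lock)
        end
    else
        # Lock is busy: defer by re-registering this finalizer on the object.
        finalizer(handle_finalizer, h)
    end
    return nothing
end

h = Handle(C_NULL)
finalizer(handle_finalizer, h)
```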
Reading the Python documentation some more, it seems like the solution may be simpler: we're not really allowed to call Python from anything other than the main thread unless we do a lot more work. So, in the finalizer, if Threads.threadid() != 1 (we aren't on the main thread), or maybe if the right check is PyGILState_Check() == 1, just push the PyPtr onto a queue of objects to decref (protected by a lock, as in FFTW.jl); otherwise, decref the object and also flush the queue.
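A minimal sketch of that idea, with illustrative names rather than PyCall's actual internals (`decref_ptr` stands in for the real `Py_DecRef` ccall, and in PyCall the finalizer would be registered on the `PyObject` wrapper, passing its `PyPtr` here):

```julia
# Pointers queued by finalizers that ran off the main thread, awaiting decref
# on thread 1. A non-yielding SpinLock is used because finalizers must not
# task-switch.
const DECREF_QUEUE = Ptr{Cvoid}[]
const QUEUE_LOCK = Threads.SpinLock()

# Stand-in for the real decref; in PyCall this would be a ccall into
# libpython's Py_DecRef.
decref_ptr(ptr::Ptr{Cvoid}) = nothing

function pyptr_finalizer(ptr::Ptr{Cvoid})
    if Threads.threadid() != 1
        # Not on the main thread: don't touch the Python C API, just queue
        # the pointer to be decref'd later from thread 1.
        lock(QUEUE_LOCK) do
            push!(DECREF_QUEUE, ptr)
        end
    else
        # On the main thread: decref this object and flush anything queued
        # by finalizers that ran on other threads.
        decref_ptr(ptr)
        lock(QUEUE_LOCK) do
            foreach(decref_ptr, DECREF_QUEUE)
            empty!(DECREF_QUEUE)
        end
    end
    return nothing
end
```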
I'm not sure I like the thought of putting the CPU in a spinloop
Yeah, you're right; I'm showing some of my ignorance on this stuff here.
Re: the last comment, I think I see your point and a simple way to do it, which (I think) basically amounts to option 2 from that Julia doc link instead of option 3. Will give it a go.
Ok, this is quite simple and seems to work.
Actually, sorry, I realize this doesn't actually solve the segfault I was seeing in https://github.com/JuliaPy/PyCall.jl/issues/881#issuecomment-785780082. I was getting confused since that only happens on this one system. So fwiw, I don't have an environment where this PR fixes anything; either it's not needed, or I get those segfaults regardless :shrug: Granted, it still seems like the right thing to do.
Is it sufficient to guarantee that Threads.threadid() == 1? Perhaps we also need to confirm that Base.current_task() === Base.roottask to ensure the stack is in a valid state for Python.
In JavaCall.jl, this is a necessary condition for some operations due to how Julia addresses the stack for each Task and how Java uses signalling. See the JULIA_COPY_STACKS environment variable and ALWAYS_COPY_STACKS option.
https://julialang.org/blog/2019/07/multithreading/#allocating_and_switching_task_stacks
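To make the suggested condition concrete, a hypothetical guard might look like this (whether both checks are actually required for PyCall is exactly the open question; Base.roottask is an internal of the Julia runtime, so this is illustrative only):

```julia
# Only touch the foreign runtime when on thread 1 *and* running on the root
# task's stack (Base.roottask is a Julia internal).
on_root() = Threads.threadid() == 1 && Base.current_task() === Base.roottask

# e.g. a finalizer could decref immediately only when on_root() is true,
# and push onto a deferred queue otherwise.
```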
Fixed the return value. Up to you to decide what to do with this PR. Personally, I'm far from sure what the right solution is, or whether one is even needed at the moment.
What does PythonCall.jl do, @cjdoris?
> What does PythonCall.jl do, @cjdoris?
All the finalisers in PythonCall go via enqueue or enqueue_all here.
These functions either immediately decref the pointer or add it to a list of pointers to decref later. However, this is currently purely controlled by explicit PythonCall.GC.disable()/enable(), and like most other PythonCall functions they should only be called from thread 1. It could be changed to check the thread and only immediately decref on thread 1, which I think is the subject of the current PR, so I'll wait to see how you do it safely!
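For comparison, the decref-now-or-defer pattern described above boils down to something like the following simplified sketch. It corresponds only loosely to PythonCall's enqueue/enqueue_all; `ENABLED`, `QUEUE`, `py_decref`, `decref_or_defer`, and `flush_deferred` are illustrative names, not PythonCall's exact internals.

```julia
const ENABLED = Ref(true)             # toggled by GC.disable() / GC.enable()
const QUEUE = Ptr{Cvoid}[]            # pointers waiting to be decref'd later

py_decref(ptr::Ptr{Cvoid}) = nothing  # stand-in for the real Py_DecRef ccall

function decref_or_defer(ptr::Ptr{Cvoid})
    if ENABLED[]
        py_decref(ptr)        # GC enabled: decref immediately (thread 1 only)
    else
        push!(QUEUE, ptr)     # GC disabled: remember the pointer for later
    end
    return nothing
end

function flush_deferred()
    # Called when GC is re-enabled: decref everything that was deferred.
    foreach(py_decref, QUEUE)
    empty!(QUEUE)
    return nothing
end
```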
> So fwiw, I don't have an environment where this PR fixes anything; either it's not needed, or I get those segfaults regardless
@marius311, is there any update on whether this PR observably fixes anything?
I'm guessing that it may be more crucial now that Julia's GC is multi-threaded (https://github.com/JuliaLang/julia/pull/48600) in Julia 1.10.
I've been seeing a lot more segfaults on recent Julia + Python versions (x-ref #1072) so I was wondering if I could help revive this PR – @marius311 are you up for rebasing this and getting it ready for some new reviews?
I also think that even though this PR might not be the optimal GC strategy, at least it's safe. We can always improve it later.