Break Artist._remove_method reference cycle
PR summary
I think I've found a reference cycle that can consume quite a bit of memory. The amount of memory reachable from the cycle seems to grow linearly with the number of points I've plotted, and Figure.clear() does not release it. gc.collect() would free this memory, but in performance-critical applications, waiting for the GC is no fun.
The reference cycle looks like this:
```
 Axes -> _children -> Artist (Line2D for example)
   ^                      |
    \                     | field: "_remove_method"
     \                    v
      \------------ _children.remove
```
(credit to refcycle for helping me find it)
This PR breaks the cycle by emptying out the Axes._children list instead of replacing it with an empty list. This removes the reference from the list to the Artist, allowing the Artist's reference count to reach 0.
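To make the mechanism concrete, here is a minimal stand-in in plain Python (no matplotlib involved; FakeArtist, make_cycle, and the other names below are purely illustrative):

```python
import gc
import weakref


class FakeArtist:
    pass


def make_cycle():
    children = []
    artist = FakeArtist()
    children.append(artist)
    # bound method -> children -> artist -> _remove_method -> bound method ...
    artist._remove_method = children.remove
    return children, weakref.ref(artist)


gc.disable()

# Replacing the list with a new empty one (the pre-PR behaviour described
# above) leaves the old list alive through artist._remove_method, so the
# cycle survives until gc.collect() runs.
children, artist_ref = make_cycle()
children = []
assert artist_ref() is not None

# Emptying the list in place removes the list -> artist reference, so plain
# reference counting frees the artist immediately, with no GC needed.
children, artist_ref = make_cycle()
children.clear()
assert artist_ref() is None

gc.enable()
```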
I've found this method to be an effective workaround (without modifying matplotlib):

```python
def real_clear(fig):
    artists = fig.get_children()
    for ax in fig.axes:
        artists.extend(ax.get_children())
    for artist in artists:
        try:
            artist.remove()
        except NotImplementedError:
            pass
    fig.clear()
```
But I would like to contribute this memory usage improvement back into matplotlib, though I'm not sure of the best way. This PR is the smallest and simplest change, but I'm not certain whether it covers enough similar reference cycles or just the one that showed up in my profiling. I also don't know if this is a clean way to solve the problem, because I'm not very familiar with matplotlib internals. I considered making Artist._remove_method a weakref, but that seems like a larger change that I'm not ready for (yet?). How do you think we should approach this reference cycle?
Note that this PR branches from https://github.com/matplotlib/matplotlib/commit/a47e26bd8583f0fcacc1dd83c3f2ac39b1f7a091 (not latest main) because of the issues I encountered with the next commit (See https://github.com/matplotlib/matplotlib/issues/28866 and https://github.com/matplotlib/matplotlib/commit/597554db667344c4c4ad026ec1c6dd5f4c688d8f#commitcomment-147024800).
Here is the benchmark script that I've been using to evaluate the effectiveness of these changes:
```python
import gc

import matplotlib
matplotlib.use("Agg")
from matplotlib.figure import Figure
from numpy.random import rand
import psutil


def remove_artists(fig):
    artists = fig.get_children()
    for ax in fig.axes:
        artists.extend(ax.get_children())
    for artist in artists:
        try:
            artist.remove()
        except NotImplementedError:
            pass


def plot(num_points, rm_artists):
    fig = Figure()
    ax = fig.add_subplot()
    ax.plot(rand(num_points))
    if rm_artists:
        remove_artists(fig)
    fig.clear()


def delta_mb_after_plot(num_points, rm_artists):
    proc = psutil.Process()
    before_mb = proc.memory_info().rss / 1e6
    plot(num_points, rm_artists)
    after_mb = proc.memory_info().rss / 1e6
    gc.collect()
    # after_gc_mb = proc.memory_info().rss / 1e6
    return after_mb - before_mb


def main():
    gc.disable()
    baseline = delta_mb_after_plot(1_000_000, False)
    with_rm_artists = delta_mb_after_plot(1_000_000, True)
    print(f"1 million points: {round(baseline - with_rm_artists, 3)}MB")

    baseline = delta_mb_after_plot(10_000_000, False)
    with_rm_artists = delta_mb_after_plot(10_000_000, True)
    print(f"10 million points: {round(baseline - with_rm_artists, 3)}MB")


if __name__ == "__main__":
    main()
```
Running the benchmark before this change prints (showing that the workaround breaks the cycle):

```
1 million points: 48.263MB
10 million points: 479.572MB
```
And after (showing that the PR behaves similarly to the workaround):

```
1 million points: 0.242MB
10 million points: -0.606MB
```
PR checklist
- [x] [N/A I don't have an issue to link to] "closes #0000" is in the body of the PR description to link the related issue
- [x] [Test added] new and changed code is tested
- [x] [N/A No new features] Plotting related features are demonstrated in an example
- [x] [N/A No new features or API changes] New Features and API Changes are noted with a directive and release note
- [x] [N/A No documentation changes needed] Documentation complies with general and docstring guidelines
Thank you for this. The number of reference loops we have is a known problem that we are always open to improving! The reason for this bit of code is exactly this (to break the cycle between the Artist and its parents).
I think we are doing it with the swap like this to reduce the amount of time we are in an invalid state (we say that we are not thread safe, but that does not mean we should intentionally drive into invalid states). The invalid state I'm worried about here is the case where there are artists in the children list that have had their parents cleared, which will cause draws to fail. I also have a concern that there may be other references to the list object floating around.
I would propose a slight variation on this change where we keep the swap but add old_children.clear(), which I think still breaks the cycle that needs to be broken with a smaller change.
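Something along these lines (a rough sketch using a stripped-down stand-in for Axes; the real clearing code does much more than this):

```python
class MiniAxes:
    """Illustrative stand-in for Axes, only to show the clearing pattern."""

    def __init__(self):
        self._children = []

    def clear(self):
        old_children = self._children
        self._children = []        # keep the swap so the Axes never exposes
                                   # a half-cleared children list
        for chld in old_children:
            chld.axes = None       # detach the artist from its parents, as
            chld.figure = None     # the existing code already does
        old_children.clear()       # new: drop the old list -> Artist refs,
                                   # breaking the _remove_method cycle without
                                   # waiting for the garbage collector
```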
Should we also set chld._remove_method = None inside the loop before we clear the list? I'm asking because someone could still have a reference to the child and try to call remove on it, which would raise ValueError after the Axes has been cleared, and that is new behavior.
> Should we also set chld._remove_method = None inside the loop before we clear the list?
Yes please.
Looks like the new test is failing on python v3.13.
Relevant logs:
```
Fatal Python error: Segmentation fault

Current thread 0x00007fefd9bdab80 (most recent call first):
  File "/home/runner/work/matplotlib/matplotlib/lib/matplotlib/tests/test_axes.py", line 9201 in is_in_reference_cycle
  File "/home/runner/work/matplotlib/matplotlib/lib/matplotlib/tests/test_axes.py", line 9226 in test_axes_clear_reference_cycle
```
I wonder if the GC is running at the same time as is_in_reference_cycle? I'll try gc.disable() during the graph traversal and see if that changes anything
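For reference, the traversal is essentially this shape (a simplified sketch, not the exact test code): walk gc.get_referents() outward from the object and report whether the object is reachable from itself.

```python
import gc


def is_in_reference_cycle(start):
    # Depth-first walk over gc-tracked referents; True if we can get back
    # to `start`, i.e. `start` participates in a reference cycle.
    seen = set()
    stack = [start]
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        for referent in gc.get_referents(obj):
            if referent is start:
                return True
            stack.append(referent)
    return False
```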
I think you have found a CPython bug as you should not be able to segfault Python from Python.
and it looks like the scatter got dropped from the test at some point.
> and it looks like the scatter got dropped from the test at some point.
I still see ax.scatter on line 9216. I wonder if GitHub is having trouble syncing?
> I think you have found a CPython bug as you should not be able to segfault Python from Python.
I'll work on a minimal repro to submit to CPython then! What should we do with this test and this pull request while that's happening?
I decided to skip the test on non-final versions of 3.13.0 because https://github.com/python/cpython/issues/124538 was marked as a release blocker. So, we should be good to re-enable this test on 3.13.0-final and later.
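The skip looks roughly like this (a sketch; the exact condition in the test file may be phrased differently):

```python
import sys

import pytest

# Skip on pre-release builds of Python 3.13.0, where the reference-cycle
# traversal can trigger the CPython crash tracked in python/cpython#124538.
skip_on_313_prerelease = pytest.mark.skipif(
    sys.version_info[:3] == (3, 13, 0)
    and sys.version_info.releaselevel != "final",
    reason="https://github.com/python/cpython/issues/124538",
)
```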
There are two failures on the most recent test run:
- The AppVeyor + micromamba failure seems unrelated to my change. Is that failing on main too?
- The code coverage check is complaining that I added an unreached line in the test. The unreached line is the return True in is_in_reference_cycle that would cause the test to fail, so it's a good thing that line is not reached! This seems like a benign thing to ignore, to me.
What are we waiting for to merge this? Is there anything I need to do? The only failing test (AppVeyor) seems unrelated to my changes, and I can see it failing on other PRs. Has it been fixed on main yet? Should I pull latest main again?