matplotlib icon indicating copy to clipboard operation
matplotlib copied to clipboard

Break Artist._remove_method reference cycle

Open justinjhendrick opened this issue 1 year ago • 11 comments

PR summary

I think I've found a reference cycle that can consume quite a bit of memory. The size of the reachable bytes from the reference cycle seems to grow linearly with the number of points I've plotted and Figure.clear() does not remove them. gc.collect() would remove this memory, but in performance critical applications, waiting for the GC is no fun.

The reference cycle looks like this:

Axes -> _children -> Artist (Line2D for example)
            ^           |
             \          | field: "_remove_method"
              \         v
             _children.remove               

(credit to refcycle for helping me find it)

This PR breaks the cycle by emptying out the Axes._children list instead of replacing it with an empty list. This removes the reference from the list to the Artist, allowing the reference count to reach 0 on the Artist.

I've found this method to be an effective workaround (without modifying matplotlib)

def real_clear(fig):
    artists = fig.get_children()
    for ax in fig.axes:
        artists.extend(ax.get_children())
    for artist in artists:
        try:
            artist.remove()
        except NotImplementedError:
            pass
    fig.clear()

But I would like to contribute this memory usage improvement back into matplotlib. Though I'm not sure the best way. This PR is the smallest and simplest change, but I'm not certain if it covers enough similar reference cycles or just the one that showed up in my profiling. I also don't know if this is a clean way to solve the problem because I'm not very familiar with matplotlib internals. I considered making Artist._remove_method a weakref but that seems like a larger change that I'm not ready for (yet?). How do you think we should approach this reference cycle?

Note that this PR branches from https://github.com/matplotlib/matplotlib/commit/a47e26bd8583f0fcacc1dd83c3f2ac39b1f7a091 (not latest main) because of the issues I encountered with the next commit (See https://github.com/matplotlib/matplotlib/issues/28866 and https://github.com/matplotlib/matplotlib/commit/597554db667344c4c4ad026ec1c6dd5f4c688d8f#commitcomment-147024800).

Here is the benchmark script that I've been using to evaluate the effectiveness of these changes:

import gc

import matplotlib
matplotlib.use("Agg")

from matplotlib.figure import Figure
from numpy.random import rand
import psutil

def remove_artists(fig):
    artists = fig.get_children()
    for ax in fig.axes:
        artists.extend(ax.get_children())
    for artist in artists:
        try:
            artist.remove()
        except NotImplementedError:
            pass

def plot(num_points, rm_artists):
    fig = Figure()
    ax = fig.add_subplot()
    ax.plot(rand(num_points))
    if rm_artists:
        remove_artists(fig)
    fig.clear()

def delta_mb_after_plot(num_points, rm_artists):
    proc = psutil.Process()
    before_mb = proc.memory_info().rss / 1e6
    plot(num_points, rm_artists)
    after_mb = proc.memory_info().rss / 1e6
    gc.collect()
    # after_gc_mb = proc.memory_info().rss / 1e6
    return after_mb - before_mb

def main():
    gc.disable()

    baseline = delta_mb_after_plot(1_000_000, False)
    with_rm_artists = delta_mb_after_plot(1_000_000, True)
    print(f"1 million points: {round(baseline - with_rm_artists, 3)}MB")

    baseline = delta_mb_after_plot(10_000_000, False)
    with_rm_artists = delta_mb_after_plot(10_000_000, True)
    print(f"10 million points: {round(baseline - with_rm_artists, 3)}MB")

if __name__ == "__main__":
    main()

Running the benchmark before this change prints (showing that the workaround breaks the cycle)

1 million points: 48.263MB
10 million points: 479.572MB

And after (showing that the PR behaves similarly to the workaround)

1 million points: 0.242MB
10 million points: -0.606MB

PR checklist

  • [x] [N/A I don't have an issue to link to] "closes #0000" is in the body of the PR description to link the related issue
  • [x] [Test added] new and changed code is tested
  • [x] [N/A No new features] Plotting related features are demonstrated in an example
  • [x] [N/A No new features or API changes] New Features and API Changes are noted with a directive and release note
  • [x] [N/A No documentation changes needed] Documentation complies with general and docstring guidelines

justinjhendrick avatar Sep 21 '24 22:09 justinjhendrick

Thank you for this. The number of loops we have is known problem that we are always open to improve! The reason for this bit of code is exactly this (to break the cycle between the Artist and its parents).

I think we are doing it with the swap like this to reduce the amount of time are are in an invalid state (we say that we are not thread safe, but that does not mean we should intentionally drive to invalid states). The invalid state I'm worried about here is the case where there are artists in the children that have their parents cleared which will cause draws to fail. I also have a concern that there maybe other references to the list object floating around.

I would propose a slight variation on this change where we keep the swap, but add old_children.clear() which I think still breaks the cycle that needs to be broken with a smaller change.

tacaswell avatar Sep 23 '24 10:09 tacaswell

Should we also set chld._remove_method = None inside the loop before we clear the list? I'm asking because someone could still have a reference to the child and try to call remove on it. Which will raise ValueError after the Axes have been cleared, which is new behavior.

justinjhendrick avatar Sep 23 '24 16:09 justinjhendrick

Should we also set chld._remove_method = None inside the loop before we clear the list?

Yes please.

tacaswell avatar Sep 24 '24 11:09 tacaswell

Looks like the new test is failing on python v3.13.

Relevant logs:

Fatal Python error: Segmentation fault

Current thread 0x00007fefd9bdab80 (most recent call first):
  File "/home/runner/work/matplotlib/matplotlib/lib/matplotlib/tests/test_axes.py", line 9201 in is_in_reference_cycle
  File "/home/runner/work/matplotlib/matplotlib/lib/matplotlib/tests/test_axes.py", line 9226 in test_axes_clear_reference_cycle

I wonder if the GC is running at the same time as is_in_reference_cycle? I'll try gc.disable() during the graph traversal and see if that changes anything

justinjhendrick avatar Sep 24 '24 17:09 justinjhendrick

I think you have found a CPython bug as you should not be able to segfault Python from Python.

tacaswell avatar Sep 25 '24 07:09 tacaswell

and it looks like the scatter got dropped from the test at some point.

tacaswell avatar Sep 25 '24 07:09 tacaswell

and it looks like the scatter got dropped from the test at some point.

I still see ax.scatter on line 9216. I wonder if Github is having trouble syncing?

justinjhendrick avatar Sep 25 '24 14:09 justinjhendrick

I think you have found a CPython bug as you should not be able to segfault Python from Python.

I'll work on a minimal repro to submit to CPython then! What should we do with this test and this pull request while that's happening?

justinjhendrick avatar Sep 25 '24 14:09 justinjhendrick

I decided to skip the test on non-final versions of 3.13.0 because https://github.com/python/cpython/issues/124538 was marked as a release blocker. So, we should be good to re-enable this test on 3.13.0-final and later.

justinjhendrick avatar Sep 26 '24 03:09 justinjhendrick

There are two failures on the most recent test run:

  • appveyor + micromamba seems unrelated to my change. Is that failing on main too?
  • code coverage check is complaining that I added an unreached line in the test. The unreached line is the return True in is_in_reference_cycle that would cause the test to fail. So it's a good thing that line is not reached! This seems like a benign thing to ignore, to me.

justinjhendrick avatar Sep 27 '24 15:09 justinjhendrick

What are we waiting for to merge this? Is there anything I need to do? The only failing test (appVeyor) seems unrelated to my changes and I can see it failing on other PRs. Has it been fixed on main yet? Should I pull latest main again?

justinjhendrick avatar Oct 03 '24 15:10 justinjhendrick