pyopencl icon indicating copy to clipboard operation
pyopencl copied to clipboard

Nanobind leak warnings

Open matthiasdiener opened this issue 1 year ago • 9 comments

Describe the bug At the end of execution, what appears to be SVM-related pyopencl code causes nanobind leak warnings to appear:

nanobind: leaked 3 instances!
 - leaked instance 0x1010fdf88 of type "SVMAllocator"
 - leaked instance 0x1076bd3c8 of type "SVMPool"
 - leaked instance 0x101459258 of type "Context"
nanobind: leaked 3 types!
 - leaked type "pyopencl._cl.SVMAllocator"
 - leaked type "pyopencl._cl.Context"
 - leaked type "pyopencl._cl.SVMPool"
nanobind: leaked 20 functions!
 - leaked function ""
 - leaked function ""
 - leaked function ""
 - leaked function "__eq__"
 - leaked function ""
 - leaked function "alloc_size"
 - leaked function "free_held"
 - leaked function "__init__"
 - leaked function ""
 - leaked function "__call__"
 - leaked function "__init__"
 - ... skipped remainder
nanobind: this is likely caused by a reference counting issue in the binding code.

To Reproduce

  1. $ cd pyopencl
  2. $ python examples/demo_array_svm.py (causes leak warnings)
  3. $ python examples/demo_array.py (does not cause leak warnings)

Edit: Easiest way to reproduce:

$ python -c 'import pyopencl as cl; ctx = cl.create_some_context(); f=cl.SVMAllocation(ctx, 10, 0, 0)'
nanobind: leaked 1 instances!
 - leaked instance 0x106203858 of type "Context"
nanobind: leaked 1 types!
 - leaked type "pyopencl._cl.Context"
nanobind: leaked 7 functions!
 - leaked function "__eq__"
 - leaked function "from_int_ptr"
 - leaked function "set_default_device_command_queue"
 - leaked function "__hash__"
 - leaked function ""
 - leaked function "get_info"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

(when not assigning to f, the warnings do not appear)

Expected behavior Don't show the warnings.

Environment (please complete the following information):

  • OS: MacOS
  • Python version: 3.11.9
  • PyOpenCL version: 2024.2.2

Additional context Note that #755 does not address this issue.

matthiasdiener avatar May 17 '24 15:05 matthiasdiener

Thanks for the small reproducer, that helps! I think part of the problem here is that SVMAllocation holds a reference to the CL context in a shared_ptr:

https://github.com/inducer/pyopencl/blob/f5e1b7b7acdf5beaf3d9435163ad4d099e36dd7a/src/wrap_cl.hpp#L3700

It needs to hold a reference, because it needs the context to free the SVM:

https://github.com/inducer/pyopencl/blob/f5e1b7b7acdf5beaf3d9435163ad4d099e36dd7a/src/wrap_cl.hpp#L3786

Nanobind's support for shared_ptr increases the context's reference count, and the behavior is overall quite complex (see part 1, part 2). While I feel that the context should still get GC'd (especially after the SVMAllocation is gone, which it must be, because it isn't getting leaked), that somehow doesn't seem to be happening. I'm not sure I have a theory on the precise mechanism.

As an alternative to the complexity of shared ownership between C++ and Python, nanobind also has intrusive reference counting. Maybe that's worth a try, for all the cases where currently shared_ptrs are being used.

inducer avatar May 17 '24 21:05 inducer

Actually, we may be dealing with what's described as the "uncollectable inter-language reference cycle" in nanobind's docs for intrusive RC.

inducer avatar May 17 '24 21:05 inducer

I feel like we should move to intrusive refcounting for context, queue, memory pools, and allocators. I'm out of time for today, but if you have time to get this started, I'd be open to looking at it.

inducer avatar May 17 '24 22:05 inducer

#759 is a start that already addresses your small reproducer.

inducer avatar May 19 '24 04:05 inducer

There still appear to be leak warnings (see e.g. https://github.com/inducer/pyopencl/actions/runs/9154012159/job/25163854069), but I can't reproduce them locally at the moment.

matthiasdiener avatar May 20 '24 16:05 matthiasdiener

Thanks for spotting that! For posterity:

 nanobind: leaked 7 instances!
 - leaked instance 0x55eb67f303f0 of type "pyopencl._cl.Device"
 - leaked instance 0x55eb67f697e0 of type "pyopencl._cl.Device"
 - leaked instance 0x55eb680ec850 of type "pyopencl._cl.Device"
 - leaked instance 0x55eb67edd4b0 of type "pyopencl._cl.Device"
 - leaked instance 0x55eb67929de0 of type "pyopencl._cl.Platform"
 - leaked instance 0x55eb67c9e200 of type "pyopencl._cl.Device"
 - leaked instance 0x55eb6c9cbe50 of type "pyopencl._cl.Platform"
nanobind: leaked 18 types!
 - leaked type "pyopencl._cl.Device"
 - leaked type "pyopencl._cl.SVMAllocator"
 - leaked type "pyopencl._cl.SVMPointer"
 - leaked type "pyopencl._cl.CommandQueue"
 - leaked type "pyopencl._cl.PooledBuffer"
 - leaked type "pyopencl._cl.ImmediateAllocator"
 - leaked type "pyopencl._cl.AllocatorBase"
 - leaked type "pyopencl._cl.Event"
 - leaked type "pyopencl._cl.Buffer"
 - leaked type "pyopencl._cl.Kernel"
 - leaked type "pyopencl._cl.SVMPool"
 - ... skipped remainder
nanobind: leaked 109 functions!
 - leaked function "__init__"
 - leaked function ""
 - leaked function "__call__"
 - leaked function "__call__"
 - leaked function "unbind_from_queue"
 - leaked function "__hash__"
 - leaked function "_set_arg_svm"
 - leaked function "from_int_ptr"
 - leaked function ""
 - leaked function "enqueue_release"
 - leaked function "_set_arg_buf_pack_multi"
 - ... skipped remainder
nanobind: this is likely caused by a reference counting issue in the binding code.

We should track those down sometime.

inducer avatar May 20 '24 17:05 inducer

Could we disable these messages, e.g. by giving access to [nb::set_leak_warnings()] ?(https://nanobind.readthedocs.io/en/latest/faq.html#why-am-i-getting-errors-about-leaked-functions-and-types)

vincefn avatar May 31 '24 06:05 vincefn

@vincefn The incidence of these should be substantially reduced post b659cad93a6f97535ea7fdd1c0c7c7e9b71f7a3e (unreleased). It might be good to disable these for release builds by default, because they don't have much utility from the user's end.

inducer avatar May 31 '24 13:05 inducer

One way to reproduce the warnings after b659cad is:

$ python examples/demo-struct-reduce.py
nanobind: leaked 1 instances!
 - leaked instance 0xaaaadcaedf00 of type "pyopencl._cl.Device"
nanobind: leaked 17 types!
 - leaked type "pyopencl._cl.Device"
 - leaked type "pyopencl._cl.SVMAllocator"
 - leaked type "pyopencl._cl.SVMPointer"
 - leaked type "pyopencl._cl.CommandQueue"
 - leaked type "pyopencl._cl.PooledBuffer"
 - leaked type "pyopencl._cl.ImmediateAllocator"
 - leaked type "pyopencl._cl.AllocatorBase"
 - leaked type "pyopencl._cl.Event"
 - leaked type "pyopencl._cl.Buffer"
 - leaked type "pyopencl._cl.Kernel"
 - leaked type "pyopencl._cl.SVMPool"
 - ... skipped remainder
nanobind: leaked 103 functions!
 - leaked function "device_and_host_timer"
 - leaked function "get_cl_header_version"
 - leaked function "from_int_ptr"
 - leaked function "_set_arg_multi"
 - leaked function ""
 - leaked function ""
 - leaked function "set_arg"
 - leaked function "__init__"
 - leaked function "from_int_ptr"
 - leaked function "get_host_array"
 - leaked function "get_work_group_info"
 - ... skipped remainder
nanobind: this is likely caused by a reference counting issue in the binding code.

For some reason, these do not show up on my Mac, just on Linux.

matthiasdiener avatar May 31 '24 16:05 matthiasdiener