
Disable numba caching via environment variable

timothymillar opened this issue 3 years ago

Edit: related to #371

I've recently started experimenting with sgkit on a SLURM cluster, which is working well with the exception of methods using guvectorize with cache=True. Calling these functions results in a segmentation fault on the worker. This only seems to be an issue with guvectorize (not the jit or vectorize decorators), and there is no segmentation fault if I set cache=False.

There are a couple of open issues that may be related, although neither quite matches what I'm seeing (I need to dig some more):

  • https://github.com/dask/distributed/issues/3450
  • https://github.com/numba/numba/issues/4807

There is also an open issue for globally disabling numba caching, which would provide a workaround, although it might be stale:

  • https://github.com/numba/numba/issues/4549

In the meantime, for the sake of debugging and workarounds, it'd be useful to be able to disable numba caching in sgkit using an environment variable.

timothymillar, Jul 03 '22 23:07

Can this be closed now that #870 is in?

tomwhite, Aug 02 '22 16:08

Maybe we should leave it open for now to document the SGKIT_DISABLE_NUMBA_CACHE variable. I also wondered if you had a suggestion for how to test in CI that setting the variable works as expected?

timothymillar, Aug 02 '22 21:08

I've hit this via #1051. Interestingly, I get a different error (TypeError: can not serialize 'numpy.int64' object) if I disable task fusion in dask (dask.config.set({"optimization.fuse.active": False})).
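This class of error is easy to reproduce outside dask: numpy scalar types are not builtin Python ints, so serializers that only handle builtin types reject them. The stdlib json encoder is used here as a stand-in for the msgpack-based serializer that distributed actually uses, so the exact message differs:

```python
import json

import numpy as np

n = np.int64(3)
try:
    # The default JSON encoder only recognises builtin types,
    # so a numpy scalar falls through to TypeError.
    json.dumps(n)
except TypeError as e:
    print(f"failed to serialize: {e}")

# Coercing to a builtin int first makes it serializable.
print(json.dumps(int(n)))
```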

benjeffery, Mar 09 '23 14:03

After much digging I have discovered some interesting things about these segfaults. As noted above, turning off dask task fusion results in the serialization error. Digging into the code, this is because we are passing numpy.int64 to some dask methods instead of int. For example, if I change:

@wraps(gufunc)
def func(x: ArrayLike, cohort: ArrayLike, n: int, axis: int = -1) -> ArrayLike:
    x = da.swapaxes(da.asarray(x), axis, -1)

(from cohort_numba_fns.py) to:

@wraps(gufunc)
def func(x: ArrayLike, cohort: ArrayLike, n: int, axis: int = -1) -> ArrayLike:
    n = int(n)
    axis = int(axis)
    x = da.swapaxes(da.asarray(x), axis, -1)

Then the serialisation error is fixed!

BUT if I then turn dask task fusion back on, the segfault is gone!! So I think that in the fused task a compiled function is expecting an int, but getting a numpy.int64, and then segfaulting?

There are other segfaults still happening; I assume they have similar causes.
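The hypothesis above is consistent with how numpy scalars behave: a numpy.int64 passes many duck-typing checks but is not a builtin int, so code expecting a native Python int can receive something else entirely. A quick check of the distinction that the int() coercion in the fix above papers over:

```python
import numpy as np

n = np.int64(3)
assert isinstance(n, np.integer)  # it is a numpy integer scalar...
assert not isinstance(n, int)     # ...but not a builtin Python int (on Python 3)
assert type(int(n)) is int        # int() coercion restores a builtin int
```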

(@jeromekelleher numpy ints strike again!)

benjeffery, Mar 09 '23 16:03