Pickling/unpickling a dtype resets isbuiltin flag
This just happened to me with numpy 1.8.0 (from Debian):
>>> import numpy as np
>>> x = np.dtype(np.float32)
>>> from pickle import dumps, loads
>>> y = loads(dumps(x))
>>> x.isbuiltin
1
>>> y.isbuiltin
0
Note that this also appears to change the hash of the dtype, post-unpickling.
Still in 1.10.4.
I guess this hasn't been a priority because, while probably a bug, it seems trivial enough that it's unlikely to reach the top of anyone's todo list... but if you are checking on it two years later, then maybe that assessment is wrong? Does it matter somehow?
(I can see how the hash issue could cause problems, but I can't reproduce that on 1.11.0rc1:
In [15]: x = np.dtype(np.float32)
In [16]: y = loads(dumps(x))
In [17]: hash(x) == hash(y)
Out[17]: True
In [18]: x == y
Out[18]: True
)
Still in 1.11.0
I just came across this issue after spending some time debugging a weird caching issue.
I feed the data into a graphics library which checks whether the dtype of the data passed to it is a builtin type. Since this check fails after unpickling, the application crashes.
To make things worse, there seems to be no simple way to convert the data type back to a builtin type. The only reliable solution I could find is to convert the array into a list and then create a new array from it, which is slow.
Still exists
The documentation seems to be too terse. np.dtype('S0').isbuiltin is 1 but np.dtype('S10').isbuiltin is 0. What is that attribute meant to convey?
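For what it's worth, a quick session suggests `isbuiltin` really reports whether you hold the cached prototype descriptor, not whether the scalar type itself is built in (behavior observed on NumPy 1.x/2.x; the exact values are an implementation detail):

```python
import numpy as np

# 1: the cached "prototype" instance for a builtin scalar type
print(np.dtype(np.float32).isbuiltin)            # 1

# 0: any freshly allocated descriptor, even for a builtin scalar type --
# swapping the byte order forces a copy of the prototype
print(np.dtype('f8').newbyteorder().isbuiltin)   # 0

# 0: structured dtypes (and sized flexible types like 'S10')
print(np.dtype([('a', 'f4')]).isbuiltin)         # 0

# (2 is reserved for user-defined types registered through the C API)
```

That would explain why `'S0'` (the size-0 prototype of the flexible string type) reports 1 while `'S10'` (a fresh sized copy) reports 0.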
One of our vispy users (see https://github.com/vispy/vispy/issues/1741) has run into this issue when trying to provide pickled data to vispy. Vispy does some checks on the provided data arrays including checking the isbuiltin flag.
I tried looking for the various __getstate__/__setstate__ methods that would apply to the dtype object being saved, but couldn't decipher the numpy source. If one of the numpy devs could point me to the relevant code I can try to find time to track this down.
I think you may be looking for arraydescr_reduce and arraydescr_setstate in numpy/core/src/multiarray/descriptor.c. Make sure there are new tests in numpy/core/tests/test_dtype.py
Thanks @mattip. I have a guess but won't have time to test things out right now. I'm a little confused by the use of NULL versus Py_None in the descriptor.c file. In the isbuiltin getter we see:
https://github.com/numpy/numpy/blob/9656e4c00d52d9583c577dbb892a090abaa3dd99/numpy/core/src/multiarray/descriptor.c#L1968-L1970
This == Py_None check appears in other places in this file too, but there are also checks that do PyDataType_HASFIELDS(self), which tests whether self->names is NULL. Elsewhere in this file there are checks for self->fields == NULL. Some of the other == Py_None comparisons against the fields make sense (I think) because numpy is taking what the user gave and converting it to NULL:
https://github.com/numpy/numpy/blob/9656e4c00d52d9583c577dbb892a090abaa3dd99/numpy/core/src/multiarray/descriptor.c#L1745-L1747
It's a start.
According to:
https://github.com/numpy/numpy/blob/dbcdeb971f15e0f19f6745409462bee7972e5bbf/numpy/core/include/numpy/ndarraytypes.h#L624-L628
fields should/could be Py_None... I'm a little worried that NULL and Py_None are two valid but different cases.
I think the situation might be that None should never be set on an ordinary instance. However, datatypes are currently weird, and there is basically a single prototype instance around that gets copied (and sometimes used). I somewhat think that None is basically used to indicate that the specific instance is the prototype instance and needs copying.
Now, after the copying though, you have to clear the None, since copying it would be incorrect... It is all pretty complex logic.
In any case, using this knowledge, the code for arraydescr_isbuiltin_get should probably be adapted (and quite frankly removing that whole None business could be good). There is a list of all builtin descriptors around, so the code could be adapted to check against that maybe, OTOH that probably does not solve the actual problem here: unpickling returns a new descriptor, but some code expects that e.g. np.dtype(np.float32) is a singleton instance when it is only a singleton instance in most cases.
In any case, my complaints about this being all a bit strange are unrelated to the issue. isbuiltin tries (probably succeeds) to tell you whether you got the default singleton instance.
Unpickling, however, has a whole different issue here. If we really want the singleton instance to be returned when unpickling, then we cannot use __setstate__, but instead must use __getnewargs__. I suppose the actual code does not need to change all that much and might actually simplify a bit, since it only needs to prepare the input for the np.dtype() call?
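As a rough illustration of the __getnewargs__ idea (a toy Python class, not numpy's actual C implementation): if __new__ consults a prototype cache and pickle goes through __new__, unpickling hands back the singleton instead of a fresh copy.

```python
import pickle

# Registry of "prototype" instances, mimicking numpy's builtin descriptor table.
_SINGLETONS = {}

class Descr:
    """Toy descriptor: one shared instance per type name."""
    def __new__(cls, name):
        # Return the cached prototype if one exists.
        if name in _SINGLETONS:
            return _SINGLETONS[name]
        self = super().__new__(cls)
        self.name = name
        _SINGLETONS[name] = self
        return self

    def __getnewargs__(self):
        # With pickle protocol >= 2, unpickling calls
        # Descr.__new__(Descr, self.name), which hits the cache above,
        # so the singleton comes back instead of a fresh object.
        return (self.name,)

x = Descr('float32')
y = pickle.loads(pickle.dumps(x))
print(y is x)  # True: the unpickled object is the very same instance
```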
I am not 100% sure what happens with pickle version 0 and 1 (i.e. whether __getnewargs__ or __getnewargs_ex__ is actually called for those).
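A small pure-Python experiment suggests __getnewargs__ is only consulted for protocol 2 and above; protocols 0 and 1 reconstruct through copyreg._reconstructor and bypass the class's __new__ entirely:

```python
import pickle

calls = []

class D:
    def __new__(cls, *args):
        calls.append(args)  # record what __new__ receives
        return super().__new__(cls)

    def __init__(self, *args):
        self.args = args

    def __getnewargs__(self):
        return self.args

d = D('f4')

calls.clear()
pickle.loads(pickle.dumps(d, protocol=2))
protocol2_calls = list(calls)   # [('f4',)] -- __getnewargs__ was used

calls.clear()
pickle.loads(pickle.dumps(d, protocol=0))
protocol0_calls = list(calls)   # [] -- protocol 0 never calls D.__new__
```

So a __getnewargs__-based fix would cover the modern protocols, but the old ones would still need handling via the reduce path.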
The angle is to check whether the isbuiltin call there is actually useful...
Sorry for the long story, but it seems to me:
- `isbuiltin` sounds (and the documentation reads) like it will return True when the dtype is built into numpy (maybe aside from structured DTypes)
- `isbuiltin` in fact returns 2 for user defined dtypes and will return 0 for dtypes such as `np.dtype(">float64")`!
- `isbuiltin` also returns 0 if you attach metadata to a float64 or so (which may be OK)
- I would not trust `isbuiltin` to be generally reliable. The unpickling is only one such case! We could clean up numpy in this regard probably, but downstream can break this as well (easily)!
This is because the actual logic includes "is this the (static) prototype instance", which is so obscure that nobody should even have to understand what that means.
In summary: I would be in favor of introducing a new method which does what vispy actually wants to achieve here (whatever that is) and deprecating the current one, because it seems brittle (and quite frankly useless). Alternatively, we just deprecate it; chances are vispy does not even need it...
Edit: To clarify "is this the (static) prototype instance", just read it as the "singleton instance", but we do not guarantee very well that np.dtype(np.float64) is really a singleton.
Hmmm, scipy and numba do not actually use PyArray_DescrNewFromType, PyArray_DescrFromType, or even PyArray_DescrNew, so maybe we can get away with deprecating at least the second. (Someone who uses the New versions may be tempted to mess around with the dtype fields; that is the only reason to use the New versions. In fact, any place that does not do that should not be using the New version at all!)
EDIT: Astropy only uses the first one, and not the New versions, as well.
Alternatively, we just deprecate it, chances are vispy does not even need it...
That's what I'm thinking. I should really try to figure out what vispy was originally trying to do when this if statement was added. If it is most concerned with there being fields then I will just check the fields attribute directly. It could have been copied from one of the "parent" projects that vispy was created from.
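For what it's worth, if the check is really about ruling out structured dtypes, inspecting `dtype.fields` directly survives a pickle round trip, unlike `isbuiltin` (a sketch, assuming that is what the vispy check is for):

```python
import numpy as np
from pickle import dumps, loads

def is_plain_dtype(dt):
    """True for unstructured dtypes: `fields` is None unless the dtype
    has named fields, and that property survives pickling."""
    return dt.fields is None

x = np.dtype(np.float32)
y = loads(dumps(x))

print(is_plain_dtype(x))                        # True
print(is_plain_dtype(y))                        # True, even where y.isbuiltin is 0
print(is_plain_dtype(np.dtype([('a', 'f4')])))  # False
```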
Thanks for all the info. By the sounds of it this won't be something a non-core dev will "fix" and it doesn't sound like it is something that should be fixed.
There are 2-3 related issues here ;). Making the pickling return a singleton issue can be fixed by anyone who knows the C-API well enough I think. Cleaning up the rest of NumPy could be attempted, but probably needs pointers (and good python C-API knowledge again). (EDIT: to be clear, I am not actually sure it is worth the trouble as of now).
But deprecating/removing isbuiltin can be done by anyone, it is a fairly simple pull request!
FWIW, I don't see any workaround posted here. astype(float) or something similar should fix it:
In [1]: import numpy as np
In [2]: from pickle import dumps, loads
In [3]: x = np.zeros(1, dtype=float)
In [4]: y = loads(dumps(x))
In [5]: z = y.astype(float)
In [6]: x.dtype.isbuiltin
Out[6]: 1
In [7]: y.dtype.isbuiltin
Out[7]: 0
In [8]: z.dtype.isbuiltin
Out[8]: 1
I can reproduce this issue with `copy.copy()` or `copy.deepcopy()`.