Encoding an empty unicode array would produce an array of the wrong dtype
Calling numpy.char.encode on an empty unicode array creates a float64 array instead of an array of S dtype.
Reproducing code example:
import numpy
print(numpy.char.encode(numpy.array([], 'U'), 'utf8').dtype)
# This would output:
# float64
I would expect an empty S1 array.
Error message:
The dtype returned seems wrong.
Numpy/Python version information:
>>> import sys, numpy; print(numpy.__version__, sys.version)
1.16.2 3.7.2 (default, Dec 29 2018, 06:19:36)
[GCC 7.3.0]
This was run in a conda environment (I just did a "conda create -n test_numpy python=3.7 numpy"). The problem seems to exist in earlier numpy versions as well (1.15).
The shape also gets messed up, e.g.:
print(numpy.char.encode(numpy.array([], 'U').reshape((1, 0, 1)), 'utf8').shape)
This prints (1, 0) instead of the original shape (1, 0, 1).
Decode is also affected by this bug btw.
The bug is in _to_string_or_unicode_array, which impacts all of:
- mod
- decode
- encode
- expandtabs
- join
- partition
- replace
- rpartition
The fix is probably to work out the correct type ahead of time, rather than guessing from the array contents.
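Until that is fixed in numpy itself, a user-side workaround might look something like the sketch below. It is only an illustration, not the proposed library fix: encode_keeping_dtype is a hypothetical helper name, and falling back to S1 for empty input is an assumption based on the expectation stated above.

```python
import numpy

def encode_keeping_dtype(arr, encoding="utf8"):
    """Hypothetical helper: encode a unicode array, special-casing empty input."""
    arr = numpy.asarray(arr)
    if arr.size == 0:
        # Nothing to infer an itemsize from, so decide the dtype up front
        # (assumed here to be S1) and keep the input shape.
        return numpy.empty(arr.shape, dtype="S1")
    # Non-empty input: defer to numpy.char.encode as usual.
    return numpy.char.encode(arr, encoding)

empty = numpy.array([], "U").reshape((1, 0, 1))
result = encode_keeping_dtype(empty)
print(result.dtype, result.shape)
```

The point is just that the dtype and shape are chosen before looking at the contents, so an empty array never falls through to a default float64.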
This Stack Overflow question is another report of the bug: "Why does numpy's np.char.encode turn an empty unicode array into an empty float64 array?"
Here's an older issue that reports the same problem: https://github.com/numpy/numpy/issues/7371