numpy Encoding an empty unicode array would produce an array of the wrong dtype

Calling numpy.char.encode on an empty unicode array creates a float64 array instead of an array of S dtype.

Reproducing code example:

import numpy
print(numpy.char.encode(numpy.array([], 'U'), 'utf8').dtype)
# This would output:
# float64

I would expect an empty S1 array.

Error message:

The dtype returned seems wrong.

Numpy/Python version information:

>>> import sys, numpy; print(numpy.__version__, sys.version)
1.16.2 3.7.2 (default, Dec 29 2018, 06:19:36)
[GCC 7.3.0]

This was run in a conda environment (I just did a "conda create -n test_numpy python=3.7 numpy"). The problem seems to exist in earlier numpy versions as well (1.15).

Mar 18 '19 21:03 will133

The shape also seems to get messed up. I.e.:

print(numpy.char.encode(numpy.array([], 'U').reshape((1, 0, 1)), 'utf8').shape)

This prints (1, 0) instead of the original shape, (1, 0, 1).

Nov 08 '19 14:11 newt0311

Decode is also affected by this bug btw.

Dec 05 '19 13:12 newt0311

The bug is in _to_string_or_unicode_array, which impacts all of:

mod
decode
encode
expandtabs
join
partition
replace
rpartition

The fix is probably to work out the correct type ahead of time, rather than guessing from the array contents.
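
To illustrate the idea, here is a rough sketch. It assumes the helper rebuilds its result from a plain Python list (which would explain both the float64 dtype and the lost trailing dimension), and encode_keep_empty is a hypothetical user-side wrapper, not a NumPy function and not the actual fix. It picks S1 for the empty case because that is the dtype the original report expected.

```python
import numpy as np

# Assumed mechanism: rebuilding an array from a plain Python list drops the
# string dtype and any dimensions that follow a zero-length one.
empty = np.array([], dtype="U").reshape((1, 0, 1))
roundtrip = np.asarray(empty.tolist())   # empty.tolist() is [[]]
print(roundtrip.dtype, roundtrip.shape)  # float64 (1, 0)


def encode_keep_empty(arr, encoding="utf8"):
    """Hypothetical wrapper around numpy.char.encode (not a NumPy API)."""
    arr = np.asarray(arr)
    if arr.size == 0:
        # Nothing to encode, so decide the result dtype and shape up front
        # instead of inferring them from (non-existent) contents.
        return np.empty(arr.shape, dtype="S1")
    return np.char.encode(arr, encoding)


result = encode_keep_empty(empty)
print(result.dtype, result.shape)
```

Non-empty input still goes through numpy.char.encode unchanged; only the empty case is handled separately, so this is a workaround for callers rather than a replacement for fixing the helper itself.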

Dec 05 '19 14:12 eric-wieser

This Stack Overflow question is another report of the bug: "Why does numpy's np.char.encode turn an empty unicode array into an empty float64 array?"

Jul 19 '22 14:07 WarrenWeckesser

Here's an older issue that reports the same problem: https://github.com/numpy/numpy/issues/7371

Jul 19 '22 14:07 WarrenWeckesser