[BUG] `cudf.testing.assert_*_equal` raises AssertionError for equivalent `DecimalDtype`d objects
Describe the bug
In [1]: import cudf
In [2]: ser = cudf.Series([1], dtype=cudf.Decimal128Dtype(1))
In [3]: cudf.testing.assert_series_equal(ser, ser)
AssertionError: ColumnBase are different
values are different (100.0 %)
[left]: {"[Decimal('1')]"}
[right]: {"[Decimal('1')]"}
Expected behavior
I would expect no AssertionError.
It appears there's a testing function, dtype_can_compare_equal_to_other, used in column comparisons that over-zealously assumes two objects with DecimalDtypes shouldn't be compared to each other.
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of cuDF install: from source
Hi @mroeschke,
Based on the change history:
- Issue https://github.com/rapidsai/cudf/issues/8513 reported a bug in how the series equality function deals with NaNs (cp.nan).
- PR https://github.com/rapidsai/cudf/pull/10011 closed that issue and added dtype_can_compare_equal_to_other.
- PR https://github.com/rapidsai/cudf/pull/14638 made a refactor with no side effects.
Those changes introduced type checks on DecimalDtype that were not necessary to fix the bug; I think they are over-zealous.
Hypothesis
cupy does not fully implement numpy's asarray; at the very least, it does not accept Decimal128Dtype as a dtype.
Reproduce
I tried removing cudf.core.dtypes.DecimalDtype from the check in the function dtype_can_compare_equal_to_other, so that Decimal128Dtype is treated like a numeric dtype whose values can compare equal to each other.
def assert_column_equal(
    ...
    left.apply_boolean_mask(
        left.isnull().unary_operator("not")
    ).values,
    ...
cudf/cudf/core/column/column.py
@property
def values(self) -> cupy.ndarray:
    """
    Return a CuPy representation of the Column.
    """
    if len(self) == 0:
        return cupy.array([], dtype=self.dtype)
    if self.has_nulls():
        raise ValueError("Column must have no nulls.")
    return cupy.asarray(self.data_array_view(mode="write"))
Calling .values on the decimal column then raises:
TypeError: Cannot interpret 'Decimal128Dtype(precision=1, scale=0)' as a data type
Code to reproduce:
import cudf
ser = cudf.Series([1], dtype=cudf.Decimal128Dtype(1))
left = ser._column
left.apply_boolean_mask(left.isnull().unary_operator("not")).values
With numpy:
import numpy
obj = left.apply_boolean_mask(left.isnull().unary_operator("not"))
numpy.asarray(obj)
Out[11]:
array(<cudf.core.column.decimal.Decimal128Column object at 0x726ea7de4f70>
[
1
]
dtype: decimal128, dtype=object)
With cupy:
import cupy
cupy.asarray(obj)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[14], line 1
----> 1 cupy.asarray(obj)
File ~/Code/cudf/.venv/lib/python3.10/site-packages/cupy/_creation/from_data.py:88, in asarray(a, dtype, order, blocking)
56 def asarray(a, dtype=None, order=None, *, blocking=False):
57 """Converts an object to array.
58
59 This is equivalent to ``array(a, dtype, copy=False, order=order)``.
(...)
86
87 """
---> 88 return _core.array(a, dtype, False, order, blocking=blocking)
File cupy/_core/core.pyx:2408, in cupy._core.core.array()
File cupy/_core/core.pyx:2435, in cupy._core.core.array()
File cupy/_core/core.pyx:2574, in cupy._core.core._array_default()
ValueError: Unsupported dtype object
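The numpy result above is worth a second look: numpy did not convert the decimal values, it wrapped the whole column in a zero-dimensional object array. The sketch below reproduces that fallback without a GPU. The class Opaque is a hypothetical stand-in for any object numpy has no conversion for; it is not part of cudf.

```python
import numpy as np


class Opaque:
    """Hypothetical stand-in for an object numpy cannot convert to numbers."""


# numpy falls back to wrapping the object itself instead of raising.
wrapped = np.asarray(Opaque())

is_object_dtype = wrapped.dtype == object   # the array holds a Python object
is_zero_dim = wrapped.shape == ()           # one wrapped object, not a sequence of values
```

So the numpy call succeeding does not mean the values became comparable as an array. cupy has no object dtype to fall back to, which matches the "Unsupported dtype object" error in the traceback above.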
We'll first need to assert that the dtypes are equivalent, then probably use pandas assertion functions instead of cupy/numpy for comparing decimal values.
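A minimal sketch of that two-step order, written in plain Python so it runs without a GPU. Every name in it (DecimalMeta, assert_decimal_equal) is hypothetical and only illustrates the idea; it is not cudf's implementation. In cudf the second step would presumably run on host data through pandas' assertion functions rather than a hand-written loop.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class DecimalMeta:
    """Hypothetical stand-in for a decimal dtype: precision and scale only."""
    precision: int
    scale: int


def assert_decimal_equal(left_vals, left_dtype, right_vals, right_dtype):
    """Hypothetical helper: check dtype metadata first, then the values."""
    # Step 1: the dtypes must be equivalent before values are inspected.
    if left_dtype != right_dtype:
        raise AssertionError(f"dtypes differ: {left_dtype} != {right_dtype}")
    # Step 2: compare values on the host as exact Decimal objects,
    # avoiding any conversion to a cupy/numpy array.
    if len(left_vals) != len(right_vals):
        raise AssertionError("lengths differ")
    mismatches = [
        i for i, (lv, rv) in enumerate(zip(left_vals, right_vals)) if lv != rv
    ]
    if mismatches:
        raise AssertionError(f"values differ at positions {mismatches}")


# Mirrors the case from the report: an object compared with itself passes.
dtype = DecimalMeta(precision=1, scale=0)
assert_decimal_equal([Decimal("1")], dtype, [Decimal("1")], dtype)
```

Checking the dtype first keeps a precision or scale mismatch from being reported as a value difference.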