Binary type not supported by tensorflow_io.arrow

Open cyc opened this issue 4 years ago • 3 comments

It would be good if tensorflow_io.arrow could support a broader variety of data types. In particular, the Arrow Binary type does not appear to be supported yet.

Reproducible example (pyarrow 2.0.0, tensorflow_io 0.17.0):

import tensorflow_io.arrow as arrow_io
import pyarrow
import tensorflow as tf

arr = pyarrow.array([b'a', b'bb', b'ccc'])
table = pyarrow.Table.from_arrays([arr], ['arr'])
print(table.schema)
ads = arrow_io.ArrowDataset.from_record_batches(
    table.to_batches(),
    output_types=(tf.string,),
    output_shapes=(tf.TensorShape(None),),
    batch_size=1,
    batch_mode='drop_remainder')
dd = next(iter(ads))

Results in:

arr: binary

tensorflow.python.framework.errors_impl.InternalError: Invalid: Invalid argument: arrow data type 0x7ff7a8457388 is not supported: Type error: Arrow data type is not supported [Op:IteratorGetNext]

cyc avatar Apr 13 '21 20:04 cyc

This should be pretty straightforward to add. String types are already supported, and those are just binary arrays in Arrow.

BryanCutler avatar Apr 13 '21 20:04 BryanCutler

@BryanCutler is there currently a way to work around this error?

sayakpaul avatar Jun 25 '22 19:06 sayakpaul

Is this the place where binary type support should be added? Could you provide some pointers, if possible?

https://github.com/tensorflow/io/blob/f31422e0eeb08e6336411009d316ff9d0d36edf1/tensorflow_io/core/kernels/arrow/arrow_kernels.cc#L620-L626

lhoestq avatar Jul 26 '22 09:07 lhoestq