
Dense Union incompatible between Julia/Python

Open baumgold opened this issue 3 years ago • 2 comments

When Arrow.jl writes a table containing a nullable column that uses Nothing (e.g. Union{Nothing,Int32}), the resulting Arrow file cannot be read into pandas via pyarrow:

julia> Arrow.write("/tmp/nothing.arrow", (col=Vector{Union{Nothing,Int32}}([1,2,3,nothing]),))
"/tmp/nothing.arrow"

julia> Arrow.Table("/tmp/nothing.arrow")
Arrow.Table with 4 rows, 1 columns, and schema:
 :col  Union{Missing, Nothing, Int32}

In [9]: df = pandas.read_feather("/tmp/nothing.arrow")
-----------------------------------------------------------------
ArrowNotImplementedError        Traceback (most recent call last)
<ipython-input-9-9b9a515df158> in <module>
----> 1 df = pandas.read_feather("/tmp/nothing.arrow")

~/miniconda3/lib/python3.9/site-packages/pandas/io/feather_format.py in read_feather(path, columns, use_threads, storage_options)
    128     ) as handles:
    129
--> 130         return feather.read_feather(
    131             handles.handle, columns=columns, use_threads=bool(use_threads)
    132         )

~/miniconda3/lib/python3.9/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
    218     """
    219     _check_pandas_version()
--> 220     return (read_table(source, columns=columns, memory_map=memory_map)
    221             .to_pandas(use_threads=use_threads))
    222

~/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    787     _check_data_column_metadata_consistency(all_columns)
    788     columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 789     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    790
    791     axes = [columns, index]

~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
   1126     # Convert an arrow table to Block from the internal pandas API
   1127     columns = block_table.column_names
-> 1128     result = pa.lib.table_to_blocks(options, block_table, categories,
   1129                                     list(extension_columns.keys()))
   1130     return [_reconstruct_block(item, columns, extension_columns)

~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()

~/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type dense_union<: null=0, : int32 not null=1> is known.

Note that when Missing is used instead of Nothing, pyarrow can read the data written by Arrow.jl:

julia> Arrow.write("/tmp/missing.arrow", (col=Vector{Union{Missing,Int32}}([1,2,3,missing]),))
"/tmp/missing.arrow"

In [1]: pandas.read_feather("/tmp/missing.arrow")
Out[1]:
   col
0  1.0
1  2.0
2  3.0
3  NaN
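
The difference between the two files appears to come down to Arrow layout rather than content. Going by the type in the error message (dense_union<null=0, int32=1>), the Nothing column is stored as a dense union with a null child and an int32 child, while the Missing column is an ordinary nullable int32. The pure-Python sketch below models both layouts as described in the Arrow columnar format; the function names are hypothetical, and the exact child contents Arrow.jl writes (for instance, a null child of length 1) are an assumption made for illustration.

```python
# Illustrative pure-Python model of two Arrow layouts.
# All names here are hypothetical; this is not the pyarrow or Arrow.jl API.

def decode_nullable(values, validity):
    """Nullable primitive layout: a values buffer plus a validity bitmap."""
    return [v if valid else None for v, valid in zip(values, validity)]

def decode_dense_union(type_ids, offsets, children):
    """Dense union layout: each row's type id picks a child array,
    and its offset indexes into that child."""
    return [children[t][o] for t, o in zip(type_ids, offsets)]

# Union{Missing,Int32}: int32 values with a validity bitmap (row 3 is null).
missing_col = decode_nullable([1, 2, 3, 0], [True, True, True, False])

# Union{Nothing,Int32}: child 0 holds nulls, child 1 holds int32 values,
# mirroring dense_union<null=0, int32=1> from the error message.
# ASSUMPTION: a null child of length 1 is used here purely for illustration.
null_child = [None]
int_child = [1, 2, 3]
nothing_col = decode_dense_union(
    [1, 1, 1, 0],   # type ids: three int32 rows, then one null row
    [0, 1, 2, 0],   # offsets into the selected child
    [null_child, int_child],
)

print(missing_col)
print(nothing_col)
```

Both columns decode to the same logical values, which is why the Missing variant round-trips: pandas has a block type for nullable int32 data, but the traceback above shows it has none for a dense union.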

baumgold, Feb 16 '22 21:02

As a workaround, all Union{Nothing,T} types can be mapped to use Missing instead of Nothing, which seems to allow Python to read the Arrow files generated by Arrow.jl. The drawback is that a nothing (now converted to missing) can no longer be distinguished from an actual missing.

julia> ArrowTypes.ArrowType(::Type{Union{Nothing,T}}) where {T} = Union{Missing,ArrowTypes.ArrowType(T)}

julia> Arrow.write("/tmp/nothing.arrow", (col=Vector{Union{Nothing,Int32}}([1,2,3,nothing]),))
"/tmp/nothing.arrow"

julia> Arrow.Table("/tmp/nothing.arrow")
Arrow.Table with 4 rows, 1 columns, and schema:
 :col  Union{Missing, Int32}

In [1]: pandas.read_feather("/tmp/nothing.arrow")
Out[1]:
   col
0  1.0
1  2.0
2  3.0
3  NaN

baumgold, Feb 17 '22 01:02

Related to #258 and ARROW-15767

baumgold, Feb 17 '22 01:02