arrow-julia
Dense Union incompatible between Julia/Python
When Arrow.jl writes a table containing a column that is nullable via Nothing (element type Union{Nothing,T}), the resulting Arrow file cannot be read into pandas through PyArrow:
julia> Arrow.write("/tmp/nothing.arrow", (col=Vector{Union{Nothing,Int32}}([1,2,3,nothing]),))
"/tmp/nothing.arrow"
julia> Arrow.Table("/tmp/nothing.arrow")
Arrow.Table with 4 rows, 1 columns, and schema:
:col Union{Missing, Nothing, Int32}
In [9]: df = pandas.read_feather("/tmp/nothing.arrow")
-----------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
<ipython-input-9-9b9a515df158> in <module>
----> 1 df = pandas.read_feather("/tmp/nothing.arrow")
~/miniconda3/lib/python3.9/site-packages/pandas/io/feather_format.py in read_feather(path, columns, use_threads, storage_options)
128 ) as handles:
129
--> 130 return feather.read_feather(
131 handles.handle, columns=columns, use_threads=bool(use_threads)
132 )
~/miniconda3/lib/python3.9/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
218 """
219 _check_pandas_version()
--> 220 return (read_table(source, columns=columns, memory_map=memory_map)
221 .to_pandas(use_threads=use_threads))
222
~/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()
~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
787 _check_data_column_metadata_consistency(all_columns)
788 columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 789 blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
790
791 axes = [columns, index]
~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
1126 # Convert an arrow table to Block from the internal pandas API
1127 columns = block_table.column_names
-> 1128 result = pa.lib.table_to_blocks(options, block_table, categories,
1129 list(extension_columns.keys()))
1130 return [_reconstruct_block(item, columns, extension_columns)
~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()
~/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type dense_union<: null=0, : int32 not null=1> is known.
Note that when Missing is used instead of Nothing, PyArrow can read the data written by Arrow.jl:
julia> Arrow.write("/tmp/missing.arrow", (col=Vector{Union{Missing,Int32}}([1,2,3,missing]),))
"/tmp/missing.arrow"
In [1]: pandas.read_feather("/tmp/missing.arrow")
Out[1]:
col
0 1.0
1 2.0
2 3.0
3 NaN
As a workaround, every Union{Nothing,T} element type can be mapped to Union{Missing,T} by overloading ArrowTypes.ArrowType, which seems to allow Python to read the Arrow files generated by Arrow.jl. The drawback is that a nothing (now written as missing) can no longer be distinguished from an actual missing.
julia> ArrowTypes.ArrowType(::Type{Union{Nothing,T}}) where {T} = Union{Missing,ArrowTypes.ArrowType(T)}
julia> Arrow.write("/tmp/nothing.arrow", (col=Vector{Union{Nothing,Int32}}([1,2,3,nothing]),))
"/tmp/nothing.arrow"
julia> Arrow.Table("/tmp/nothing.arrow")
Arrow.Table with 4 rows, 1 columns, and schema:
:col Union{Missing, Int32}
In [1]: pandas.read_feather("/tmp/nothing.arrow")
Out[1]:
col
0 1.0
1 2.0
2 3.0
3 NaN
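A narrower variant of the same workaround, sketched here under assumptions, converts the affected column before writing instead of overloading ArrowTypes.ArrowType for every Union{Nothing,T} in the session. The helper name nothing_to_missing, the extra column col_was_nothing, and the output path are hypothetical and not part of Arrow.jl. The Bool column is one possible way to record which entries were originally nothing, so the distinction is not lost on the Python side.

```julia
using Arrow

# Hypothetical helper (not part of Arrow.jl): replace `nothing` with `missing`
# so the column is written as an ordinary nullable column, not a dense union.
nothing_to_missing(col) = replace(col, nothing => missing)

col = Vector{Union{Nothing,Int32}}([1, 2, 3, nothing])

# Remember which entries were `nothing` before the conversion (assumption:
# a separate Bool column is an acceptable way to keep that information).
was_nothing = map(isnothing, col)

converted = nothing_to_missing(col)   # element type: Union{Missing, Int32}

# Hypothetical output path, chosen for illustration only.
Arrow.write("/tmp/nothing_as_missing.arrow",
            (col = converted, col_was_nothing = was_nothing))
```

This only touches the columns that are converted explicitly and leaves Arrow.jl's handling of Nothing in other tables unchanged.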
Related to #258 and ARROW-15767