ARROW-14596: [C++][Python] Read table nested struct fields in columns
Part of ARROW-14596 and ARROW-13798.
This PR does not attempt to solve selecting from lists; it only supports dotted paths into arbitrarily nested structs using the existing FromDotPath. Selecting from lists will require further discussion, as well as kernels that can select struct subsets from lists.
Until then, a user can select subsets of a struct from a list element via:
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table(
    {"user_id": ["abc123", "qrs456"],
     "interaction": [
         {
             "type": "click",
             "element": {'a': "button"},
             "values": [1, 2],
             "structs": [{"foo": "bar"}]},
         {
             "type": "scroll",
             "element": {'a': "window"},
             "values": [3, 4],
             "structs": [{"fizz": "buzz"}]}
     ]
     })
pq.write_table(table, "test_nested_data.parquet")
dataset = ds.dataset("test_nested_data.parquet")
# Select only a single field from the structs in a list
dataset.to_table(columns={
    "result_name": pc.struct_field(
        pc.list_element(pc.struct_field(ds.field("interaction"), [1]),
                        ds.scalar(0)),
        [0])
    }
)
# pyarrow.Table
# result_name: string
# ----
# result_name: [[null,"buzz"]]
# Or keeping struct shape...
struct_subset_type = pa.struct([("structs", pa.list_(pa.struct([("fizz", pa.string())])))])
dataset.to_table(columns={"interaction": ds.field("interaction").cast(struct_subset_type)})
# pyarrow.Table
# interaction: struct<structs: list<item: struct<fizz: string>>>
# child 0, structs: list<item: struct<fizz: string>>
# child 0, item: struct<fizz: string>
# child 0, fizz: string
# ----
# interaction: [
# -- is_valid: all not null
# -- child 0 type: list<item: struct<fizz: string>>
# [ -- is_valid: all not null
# -- child 0 type: string
# [null], -- is_valid: all not null
# -- child 0 type: string
# ["buzz"]]]
https://issues.apache.org/jira/browse/ARROW-14596
I generally agree with the first two points, although I do like the explicitness of a leading dot.
As for the third point, it could be a potentially buggy convenience add-on, e.g.:
pa.table({"a.dotted.field": [1], "nested": [{"field": "value"}]})
# implicitly will create two columns called 'field', is that okay?
(..., columns=["a.dotted.field", "nested.field"])
versus an explicit mapping, maybe?
(..., columns=[{"my_dotted_field": "a.dotted.field", "nested_field": "nested.field"}])
or convert dots to underscores?
Updated in https://github.com/apache/arrow/pull/14326/commits/1b4914f4493ae5658a9f02186f07d3f308445ae2 to use the last segment of the dotted path as the column name. I also decided to leave any actual column whose name contains a dot alone; if the user has dotted column names, is it up to us to rename them automatically? A dotted path seems to give more leniency in this regard.
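For illustration only, the naming rule described above could be sketched in plain Python roughly as follows. This is not the code from the commit: `output_column_name` and `duplicate_output_names` are hypothetical helpers, and tolerating a leading dot is an assumption made here, not something the PR confirms.

```python
from collections import Counter


def output_column_name(path, top_level_names):
    """Hypothetical helper: pick the output name for a requested column."""
    # A real top-level column whose name contains dots is left untouched.
    if path in top_level_names:
        return path
    # Otherwise treat the string as a dotted path and keep its last segment.
    # Stripping a leading dot is an assumption of this sketch.
    return path.lstrip(".").split(".")[-1]


def duplicate_output_names(paths, top_level_names):
    """Hypothetical helper: names produced by more than one requested path."""
    counts = Counter(output_column_name(p, top_level_names) for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


requested = ["a.dotted.field", "nested.field"]

# "a.dotted.field" exists as a literal column, so it keeps its name.
literal_names = [output_column_name(p, {"a.dotted.field", "nested"})
                 for p in requested]

# With only nested structs "a" and "nested", both paths end in "field".
clashes = duplicate_output_names(requested, {"a", "nested"})
```

Under these assumptions the literal dotted column is not renamed, while two nested paths ending in the same segment are reported as a clash that the caller would have to resolve, for example with an explicit name mapping.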
Parquet is still materializing the entire column, correct?
Yes, for that I opened https://issues.apache.org/jira/browse/ARROW-17959
Thanks!