ARROW-14596: [C++][Python] Read table nested struct fields in columns

Open milesgranger opened this issue 3 years ago • 4 comments

Part of ARROW-14596 and ARROW-13798.

This PR does not propose to solve selecting from lists; it only supports dotted paths into arbitrarily nested structs using the existing FromDotPath. Selecting from lists will require further discussion, as well as kernel support for selecting struct subsets from lists.

Until then, a user can select subsets of a struct from a list element via:


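import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table(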
    {"user_id": ["abc123", "qrs456"],
     "interaction": [
        {
            "type": "click",
            "element": {'a': "button"},
            "values": [1, 2],
            "structs":[{"foo": "bar"}]},
        {
            "type": "scroll",
            "element": {'a': "window"},
            "values": [3, 4],
            "structs":[{"fizz": "buzz"}]}
     ]
    })

pq.write_table(table, "test_nested_data.parquet")
dataset = ds.dataset("test_nested_data.parquet")

# Select only single field from structs in a list
dataset.to_table(columns={
        "result_name": pc.struct_field(
                pc.list_element(pc.struct_field(ds.field("interaction"), [1]), 
                                ds.scalar(0)), 
                [0])
        }
)
# pyarrow.Table
# result_name: string
# ----
# result_name: [[null,"buzz"]]

# Or keeping struct shape...

struct_subset_type = pa.struct([("structs", pa.list_(pa.struct([("fizz", pa.string())])))])
dataset.to_table(columns={"interaction": ds.field("interaction").cast(struct_subset_type)})
# pyarrow.Table
# interaction: struct<structs: list<item: struct<fizz: string>>>
#   child 0, structs: list<item: struct<fizz: string>>
#       child 0, item: struct<fizz: string>
#           child 0, fizz: string
# ----
# interaction: [
#   -- is_valid: all not null
#   -- child 0 type: list<item: struct<fizz: string>>
# [      -- is_valid: all not null
#       -- child 0 type: string
# [null],      -- is_valid: all not null
#       -- child 0 type: string
# ["buzz"]]]
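
For the dotted paths themselves, which are what this PR targets, the idea can be illustrated on plain Python records. This is only a rough sketch: `split_dot_path` and `select_by_dot_path` are hypothetical helpers written for illustration, not pyarrow functions, and the parsing is deliberately simplified (an optional leading dot, no escaping, no list indexing), so it should not be read as the behaviour of FromDotPath.

```python
# Illustrative only: a plain-Python stand-in for dotted-path selection.
# Both helpers are hypothetical; neither is part of pyarrow.

def split_dot_path(path):
    """Split '.a.b' (or 'a.b') into ['a', 'b']; no escaping or [index] support."""
    trimmed = path[1:] if path.startswith(".") else path
    if not trimmed:
        raise ValueError("empty path")
    segments = trimmed.split(".")
    if any(not segment for segment in segments):
        raise ValueError(f"empty segment in path: {path!r}")
    return segments


def select_by_dot_path(record, path):
    """Walk nested dicts along the path; a missing key yields None."""
    value = record
    for name in split_dot_path(path):
        if not isinstance(value, dict):
            return None
        value = value.get(name)
    return value


rows = [
    {"user_id": "abc123", "interaction": {"type": "click", "element": {"a": "button"}}},
    {"user_id": "qrs456", "interaction": {"type": "scroll", "element": {"a": "window"}}},
]
selected = [select_by_dot_path(row, ".interaction.element.a") for row in rows]
print(selected)  # ['button', 'window']
```

Lists are intentionally left out of the sketch, mirroring the scope stated above: a path only descends through struct-like (dict) values.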

milesgranger, Oct 05 '22 13:10

https://issues.apache.org/jira/browse/ARROW-14596

github-actions[bot], Oct 05 '22 13:10

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot], Oct 05 '22 13:10

I generally agree with the first two points, although I do like the explicitness of a leading dot.

As for the third point, it could be a potentially buggy convenience add-on, e.g.:

pa.table({"a.dotted.field": [1], "nested": [{"field": "value"}]})
# implicitly will create two columns called 'field', is that okay?
(..., columns=["a.dotted.field", "nested.field"])

vs explicitly mapping maybe?

(..., columns=[{"my_dotted_field": "a.dotted.field", "nested_field": "nested.field"}])

or convert dots to underscores?
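
To make the collision concern concrete, here is a small pure-Python sketch comparing the naming options mentioned above. The helper names (`output_names`, `duplicated`) and the strategy labels are invented for illustration and do not correspond to anything in pyarrow.

```python
from collections import Counter

# Hypothetical helpers for comparing output-naming strategies; not pyarrow APIs.

def output_names(paths, strategy):
    """Derive an output column name for each requested dotted path."""
    if strategy == "last_segment":
        return [path.split(".")[-1] for path in paths]
    if strategy == "underscores":
        return [path.replace(".", "_") for path in paths]
    raise ValueError(f"unknown strategy: {strategy!r}")


def duplicated(names):
    """Return the names that occur more than once, sorted."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


requested = ["a.dotted.field", "nested.field"]

last_segment = output_names(requested, "last_segment")
underscores = output_names(requested, "underscores")
# An explicit mapping lets the caller pick unambiguous names up front.
explicit = {"my_dotted_field": "a.dotted.field", "nested_field": "nested.field"}

print(last_segment, duplicated(last_segment))  # ['field', 'field'] ['field']
print(underscores, duplicated(underscores))    # ['a_dotted_field', 'nested_field'] []
print(list(explicit))                          # ['my_dotted_field', 'nested_field']
```

Note that underscore conversion only reduces the risk: a request for both `a.b` and an existing `a_b` column would still collide, whereas an explicit mapping leaves the choice to the caller.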

milesgranger, Oct 07 '22 10:10

Updated in https://github.com/apache/arrow/pull/14326/commits/1b4914f4493ae5658a9f02186f07d3f308445ae2 to take the last segment of the dot-delimited path as the column name. I also chose to leave any actual column whose name contains a dot alone; if the user has dotted column names, is it really up to us to rename them automatically? A dotted path seems to give more leniency in this regard.
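
One way to read the rule described here, sketched in plain Python: a requested name that exactly matches an existing top-level column is used untouched, and anything else is treated as a dotted path whose last segment becomes the output name. `resolve_column` is a hypothetical helper reflecting that reading only; it is not the code from the linked commit.

```python
# Hypothetical sketch of the resolution rule; not taken from the commit itself.

def resolve_column(requested, top_level_columns):
    """Return (path_segments, output_name) for a requested column string."""
    if requested in top_level_columns:
        # A real column whose name happens to contain dots is left alone.
        return [requested], requested
    trimmed = requested[1:] if requested.startswith(".") else requested
    segments = trimmed.split(".")
    # Otherwise treat it as a nested path and name the result after the leaf.
    return segments, segments[-1]


top_level = {"a.dotted.field", "nested"}

print(resolve_column("a.dotted.field", top_level))
# (['a.dotted.field'], 'a.dotted.field')
print(resolve_column("nested.field", top_level))
# (['nested', 'field'], 'field')
```

Under this reading the earlier example no longer yields two columns called `field`: the literal `a.dotted.field` column keeps its full name and only the nested path is shortened.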

milesgranger, Oct 10 '22 09:10

Parquet is still materializing the entire column, correct?

Yes, for that I opened https://issues.apache.org/jira/browse/ARROW-17959

jorisvandenbossche, Oct 18 '22 13:10

Thanks!

jorisvandenbossche, Oct 20 '22 07:10