ARROW-17960: [C++][Python] Implement list_slice kernel
https://issues.apache.org/jira/browse/ARROW-17960
@jorisvandenbossche Switched to using the list builders; think that's for the better. Ready for another look. :)
There are three TODOs, (as noted in the code as well) let me know if you think any should be done on this PR:
-
start==stopshould give an array of empy lists -
step>1support step in slicing -
stop==-1slice to end.
There are three TODOs, (as noted in the code as well) let me know if you think any should be done on this PR:
I think those are certainly fine to leave as a follow-up.
For me, I think API wise I still have the questions:
- What do we want to use as the default return type for slicing a variable size list? Return a fixed size list (padding with nulls to get the correct length), or keep a variable size list (not padding with nulls, just slice until the end if stop > size of a list element)
- For slicing fixed size list arrays, what do we want to do when slicing until after the end (stop > list size)? Padding with nulls and always return a fixed size list of size
stop - start, or slice until the end (Python slice semantics) and return a fixed size list of sizemin(list_size, stop -start)
What do we want to use as the default return type for slicing a variable size list? Return a fixed size list (padding with nulls to get the correct length), or keep a variable size list (not padding with nulls, just slice until the end if stop > size of a list element)
Right now it's set to return fixed size list, but I have no opinion on this.
For slicing fixed size list arrays, what do we want to do when slicing until after the end (stop > list size)? Padding with nulls and always return a fixed size list of size stop - start, or slice until the end (Python slice semantics) and return a fixed size list of size min(list_size, stop -start)
IMO, it should flow like this:
- Input type -> FixedSizeLIstArray:
- Return type -> FixedSizeListArray:
- Slice to
stop, padding if stop > list_size
- Slice to
- Return type -> ListArray:
- Slice to
min(list_size, stop)
- Slice to
- Return type -> FixedSizeListArray:
- Input type -> ListArray:
- Return type -> FixedSizeListArray:
- Slice to
stoppadding if stop > list_size
- Slice to
- Return type -> ListArray:
- Slice to
min(list_size, stop)with no null padding.
- Slice to
- Return type -> FixedSizeListArray:
@lidavidm would you mind giving this a look when you have time?
@lidavidm thanks a lot for the review! Do you have an opinion on some of the API questions raised at https://github.com/apache/arrow/pull/14395#issuecomment-1293369447 ? (and Miles' answer just below)
I am still feeling unsure about the return_fixed_size_list keyword and its default. Especially if we would support "slicing until the end" (i.e. without specifying a stop), it feels strange that the default would then check for the longest list and pad all others (to produce a fixed size list).
Another default could be to always preserve the original type by default (so fixed size -> fixed size, variable size -> variable size), with still an option to choose explicitly. But then we need to make that keyword optional as well, since the default value would depend on the input type.
What do we want to use as the default return type for slicing a variable size list? Return a fixed size list (padding with nulls to get the correct length), or keep a variable size list (not padding with nulls, just slice until the end if stop > size of a list element)
IMO it makes more sense to slice variable sized lists to variable sized lists by default.
For slicing fixed size list arrays, what do we want to do when slicing until after the end (stop > list size)? Padding with nulls and always return a fixed size list of size stop - start, or slice until the end (Python slice semantics) and return a fixed size list of size min(list_size, stop -start)
I agree with Miles's proposal above.
@lidavidm is this OK for you?
Most of the R tests are failing, can we rebase and check?
Benchmark runs are scheduled for baseline = 3b852e49fec85b57545c6edc6c66d3da93de2c06 and contender = c3cfc7934ebdc652399af95b8696bd5a05d943fa. c3cfc7934ebdc652399af95b8696bd5a05d943fa is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Finished :arrow_down:0.3% :arrow_up:0.03%] test-mac-arm
[Finished :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x
[Finished :arrow_down:1.19% :arrow_up:0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] c3cfc793 ec2-t3-xlarge-us-east-2
[Finished] c3cfc793 test-mac-arm
[Finished] c3cfc793 ursa-i9-9960x
[Finished] c3cfc793 ursa-thinkcentre-m75q
[Finished] 3b852e49 ec2-t3-xlarge-us-east-2
[Finished] 3b852e49 test-mac-arm
[Finished] 3b852e49 ursa-i9-9960x
[Finished] 3b852e49 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java