arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17960: [C++][Python] Implement list_slice kernel

Open milesgranger opened this issue 3 years ago • 4 comments

Will fix ARROW-17960

milesgranger avatar Oct 13 '22 04:10 milesgranger

https://issues.apache.org/jira/browse/ARROW-17960

github-actions[bot] avatar Oct 13 '22 04:10 github-actions[bot]

@jorisvandenbossche Switched to using the list builders; think that's for the better. Ready for another look. :)

There are three TODOs, (as noted in the code as well) let me know if you think any should be done on this PR:

  • start==stop should give an array of empy lists
  • step>1 support step in slicing
  • stop==-1 slice to end.

milesgranger avatar Oct 20 '22 16:10 milesgranger

There are three TODOs, (as noted in the code as well) let me know if you think any should be done on this PR:

I think those are certainly fine to leave as a follow-up.

For me, I think API wise I still have the questions:

  • What do we want to use as the default return type for slicing a variable size list? Return a fixed size list (padding with nulls to get the correct length), or keep a variable size list (not padding with nulls, just slice until the end if stop > size of a list element)
  • For slicing fixed size list arrays, what do we want to do when slicing until after the end (stop > list size)? Padding with nulls and always return a fixed size list of size stop - start, or slice until the end (Python slice semantics) and return a fixed size list of size min(list_size, stop -start)

jorisvandenbossche avatar Oct 27 '22 11:10 jorisvandenbossche

What do we want to use as the default return type for slicing a variable size list? Return a fixed size list (padding with nulls to get the correct length), or keep a variable size list (not padding with nulls, just slice until the end if stop > size of a list element)

Right now it's set to return fixed size list, but I have no opinion on this.

For slicing fixed size list arrays, what do we want to do when slicing until after the end (stop > list size)? Padding with nulls and always return a fixed size list of size stop - start, or slice until the end (Python slice semantics) and return a fixed size list of size min(list_size, stop -start)

IMO, it should flow like this:

  • Input type -> FixedSizeLIstArray:
    • Return type -> FixedSizeListArray:
      • Slice to stop, padding if stop > list_size
    • Return type -> ListArray:
      • Slice to min(list_size, stop)
  • Input type -> ListArray:
    • Return type -> FixedSizeListArray:
      • Slice to stop padding if stop > list_size
    • Return type -> ListArray:
      • Slice to min(list_size, stop) with no null padding.

milesgranger avatar Oct 28 '22 08:10 milesgranger

@lidavidm would you mind giving this a look when you have time?

milesgranger avatar Oct 31 '22 08:10 milesgranger

@lidavidm thanks a lot for the review! Do you have an opinion on some of the API questions raised at https://github.com/apache/arrow/pull/14395#issuecomment-1293369447 ? (and Miles' answer just below) I am still feeling unsure about the return_fixed_size_list keyword and its default. Especially if we would support "slicing until the end" (i.e. without specifying a stop), it feels strange that the default would then check for the longest list and pad all others (to produce a fixed size list).

Another default could be to always preserve the original type by default (so fixed size -> fixed size, variable size -> variable size), with still an option to choose explicitly. But then we need to make that keyword optional as well, since the default value would depend on the input type.

jorisvandenbossche avatar Nov 03 '22 18:11 jorisvandenbossche

What do we want to use as the default return type for slicing a variable size list? Return a fixed size list (padding with nulls to get the correct length), or keep a variable size list (not padding with nulls, just slice until the end if stop > size of a list element)

IMO it makes more sense to slice variable sized lists to variable sized lists by default.

For slicing fixed size list arrays, what do we want to do when slicing until after the end (stop > list size)? Padding with nulls and always return a fixed size list of size stop - start, or slice until the end (Python slice semantics) and return a fixed size list of size min(list_size, stop -start)

I agree with Miles's proposal above.

lidavidm avatar Nov 04 '22 11:11 lidavidm

@lidavidm is this OK for you?

jorisvandenbossche avatar Nov 15 '22 08:11 jorisvandenbossche

Most of the R tests are failing, can we rebase and check?

lidavidm avatar Nov 15 '22 13:11 lidavidm

Benchmark runs are scheduled for baseline = 3b852e49fec85b57545c6edc6c66d3da93de2c06 and contender = c3cfc7934ebdc652399af95b8696bd5a05d943fa. c3cfc7934ebdc652399af95b8696bd5a05d943fa is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Finished :arrow_down:0.3% :arrow_up:0.03%] test-mac-arm [Finished :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x [Finished :arrow_down:1.19% :arrow_up:0.0%] ursa-thinkcentre-m75q Buildkite builds: [Finished] c3cfc793 ec2-t3-xlarge-us-east-2 [Finished] c3cfc793 test-mac-arm [Finished] c3cfc793 ursa-i9-9960x [Finished] c3cfc793 ursa-thinkcentre-m75q [Finished] 3b852e49 ec2-t3-xlarge-us-east-2 [Finished] 3b852e49 test-mac-arm [Finished] 3b852e49 ursa-i9-9960x [Finished] 3b852e49 ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Nov 16 '22 02:11 ursabot