arrow: Is there any plan to develop 'utf8_slice_charunits', like the existing 'utf8_slice_codeunits'?

Jul 25 '22 07:07 chenbaggio

Like utf8_slice_codeunits, but with start/end/step expressed in byte units.

Jul 25 '22 07:07 chenbaggio

Hi @chenbaggio, could you expand on what you mean by a "byte unit"? If you mean char (a signed integral type), you should be able to use it (via casting) with the current int64_t type for start/end/step. I am probably misunderstanding your request, so could you give an example?

Also, are there other language implementations that have a similar operation?

Jul 27 '22 03:07 edponce

Dears: here is an example to describe why a function that extracts binary data in byte units is needed.

In a distributed database, data has a distribution policy and an associated hash algorithm for each data type. Here we discuss only string-like and binary types. The hash algorithm needs to split the string-like or binary value into bytes for the calculation: for example, take bytes 1-4, cast them to an integer and shift left by 16 bits, then take bytes 5-6, cast them to an integer and combine it with the result from the previous step, and so on.

The 'utf8_slice_codeunits' function partly meets this requirement if all characters are ASCII. But if the string contains Chinese, one Chinese character may occupy three bytes, so slicing from start 1 to end 3 yields three UTF-8 characters that may take nine bytes. That does not fit the hash algorithm, which needs only 3 bytes. So it would help to have a function that needs no cast and takes the same arguments as 'utf8_slice_codeunits'; it might be called 'binary_slice_byteunit'.

Jul 27 '22 08:07 chenbaggio

Thanks for the explanation. I consider it a reasonable compute function to add, given that some string functions come in multiple variants (ascii, utf8, binary). For example, the related functions binary_replace_slice and utf8_replace_slice already exist and cover the byte-oriented and codeunit-oriented forms of data, respectively.

I searched in JIRA issues and did not find one related to this request. @chenbaggio would you be willing to open a JIRA issue to add this compute function and submit a PR? If not, let us know and we will help with this request.

Jul 27 '22 15:07 edponce

Dears: sorry, it seems I cannot log in to JIRA because I do not have an account. Would you help to open the issue, or tell me how to register? Thank you!

Aug 04 '22 03:08 chenbaggio

You first need to create a JIRA account in order to be able to create issues. Here is the documentation guide for creating Apache Arrow issues.

Aug 04 '22 04:08 edponce

Thank you for your help. I created the issue ARROW-17301, but the assignee is still blank; please help to finish it. Thank you!

Aug 04 '22 07:08 chenbaggio

Hi @chenbaggio, it seems that a simple binary slice function may not be the full solution for your use case. Currently, Arrow only provides logical shifts, so neither arithmetic nor rotational shifts are available. If you think a rotational shift would suffice for your hash algorithm, please submit a separate ticket for that compute function.
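For readers unfamiliar with the three kinds of shift mentioned above, here is a small illustrative sketch in plain Python on 32-bit values. It does not use or describe any Arrow API; the function names and the 32-bit width are assumptions chosen for the example.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit word size for this illustration


def logical_shift_right(x: int, n: int) -> int:
    """Shift right and fill the vacated high bits with zeros."""
    return (x & MASK32) >> n


def arithmetic_shift_right(x: int, n: int) -> int:
    """Shift right and fill the vacated high bits with copies of the sign bit."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32  # reinterpret as a negative signed 32-bit value
    return (x >> n) & MASK32


def rotate_left(x: int, n: int) -> int:
    """Shift left; bits that fall off the top re-enter at the bottom."""
    n %= 32
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


print(hex(logical_shift_right(0x80000000, 4)))
print(hex(arithmetic_shift_right(0x80000000, 4)))
print(hex(rotate_left(0x80000001, 4)))
```

A rotation keeps every input bit in the result, which is why hash functions often prefer it over a plain shift.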

Aug 08 '22 21:08 edponce
