codept icon indicating copy to clipboard operation
codept copied to clipboard

Use `cuts` instead of `split_on_char`

Open dinosaure opened this issue 2 years ago • 4 comments

This PR is a possible improvement about what we expect when we use the basic function split_on_char. From my point of view, String.split_on_char is problematic because:

  • the function can split only on a character (it's better to write cuts ~sep:Filename.dir_sep instead of split_on_char ~sep:Filename.dir_sep.[0] - especially when this value depends on the system)
  • the function by default keeps empty parts. Sometimes, it's needed (especially for Paths.S.parse_filename) but sometimes it's not needed (others usages) and such behavior can lead to undefined behavior

For these two reasons, Support.cuts (took from the great astring):

  • take a string as a separator
  • explicit when we want to keep empty parts or not (and be sure about what we manipulate by this way)

The documentation explicits invariants too.

dinosaure avatar Feb 23 '23 12:02 dinosaure

I am surprised that astring is using the O( match_len * text_length) algorithm. Do you know why?

Octachron avatar Feb 23 '23 12:02 Octachron

I am surprised that astring is using the O( match_len * text_length) algorithm. Do you know why?

I don't know details about that (/cc @dbuenzli) and, from my point of view, the algorithm seems naive but good enough for the purpose.

dinosaure avatar Feb 23 '23 13:02 dinosaure

I don't think there's any particular reason. On most uses of these kind of functions it seems good enough.

That being said since I have been brought here I will give my two unsolicited €cents :–)

I'm not sure I see the point of this PR, it seems String.split_on_char does the job in every case, use that it already belongs to the stdlib. Less code and no complexity objections. I suspect I won't see in my life-time a platform which has multibyte path separators. If you worry about these empties then most of the applications here go over the list right after so you can eliminate those in that pass and/or simply provide a support functions that filters the result of String.split_on_char.

dbuenzli avatar Feb 23 '23 13:02 dbuenzli

I'm not sure I see the point of this PR, it seems String.split_on_char does the job in every case, use that it already belongs to the stdlib. Less code and no complexity objections. I suspect I won't see in my life-time a platform which has multibyte path separators. If you worry about these empties then most of the applications here go over the list right after so you can eliminate those in that pass and/or simply provide a support functions that filters the result of String.split_on_char.

Yes, to clarify the goal of this PR, it's just matter of readability which is particulary subjective depending on the reader. As you said, split_on_char does job but I let @Octachron to judge if it's better or not.

dinosaure avatar Feb 23 '23 13:02 dinosaure

I have pushed a commit 36bbd99fc88b which switch to the more complex version if the path separator is a full string (with room for optimisation if it is ever needed). I am not fond of the empty parameter that seems better handled by a separate function. I am thus closing this PR as completed, thanks for the suggestion!

Octachron avatar Apr 01 '24 19:04 Octachron