cpython icon indicating copy to clipboard operation
cpython copied to clipboard

python versions of bisect_right and bisect_left can fail guarantees

Open mike-neergaard opened this issue 1 year ago • 14 comments

Bug report

Bug description:

The documentation for bisect.bisect_right and bisect.bisect_left states in both cases that

The returned insertion point ip partitions the array a into two slices such that all(elem <= x for elem in a[lo : ip]) is true for the left slice and all(elem > x for elem in a[ip : hi]) is true for the right slice.

But in the python version of bisect, that guarantee can fail:

import sys
import _bisect as c_bisect
sys.modules['_bisect'] = None
import bisect as py_bisect

def check_guarantees(bisect_func, a, x, lo, hi):
    ip = bisect_func(a, x, lo, hi)
    check_pass = True 
    if not all(elem <= x for elem in a[lo : ip]): 
        print("\tFAIL. Elements of",a[lo:ip]," are not <=",x)
        check_pass = False
    if not all(elem > x for elem in a[ip : hi]):
        print("\tFAIL. Elements of",a[ip:hi]," are not >",x)
        check_pass = False
    if  check_pass: print("\tOk")

#print(c_bisect.bisect_right) # <built-in function bisect_right>
#print(py_bisect.bisect_right) # <function bisect_right at 0x0000023164AE7C40>

a = [0,1,2,3,4]
x = 8
lo = 3
hi = -1

print("bisect_left from C")
check_guarantees(c_bisect.bisect_left, a, x, lo, hi)

print("bisect_left from python")
check_guarantees(py_bisect.bisect_left, a, x, lo, hi)

print("bisect_right from C")
check_guarantees(c_bisect.bisect_right, a, x, lo, hi)

print("bisect_right from python")
check_guarantees(py_bisect.bisect_right, a, x, lo, hi)

Attempt this online Outputs :

bisect_left from C
	Ok
bisect_left from python
	FAIL. Elements of [3]  are not > 8
bisect_right from C
	Ok
bisect_right from python
	FAIL. Elements of [3]  are not > 8

I think there is an extra check in the C version for hi < 0.

CPython versions tested on:

3.13

Operating systems tested on:

Linux

Linked PRs

  • gh-125915

mike-neergaard avatar Oct 23 '24 17:10 mike-neergaard

I think neither the Python nor the C implementation are actually meant to work with a negative hi argument. The C implementation does special case hi == -1 but that's only because it is the default value produced by argument clinic when you omit hi (same for the Python case when hi is None).

We could either make it an error to pass a negative value as is the case with lo currently, or fix it to work with negative values.

cc @rhettinger since you last worked on this module

tomasr8 avatar Oct 23 '24 20:10 tomasr8

Bounds checking sounds great to me.

mike-neergaard avatar Oct 23 '24 21:10 mike-neergaard

If there are no other objections, I will initiate a PR that will synchronize the hi check with lo.

diff --git a/Modules/_bisectmodule.c b/Modules/_bisectmodule.c
index 56322c48b7c..668e475cb5c 100644
--- a/Modules/_bisectmodule.c
+++ b/Modules/_bisectmodule.c
@@ -61,7 +61,7 @@ internal_bisect_right(PyObject *list, PyObject *item, Py_ssize_t lo, Py_ssize_t
         PyErr_SetString(PyExc_ValueError, "lo must be non-negative");
         return -1;
     }
-    if (hi == -1) {
+    if (hi < 0) {
         hi = PySequence_Size(list);
         if (hi < 0)
             return -1;
@@ -245,7 +245,7 @@ internal_bisect_left(PyObject *list, PyObject *item, Py_ssize_t lo, Py_ssize_t h
         PyErr_SetString(PyExc_ValueError, "lo must be non-negative");
         return -1;
     }
-    if (hi == -1) {
+    if (hi < 0) {
         hi = PySequence_Size(list);
         if (hi < 0)
             return -1;

Alternatively, negative numbers are not allowed, consistent with lo.

rruuaanng avatar Oct 24 '24 07:10 rruuaanng

For the two solutions mentioned by

  1. We could either make it an error to pass a negative value as is the case with lo currently.
  2. fix it to work with negative values.

I chose the latter because almost all test functions in the test assume that when hi is a negative number, there will be a default processing. But I don't want to refactor this test on a large scale, so I chose the latter, and in the test, it mainly judges that when its hi is a negative number, it can automatically handle it. Is negative and it is not negative 1, it can be negative infinity.

rruuaanng avatar Oct 24 '24 08:10 rruuaanng

I appreciate your enthusiasm @rruuaanng but for future reference, I think it'd be better to wait a bit before opening a PR as this issue is relatively recent and many people probably haven't had the time to comment on it yet. For instance, it is still not clear which of the two options I suggested we should go for (or some other option I didn't think of).

tomasr8 avatar Oct 24 '24 10:10 tomasr8

I'm looking forward to the final result of the discussion. Maybe we can put PR on hold for the time being and wait for the result of the discussion.

rruuaanng avatar Oct 24 '24 11:10 rruuaanng

My 2 cents:

  • Idea 1: reject negative values for hi and change the sentinel value in both the Python and the C implementation.
  • Idea 2: make it work with negative values. It does not make sense however to only make negative hi work so we'll need to change the behaviour of a negative lo.
  • Idea 3: do nothing and document that the values should be non-negative. This is in between the status quo and idea 1. The Python implementation is generally not used because it's slower and I don't know when we cannot have the C implementation.

Personally, I think it should be fine with idea 1. By the way, we can ask AC to use a different sentinel value with some c_default IIRC.

picnixz avatar Oct 24 '24 11:10 picnixz

I agree with Bénédikt. However, if you make the hi argument consistent with the check for lo, you may need to refactor test_bisect.py, as it seems that many of the tests there already assume that hi can be negative and ignore the check for it (I've tried it)

rruuaanng avatar Oct 24 '24 11:10 rruuaanng

It's not an issue to refactor tests since consistency is usually preferred. However, before implementing anything, let us first decide whether we want to reject or accept negative values first.

picnixz avatar Oct 24 '24 11:10 picnixz

Idea 2: make it work with negative values. It does not make sense however to only make negative hi work so we'll >need to change the behaviour of a negative lo.

Idea 2 is interesting and would require some small change to the algorithm. For example, if the user asks to search [0,1,2] from -2 to 3, perhaps what they really mean is to search within [0,1,2][-2:3] = [1,2]. I am intrigued by the possibility that we could accomplish that by something on the order of

if lo < 0:
    lo = len(a) - lo

It's quite possible that the incautious user could wind up specifying hi < lo, but that case already works.

mike-neergaard avatar Oct 24 '24 11:10 mike-neergaard

While it could be interesting, I'm wondering whether it is useful. There are other places in the standard library where we don't assume negative indices because it complicates the flow and is not really useful. The fact that a negative lo was explicitly rejected shows that this was probably the original intent (namely, $0 \leq lo \leq hi \leq len(seq)$).

picnixz avatar Oct 24 '24 11:10 picnixz

Not sure what the general mood is on backwards compatibility. If existing code sometimes passes hi<0 and the code suddenly starts crashing?

I can argue that it's useful in the sense that the power to pass lo<0 gives me the ability to search the last few elements of an array - a power that I would have to exercise in some other way at the moment.

mike-neergaard avatar Oct 24 '24 11:10 mike-neergaard

If existing code sometimes passes hi<0 and the code suddenly starts crashing?

The docs are not really clear. They don't say "you cannot put a negative value" but the default values are 0 and len(seq), so I assume that they weren't really meant for negative indices. Note that the Python implementation is usually not the one we should align with since it's usually the slow version of the C implementation and we expect users to use the C implementation.

If existing code passes hi < 0 they are likely using an undocumented feature which does not do what they think it does (if they are using the C implementation). Indeed, passing hi == -1 just means using hi = len(seq) for now so passing hi = -2 is likely an undefined behaviour (although undocumented).

I'll defer the final decision to Raymond but I personally think we should not accept negative indices. If negative indices are important, it's better if the user convert them beforehand to positive ones rather than asking for us to convert it (this is not the costly step; the costly step is the search).

picnixz avatar Oct 24 '24 12:10 picnixz

It's a useless comment, but I will remark that this discussion is excellent both in technical depth and tone, and I appreciate both.

mike-neergaard avatar Oct 24 '24 13:10 mike-neergaard

I don't know that it changes anything, but I was able to get the C module to fail as well with the parameters

a = [0,1,2,3,4]
x = 8
lo = 2
hi = -2

mike-neergaard avatar Oct 24 '24 19:10 mike-neergaard

Well it's because there is no check on the hi value and that we assume that hi should just not be negative (that's what I meant by "undefined behaviour" though it's a well-defined one, since the loop would be empty at runtime; anyway, it's just not expected to work with negative values).

picnixz avatar Oct 24 '24 19:10 picnixz

Maybe you should modify the code for this module in your own interpreter (I'm guessing you didn't modify the local interpreter, but just ran it)

rruuaanng avatar Oct 25 '24 00:10 rruuaanng

Please RUANG, this kind of comment does not help in the discussion. For now there is nothing to modify.

We need to answer the question: should we allow negative values or not? if not, should we change the sentinel value so to raise an error in case of a negative value? (because we cannot distinguish the case where hi is not specified from the case where hi is explicitly -1).

My answer is: we should NOT support negative values given by the user. If they pass -1, that's just using an undocumented feature. Or if they pass -2, that's just playing with fire. We can say that -1 acts as the length of the sequence (and only -1, nothing else) or just change -1 into None as the sentinel.

I will repeat it again but the C implementation should be the one we align the Python implementation on when needed and not the reverse. While the Python implementation can use None as a sentinel and handle it accordingly without too much of performamce issues, the C implementation might not (accepting both None and integers instead of integers adds additional checks).

The Python implementation is slow so it does not really matter that we make it a bit slower by addding checks if needed. However @rhettinger already explained on the PR that the choice of a no-check (in the C implementation) was historical and deliberate.

So, with that in mind, I think we should just mention in the docs that passing negative indices is not supported, yet not eagerly checked for performance reasons.

Note that changing the sentinel might however affect performances due to AC generated code. I didn't run any benchmark so I cannot say how performances will be affected though. Nonetheless, I am pretty sure that fully supporting negative indices is a rare use case and that we shouldn't degrade performances for normal cases (because in the end we will still need to convert them into positive ones IMO).

@rhettinger Could you make a decision here please? I am -1 on accepting negative indices, +0 on having an eager check and +1 on having the docs clarified.

picnixz avatar Oct 25 '24 06:10 picnixz

Sorry picnixz. I'm more of a believer that an exception will be thrown when hi is negative. If hi is default, its value is a list value. I've always been in favor of not allowing negative numbers, because negative numbers don't help with searching. If no one disagrees with my opinion, I will change it to when hi is negative, it will throw an error, and by default, the value of hi is the length of the list

rruuaanng avatar Oct 25 '24 07:10 rruuaanng

>>> import _bisect
>>>
>>> _bisect.bisect_left([1,2],1,hi=-1)
0
>>> _bisect.bisect_left([1,2],1,hi=-2)
Traceback (most recent call last):
  File "<python-input-14>", line 1, in <module>
    _bisect.bisect_left([1,2],1,hi=-2)
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
ValueError: hi must be non-negative
>>> _bisect.bisect_left([1,2],1)
0

@picnixz @mike-neergaard Is this behavior acceptable? -1 will be used as a sentinel value, and when hi is defaulted, -1 will be used. However, it does not allow any negative number other than -1 (that is, negative infinity).

And this solution does not require a large-scale rewrite of test_bisect.py, in fact, just add a hi < 0 and keep the original content.

rruuaanng avatar Oct 25 '24 09:10 rruuaanng

The initial answer is yes, so you can take yes for an answer.

A different answer is that if I want to propose a faster bisection algorithm (which I do), I will eventually propose a check of the form

if (hi <= lo){
    return lo;
}

which would not throw an error, but would (I believe) fix the bug.

That is, however, beyond the scope of the current discussion, so your fix is fine with me.

mike-neergaard avatar Oct 25 '24 14:10 mike-neergaard