What is the expected behavior with extreme values?

Open dleviminzi opened this issue 1 year ago • 0 comments

Suppose we are computing an L2 distance on int8 vectors. What result do we expect for [127, 127, 127, 127] and [-128, -128, -128, -128]? The element-wise difference overflows: 127 - (-128) = 255, which is outside the int8 range [-128, 127].
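
To make the arithmetic concrete, here is a minimal NumPy sketch of what wraparound does to the result. It only illustrates int8 wraparound in NumPy; it is not a claim about how the extension's int8 kernel is implemented, and the variable names are made up for the example.

```python
import numpy as np

a = np.array([127, 127, 127, 127], dtype=np.int8)
b = np.array([-128, -128, -128, -128], dtype=np.int8)

# Staying in int8: 127 - (-128) = 255 wraps modulo 256 to -1.
wrapped_diff = a - b
wrapped_l2 = float(np.sqrt(np.sum(wrapped_diff * wrapped_diff)))

# Widening to int32 first keeps the true difference of 255 per element.
true_diff = a.astype(np.int32) - b.astype(np.int32)
true_l2 = float(np.sqrt(np.sum(true_diff * true_diff)))

print(wrapped_diff.tolist())  # [-1, -1, -1, -1]
print(wrapped_l2, true_l2)    # 2.0 510.0
```

With wraparound the distance comes out as 2.0 instead of 510.0, so the error is not a small rounding difference.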

Our options seem to be:

1. Use larger data types during calculations. We could cast the values to larger data types or use larger types for accumulation when calculating the result. This approach would allow us to compute the correct answer even if intermediate steps exceed the capacity of the original data type. However, this method would negatively impact performance.
2. Detect and warn about overflow. Alternatively, we could implement a system to detect when an overflow occurs and alert users to this issue. This approach maintains the original performance but requires additional logic to identify overflow situations.
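
As a rough illustration of the two options, here is a NumPy sketch. The function names `l2_widened` and `int8_diff_overflows` are hypothetical and exist only for this example. The detection helper widens internally for simplicity, so it says nothing about the cost of an actual in-kernel check.

```python
import numpy as np

def l2_widened(a, b):
    """Option 1 (sketch): widen int8 inputs to int32 before subtracting."""
    diff = a.astype(np.int32) - b.astype(np.int32)
    # Accumulate in int64 so the sum of squares cannot overflow for realistic lengths.
    return float(np.sqrt(np.sum(diff * diff, dtype=np.int64)))

def int8_diff_overflows(a, b):
    """Option 2 (sketch): report whether any element-wise difference leaves the int8 range."""
    diff = a.astype(np.int16) - b.astype(np.int16)
    info = np.iinfo(np.int8)
    return bool(np.any((diff < info.min) | (diff > info.max)))

a = np.array([127, 127, 127, 127], dtype=np.int8)
b = np.array([-128, -128, -128, -128], dtype=np.int8)

print(l2_widened(a, b))           # 510.0
print(int8_diff_overflows(a, b))  # True
```

The first helper returns the mathematically correct distance. The second only flags the problem and leaves it to the caller to decide what to do.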

The current behavior can be observed by adding the following test cases to test_vec_distance_l2:

    check([127, 127, 127], [-128, -128, -128], dtype=np.int8)
    check([127]*10, [-128]*10, dtype=np.int8)
    check([np.finfo(np.float32).max, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2], 
          [np.finfo(np.float32).min, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2, 0.1, 1.2])

and tweaking the check function:

    def check(a, b, dtype=np.float32):
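        if dtype == np.float32:
            transform = "?"
        elif dtype == np.int8:
            transform = "vec_int8(?)"
        else:
            # Avoid an unbound `transform` for dtypes this helper does not handle.
            raise ValueError(f"unsupported dtype: {dtype}")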

        a_sql_t = np.array(a, dtype=dtype)
        b_sql_t = np.array(b, dtype=dtype)
        
        x = vec_distance_l2(a_sql_t, b_sql_t, a=transform, b=transform)
        # Build the reference from untyped arrays (NumPy's default wide types)
        # so the expected value is correct; using the actual dtypes here
        # results in a warning about detected overflow.
        y = npy_l2(np.array(a), np.array(b))
        assert isclose(x, y, abs_tol=1e-6)
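
The float32 test case runs into the same problem for a different reason: the difference between the largest and smallest finite float32 values is not representable in float32. A minimal NumPy sketch follows, using a shortened version of the vectors above. It shows NumPy's behavior only, not the extension's.

```python
import numpy as np

f32_max = np.finfo(np.float32).max
a = np.array([f32_max, 0.1, 1.2], dtype=np.float32)
b = np.array([np.finfo(np.float32).min, 0.1, 1.2], dtype=np.float32)

# Staying in float32: max - min overflows to inf (warning suppressed here).
with np.errstate(over="ignore"):
    diff32 = a - b
    l2_f32 = np.sqrt(np.sum(diff32 * diff32))

# Widening to float64 keeps every intermediate value finite.
diff64 = a.astype(np.float64) - b.astype(np.float64)
l2_f64 = np.sqrt(np.sum(diff64 * diff64))

print(np.isinf(l2_f32))   # True
print(l2_f64 > f32_max)   # True
```

The correct distance here is about twice the float32 maximum, so it is finite only if the result is accumulated and returned in a wider type such as a double.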

Jun 25 '24 21:06 dleviminzi