
Fix binary search

Open sleepsort opened this issue 9 years ago • 7 comments

Basically, we cannot reuse the exit state (quant_A, mid) as the final result; as the following logs show, the current logic will produce a quantized vector with mean_err > target_error.

Since there are only log(N) candidate nodes in the search tree, it would be simpler to just recalculate the result in a final pass.

Log before fix: https://gist.github.com/sleepsort/1d0d1582825c6ea248eca19a0a471d1f

Log after fix: https://gist.github.com/sleepsort/64ac49af24dc98eefa3ea8c4575ab3b6

To reproduce:

  • Before patching this commit, download additional patch with data: https://gist.github.com/sleepsort/0585a47f42d200b1d445201e3d928474
  • Run command: ./model_quantization.py quant test.w2v.txt /tmp/test.w2v
  • Patch this commit
  • Rerun the command to compare.

sleepsort avatar Oct 02 '16 16:10 sleepsort

Yeah, I understand that the result is always near the error boundary. I just want to make sure the code expresses the intention clearly.

And no, the algorithm is not right; simply returning high doesn't solve the problem. A simple counter-example is an array of the same number, [0.1] * 128: the algorithm will return index 1 instead of index 0, which is surely wrong.

This error comes from the fact that you're using high - low > 1 as the loop condition. We cannot tighten this condition to narrow the position: due to integer truncation, mid = (low + high) / 2 will always equal low in some cases, causing an infinite loop.
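
The truncation pitfall can be seen in a couple of lines (a standalone illustration, not the project's code):

```python
# With a window of width one, integer division pins the midpoint to the
# lower bound, so a loop that only assigns low = mid can never shrink
# the window further.
low, high = 3, 4
mid = (low + high) // 2
assert mid == low  # mid collapses onto low; low = mid makes no progress
```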

I wrote an example test to show that my algorithm is right (I can try to prove it later) and that the current code has problems.

https://gist.github.com/sleepsort/708de5484d8484471ed33c4230790d39

sleepsort avatar Oct 02 '16 19:10 sleepsort

A simple counter-example is an array of the same number, [0.1] * 128: the algorithm will return index 1 instead of index 0, which is surely wrong.

That doesn't satisfy the invariant I mentioned in the comment. What is the invariant of your algorithm?

ot avatar Oct 02 '16 19:10 ot

That doesn't satisfy the invariant I mentioned in the comment

Ah, I see; sorry, I didn't understand the term "invariant" previously. I don't know whether, on real data, we'll have the same mean_err for some ranges. How about this?

[4, 3] target = 4

Just in theory, it is possible that we miss a better compression by returning index 1 instead of index 0. And this example breaks the invariant as well.

I might need some time to figure out the invariant in my algorithm... My intention is to make sure that error(low) > target_error never happens, unless all the numbers in the array have error(i) > target_error, which already meets the exit condition. Can that be considered an invariant?

sleepsort avatar Oct 02 '16 19:10 sleepsort

That still doesn't satisfy the invariant (in this case, that error(low) > target). I'm not sure whether you're familiar with standard algorithm correctness proofs; see for example https://en.wikipedia.org/wiki/Loop_invariant.

In the algorithm as it is now, we have:

  • Precondition: error(high) <= target < error(low)
  • Invariant: same as precondition
  • Postcondition: same as precondition, plus high - low = 1

From the postcondition you can derive that low is the largest index with error > target, and high is the smallest index with error <= target. So if you want the best compression such that the error is <= target, just use high. So why do you want to change the binary search implementation?
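
The scheme described here can be sketched as follows (a hypothetical Python sketch, assuming error() is non-increasing in the index; this is not the project's actual code):

```python
def search(error, low, high, target):
    """Binary search maintaining error(high) <= target < error(low)."""
    assert error(high) <= target < error(low)  # precondition
    while high - low > 1:
        mid = (low + high) // 2
        if error(mid) <= target:
            high = mid  # keeps invariant: error(high) <= target
        else:
            low = mid   # keeps invariant: target < error(low)
    # postcondition: high - low == 1, so high is the smallest index
    # with error(high) <= target
    return high

errors = [9, 7, 5, 3, 1]  # example non-increasing error per index
best = search(lambda i: errors[i], 0, len(errors) - 1, target=4)  # best == 3
```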

ot avatar Oct 02 '16 20:10 ot

Thanks for the link, I'll check it later.

Yes, your algorithm is right if the invariant holds. My point is that we cannot always assert that real data meets this invariant; consider edge cases such as error(i) == error(j), or error(low) == target. The code is vulnerable to those cases, which might happen in production.

And my initial intention was to make this binary search more general.

sleepsort avatar Oct 02 '16 20:10 sleepsort

For example edge cases such as error(i) == error(j)

This is not an edge case, and the algorithm still works. About the preconditions, you can check them before running the search and decide what to do if they are not met: in this case it means that the initial range is not wide enough and we should just stop the algorithm, rather than trying to return an endpoint of the range. I prefer using an algorithm that has obvious guarantees. What are the guarantees of your version? Can you prove them?
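
Checking the preconditions up front and deciding what to do when they fail might look like this (a hedged sketch with invented names, not the repository's code):

```python
def search_checked(error, low, high, target):
    # Decide up front what to do when the precondition fails, instead
    # of blindly returning an endpoint of the range.
    if error(low) <= target:
        return low  # the whole range already meets the target
    if error(high) > target:
        raise ValueError("no index in [low, high] meets the target error")
    while high - low > 1:
        mid = (low + high) // 2
        if error(mid) <= target:
            high = mid
        else:
            low = mid
    return high

errors = [9, 7, 5, 3, 1]
best = search_checked(lambda i: errors[i], 0, len(errors) - 1, target=4)
```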

ot avatar Oct 02 '16 20:10 ot

For example edge cases such as error(i) == error(j) This is not an edge case, and the algorithm still works.

Oh, to clarify: I meant it breaks the precondition rather than the invariant. Since the precondition error(high) <= target < error(low) implies error(high) < error(low), the case error(high) == error(low) is the so-called edge case. Anyway, it can be pre-checked.


And I understand the difference in our preferences : ). My preference is for an implementation that handles all possible corner cases (my understanding of "general") with less code.

As I understand it, to make sure your algorithm is error-immune, the implementation would look like:

```
// quantize & dequantize on low to make sure `target < error(low)` [1]
// quantize & dequantize on high to make sure `error(high) <= target` [2]
while (...) {
  // quantize & dequantize on mid to find optimal solution [3]
}
// quantize & dequantize on high to get result [4]
```

And my algorithm can remove [1] and [2], and report the exception for [2] when [4] is calculated, no matter what the initial values of low & high are.


About the invariant in my algorithm, how about this?

  • Assumption: error(0) = target + 1, error(128) <= target
  • Precondition: error(low - 1) > target >= error(high)
  • Invariant: same as precondition
  • Postcondition: same as precondition, plus low = high

The first assumption is a dummy, and exception handling for the second assumption can happen after the loop, so the code looks like:

```
while (...) {
  // quantize & dequantize on mid to find optimal solution [3]
}
// quantize & dequantize on high to get result [4]
if (error(high) > target) {
  // exception handling
}
return high
```

And to explain the loop in my code:

  • every time after f moves, assert(A[f-1] > target);
  • every time after t moves, assert(A[t] <= target) (in the loop, assert(m < t), so t = m moves t);
  • outside the while loop, assert(f == t) and assert(A[t-1] > target). The only case where A[t] <= target does not hold is when t hasn't moved.
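
Put together, the variant sketched above might look like this in Python (a hypothetical sketch using the f/t names from the comment; error() is assumed non-increasing, and the failure check happens only once, after the loop):

```python
def search_last_pass(error, f, t, target):
    # Find the smallest index t with error(t) <= target; validate only
    # in the last pass instead of checking preconditions up front.
    while f < t:
        m = (f + t) // 2
        if error(m) <= target:
            t = m       # after t moves: error(t) <= target
        else:
            f = m + 1   # after f moves: error(f - 1) > target
    # here f == t; if t never moved, no index meets the target
    if error(t) > target:
        raise ValueError("every candidate has error above target")
    return t

errors = [9, 7, 5, 3, 1]
result = search_last_pass(lambda i: errors[i], 0, len(errors) - 1, target=4)
```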

sleepsort avatar Oct 02 '16 20:10 sleepsort