HyperMinHash-java icon indicating copy to clipboard operation
HyperMinHash-java copied to clipboard

support custom Q or R or P of BetaMinhash

Open yuhongye opened this issue 5 years ago • 2 comments

Is there any plan to support custom Q or R or P of BetaMinhash?

yuhongye avatar Feb 24 '20 12:02 yuhongye

Hi @yuhongye

There are currently no plans to support custom Q/R/P values for BetaMinhash. For that usecase, we recommend using HyperMinHash. Would you mind sharing a little bit about any usecases you have for which you'd like this feature?

Best, Shrif

sherifnada avatar Feb 24 '20 17:02 sherifnada

We want to take the intersection of multiple sets and be as accurate as possible, so we need a larger p. I tested the accuracy of HyperMinHash and BetaMinHash in calculating the jaccard index, and found that BetaMinHash is better in the case p=14, q=6, r=10. In my knowledge, the accuracy of these two algorithms should be the same, so I went to the source code to find the difference between them, and found four different implementations:

  1. calculate the position of the left-most 1-bit, BetaMinhash code is:
private static short getLeftmostOneBitPosition(boolean[] bits, int p, int q) {
    int _2toTheQ = (1 << q);

    // Notice: I think offset should be p, not p + 1, becaulse index start at 0
    int offset = p + 1; 
    for (int i = offset; i < _2toTheQ + offset; i++) {
      if (bits[i]) {
        return (short) (i + 1 - offset);
      }
    }
    return (short) (_2toTheQ + 1);
  }

and hyperminhash is

   final long zeroSearchSpace = (hllHash << p) | (long) (1 << (p - 1));
   final int leftmostOnePosition = Long.numberOfLeadingZeros(zeroSearchSpace) + 1;

I think BetaMinhash has a bug, I marked it in the comments of the code.

  1. calculate the r bits, BetaMinhash get the rightmost r bit of hash[1], and HyperMinHash get the leftmost r bits of hash[1]

  2. when calculate c in the algorithm 2.1.4 in the paper, BetaMinhash compare the whole register, but HyperMinhash only compare the mantissa.

// BetaMinhash
 for (BetaMinHash sketch : sketches) {
   itemInIntersection = itemInIntersection &&
      firstSketch.registers[i] == sketch.registers[i];
}

// HyperMinhash
for (HyperMinHash sketch : sketches) {
  itemInIntersection = itemInIntersection &&
      firstSketch.registers.getMantissaAtRegister(i) == sketch.registers.getMantissaAtRegister(i);
}
  1. BetaMinhash use the expectedCollision algorithm in the paper when calculate similarity, but HyperMinHash does not.

I want to know the impact of these differences, Thank you! (Hope you can understand my english expression.)

yuhongye avatar Feb 25 '20 09:02 yuhongye