ssdeep icon indicating copy to clipboard operation
ssdeep copied to clipboard

Is it inconsistent with the C version implemented by the author?

Open BingqingXue1 opened this issue 1 year ago • 5 comments

When we compared your code with the author's C code, we found that the calculation results were inconsistent. We found that a function was not implemented. Is this a bug?

static int has_common_substring(const char *s1, size_t s1len, const char *s2, size_t s2len)
{
  size_t i, j;
  uint32_t hashes[SPAMSUM_LENGTH - (ROLLING_WINDOW - 1)];

  if (s1len < ROLLING_WINDOW)
    return 0;
  if (s2len < ROLLING_WINDOW)
    return 0;

  // there are many possible algorithms for common substring
  // detection. In this case I am re-using the rolling hash code
  // to act as a filter for possible substring matches

  memset(hashes, 0, sizeof(hashes));

  // first compute the windowed rolling hash at each offset in
  // the first string
  struct roll_state state;
  roll_init (&state);

  for (i = 0; i < ROLLING_WINDOW - 1; i++)
    roll_hash(&state, (unsigned char)s1[i]);
  for (i = ROLLING_WINDOW - 1; i < s1len; i++)
  {
    roll_hash(&state, (unsigned char)s1[i]);
    hashes[i - (ROLLING_WINDOW - 1)] = roll_sum(&state);
  }
  s1len -= (ROLLING_WINDOW - 1);

  roll_init(&state);

  // now for each offset in the second string compute the
  // rolling hash and compare it to all of the rolling hashes
  // for the first string. If one matches then we have a
  // candidate substring match. We then confirm that match with
  // a direct string comparison */
  for (j = 0; j < ROLLING_WINDOW - 1; j++)
    roll_hash(&state, (unsigned char)s2[j]);
  for (j = 0; j < s2len - (ROLLING_WINDOW - 1); j++)
  {
    roll_hash(&state, (unsigned char)s2[j + (ROLLING_WINDOW - 1)]);
    uint32_t h = roll_sum(&state);
    for (i = 0; i < s1len; i++)
    {
      // confirm the match after checking potential match
      if (hashes[i] == h && !memcmp(s1 + i, s2 + j, ROLLING_WINDOW))
	  return 1;
    }
  }

  return 0;
}

BingqingXue1 avatar Apr 07 '24 06:04 BingqingXue1

If you think this function needs to be implemented, I can implement it and submit a PR

BingqingXue1 avatar Apr 07 '24 08:04 BingqingXue1

Do you have an example of a hash mismatch?

glaslos avatar Apr 07 '24 13:04 glaslos

The function you are mentioning is only used when comparing two hashes, so it will not have an impact on the calculated hash.

glaslos avatar Apr 07 '24 14:04 glaslos

Yes, It only affects the distance of ssdeep hash. For example , the one hash is 24:YDVLfvBD0D1b8Hlapg2MbMdI6hPZ3QCw4qat:YDbM8HlOUbMu6HACwVat. And the other is 24:YDVLfzQM/QEzpg7RMdlgx6h4Z3QCw4qat:YDmM/QO6MI6uACwVat. We calculate the score of ssdeep to be 65. But the value we calculated using the author's C program is 60. The reason is that go does not implement this function.

BingqingXue1 avatar Apr 08 '24 06:04 BingqingXue1

Then this is definitely missing. Thank you for raising it! You can either just add a test that fails for now or implement the extra condition. Otherwise I'll add it later this week.

glaslos avatar Apr 08 '24 06:04 glaslos

No problem. I'll raise PR tomorrow or the day after tomorrow.

BingqingXue1 avatar Apr 08 '24 10:04 BingqingXue1

Here is PR https://github.com/glaslos/ssdeep/pull/34

BingqingXue1 avatar Apr 09 '24 07:04 BingqingXue1

Awesome, thanks a lot! I will prepare a new release later today!

glaslos avatar Apr 09 '24 11:04 glaslos

Released 0.4.0 with the fix by @BingqingXue1 Thank you for your contribution!

glaslos avatar Apr 15 '24 15:04 glaslos