Theta sketch output doesn't match between Java and CPP

Open nmahadevuni opened this issue 6 months ago • 5 comments

I ran the same simple theta sketch computation in Java and in CPP. The Java code:

Union union = Union.builder().buildUnion();
union.update(1);
union.update(2);
// Take the compact result once and serialize that same object.
CompactSketch compactSketch = union.getResult();
byte[] bytes = compactSketch.toByteArray();

Java output:

02 03 03 00 00 1a cc 93 02 00 00 00 00 00 80 3f 15 f9 7d cb bd 86 a1 05 c3 97 fc 12 81 70 9d 1e

CPP:

updateThetaSketch update_sketch = updateThetaSketch::builder().build();
update_sketch.update(1);
update_sketch.update(2);
auto bytes = update_sketch.compact().serialize();

CPP output:

02 03 03 00 00 1a cc 93 02 00 00 00 00 00 00 00
15 f9 7d cb bd 86 a1 05 c3 97 fc 12 81 70 9d 1e

The outputs do not match. Bytes 12 to 15 (the last four bytes of the first 16-byte row) are "00 00 80 3f" in Java but "00 00 00 00" in CPP, so the "80 3f" is missing from the CPP output.

Can anyone share why this is so?

nmahadevuni, Jul 01 '25 17:07

In this case the last 4 bytes in the first row are not used (padding for alignment) and must be 0. I am not sure why Java does not fill this padding with 0. Are you using the latest code (from main branch or latest release)?

AlexanderSaydakov, Jul 01 '25 21:07
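
To make the padding difference concrete, here is a minimal sketch that diffs the two dumps from the original post and decodes the four bytes in question. It uses only the standard library. The helper names `differing_offsets` and `read_float_le` are made up for this illustration and are not part of datasketches. Reading the bytes as a little-endian float is just one way to look at them; the thread does not say what the Java code intends to store there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Serialized images copied from the two-item dumps in this thread.
const Bytes kJavaTwoItems = {
    0x02, 0x03, 0x03, 0x00, 0x00, 0x1a, 0xcc, 0x93,
    0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x3f,
    0x15, 0xf9, 0x7d, 0xcb, 0xbd, 0x86, 0xa1, 0x05,
    0xc3, 0x97, 0xfc, 0x12, 0x81, 0x70, 0x9d, 0x1e};
const Bytes kCppTwoItems = {
    0x02, 0x03, 0x03, 0x00, 0x00, 0x1a, 0xcc, 0x93,
    0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x15, 0xf9, 0x7d, 0xcb, 0xbd, 0x86, 0xa1, 0x05,
    0xc3, 0x97, 0xfc, 0x12, 0x81, 0x70, 0x9d, 0x1e};

// Hypothetical helper: offsets at which two equally sized images differ.
std::vector<std::size_t> differing_offsets(const Bytes& a, const Bytes& b) {
  assert(a.size() == b.size());
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] != b[i]) offsets.push_back(i);
  }
  return offsets;
}

// Hypothetical helper: reads 4 bytes at `offset` as a little-endian
// IEEE-754 single-precision float.
float read_float_le(const Bytes& bytes, std::size_t offset) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");
  assert(offset + 4 <= bytes.size());
  std::uint32_t bits = 0;
  for (int i = 3; i >= 0; --i) {
    bits = (bits << 8) | bytes[offset + static_cast<std::size_t>(i)];
  }
  float value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}
```

With these dumps, `differing_offsets(kJavaTwoItems, kCppTwoItems)` reports offsets 14 and 15 only, and `read_float_le(kJavaTwoItems, 12)` yields 1.0 (bit pattern 0x3f800000), while the CPP image yields 0.0 at the same position. Everything outside bytes 12 to 15, including both hash values, is identical.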

@AlexanderSaydakov Thanks. Yes. We are using version 5.0.1 of the Java library and the latest code from the main branch for CPP.

nmahadevuni, Jul 02 '25 11:07

Also, if I update the sketch with only a single value, like

update_sketch.update(1)

the outputs from Java and CPP are:

Java:

01 03 03 00 00 3a cc 93 15 f9 7d cb bd 86 a1 05

and

CPP:

01 03 03 00 00 1a cc 93 15 f9 7d cb bd 86 a1 05

The flags byte (byte 5) is 3a in Java and 1a in CPP.

nmahadevuni, Jul 02 '25 11:07

In this case Java sets one more flag "single item", but that can be derived from preamble_longs=1 and !is_empty.
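
The derivation can be sketched on the raw bytes as follows. The two dumps differ only in bit 5 of the flags byte (0x3a versus 0x1a), which matches the "single item" flag. The layout here is an assumption, not something taken from the library headers: preamble longs in the low 6 bits of byte 0, flags in byte 5, and the empty flag at mask 0x04. All helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Assumed layout (not taken from the library headers).
constexpr std::size_t kPreambleLongsByte = 0;
constexpr std::size_t kFlagsByte = 5;
constexpr std::uint8_t kPreambleLongsMask = 0x3f;  // assumption: low 6 bits
constexpr std::uint8_t kEmptyFlag = 0x04;          // assumption: bit 2
constexpr std::uint8_t kSingleItemFlag = 0x20;     // bit 5, equals 0x3a ^ 0x1a

// Single-value images copied from the dumps in this thread.
const Bytes kJavaSingle = {0x01, 0x03, 0x03, 0x00, 0x00, 0x3a, 0xcc, 0x93,
                           0x15, 0xf9, 0x7d, 0xcb, 0xbd, 0x86, 0xa1, 0x05};
const Bytes kCppSingle = {0x01, 0x03, 0x03, 0x00, 0x00, 0x1a, 0xcc, 0x93,
                          0x15, 0xf9, 0x7d, 0xcb, 0xbd, 0x86, 0xa1, 0x05};

// True if the image carries the explicit single-item flag.
bool has_single_item_flag(const Bytes& image) {
  assert(image.size() > kFlagsByte);
  return (image[kFlagsByte] & kSingleItemFlag) != 0;
}

// True if the image is single-item according to the rule from the reply:
// preamble_longs == 1 and not empty. The explicit flag is ignored.
bool is_single_item_derived(const Bytes& image) {
  assert(image.size() > kFlagsByte);
  const unsigned preamble_longs = image[kPreambleLongsByte] & kPreambleLongsMask;
  const bool is_empty = (image[kFlagsByte] & kEmptyFlag) != 0;
  return preamble_longs == 1 && !is_empty;
}
```

Under these assumptions, `has_single_item_flag` is true for the Java image and false for the CPP image, while `is_single_item_derived` is true for both, so a reader that uses the derived rule treats the two images the same way.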

On the one hand, these differences are harmless and should not affect anything unless you want a very strict test that compares bytes. On the other hand, we may want to align our implementations and make cross-platform compatibility checks more strict. Currently we test that we can read serialized sketches from other languages and assert basic things such as the same theta and the same hash values. We may want to serialize and compare the resulting bytes instead.

AlexanderSaydakov, Jul 02 '25 16:07
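
Until the implementations are aligned, a byte-level test could mask the two differences seen in this thread before comparing. The following is a rough sketch of such a test helper, not anything that exists in datasketches. It assumes the flags are in byte 5, the single-item flag is 0x20, the preamble longs are in the low 6 bits of byte 0, and bytes 12 to 15 are unused whenever there are at least two preamble longs. The names `normalize_image` and `equivalent_images` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Assumed layout (see the lead-in); none of this comes from the library.
constexpr std::size_t kFlagsByte = 5;
constexpr std::uint8_t kPreambleLongsMask = 0x3f;
constexpr std::uint8_t kSingleItemFlag = 0x20;
constexpr std::size_t kPaddingBegin = 12;
constexpr std::size_t kPaddingEnd = 16;

// Returns a copy of `image` with the known harmless differences removed:
// the optional single-item flag is cleared and the assumed padding is zeroed.
Bytes normalize_image(Bytes image) {
  if (image.size() < 8) return image;  // too short for the assumed preamble
  image[kFlagsByte] &= static_cast<std::uint8_t>(~kSingleItemFlag);
  const unsigned preamble_longs = image[0] & kPreambleLongsMask;
  if (preamble_longs >= 2 && image.size() >= kPaddingEnd) {
    std::fill(image.begin() + kPaddingBegin, image.begin() + kPaddingEnd,
              std::uint8_t{0});
  }
  return image;
}

// Byte-for-byte comparison after normalization.
bool equivalent_images(const Bytes& a, const Bytes& b) {
  return normalize_image(a) == normalize_image(b);
}
```

Applied to the dumps above, the Java and CPP images compare equal in both the one-value and the two-value case, while a change to any hash byte still makes the comparison fail. The trade-off is that the helper hides exactly the differences a stricter cross-language check would be meant to catch, so it only makes sense as a stopgap.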

Thank you @AlexanderSaydakov. This info helps.

nmahadevuni, Jul 03 '25 05:07