Theta sketch output doesn't match between Java and CPP
I ran the same simple theta sketch computation in Java and CPP, as shown below.
Java:
Union union = Union.builder().buildUnion();
union.update(1);
union.update(2);
CompactSketch compactSketch = union.getResult();
byte[] bytes = compactSketch.toByteArray();
Java output:
02 03 03 00 00 1a cc 93 02 00 00 00 00 00 80 3f 15 f9 7d cb bd 86 a1 05 c3 97 fc 12 81 70 9d 1e
CPP:
updateThetaSketch update_sketch = updateThetaSketch::builder().build();
update_sketch.update(1);
update_sketch.update(2);
auto bytes = update_sketch.compact().serialize();
CPP output:
02 03 03 00 00 1a cc 93 02 00 00 00 00 00 00 00
15 f9 7d cb bd 86 a1 05 c3 97 fc 12 81 70 9d 1e
The outputs do not match: the last 4 bytes of the first row are "00 00 80 3f" in Java but "00 00 00 00" in CPP, so the "80 3f" bytes from Java are zero in CPP.
Can anyone share why this is so?
In this case the last 4 bytes in the first row are not used (padding for alignment) and must be 0. I am not sure why Java does not fill this padding with 0. Are you using the latest code (from main branch or latest release)?
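To pin down exactly where the two dumps disagree, a small sketch like the one below can help. It uses only the bytes posted above; `differing_offsets` and `read_le_float` are hypothetical helper names, not part of either library. As a side observation, "00 00 80 3f" read as a little-endian 32-bit word is 0x3f800000, which is the IEEE 754 encoding of 1.0f. The thread does not say what Java intends to store there, so treat that as an observation about the bytes only.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Serialized two-item sketches, copied from the dumps in this thread.
const Bytes kJavaBytes = {
    0x02, 0x03, 0x03, 0x00, 0x00, 0x1a, 0xcc, 0x93,
    0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x3f,
    0x15, 0xf9, 0x7d, 0xcb, 0xbd, 0x86, 0xa1, 0x05,
    0xc3, 0x97, 0xfc, 0x12, 0x81, 0x70, 0x9d, 0x1e};
const Bytes kCppBytes = {
    0x02, 0x03, 0x03, 0x00, 0x00, 0x1a, 0xcc, 0x93,
    0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x15, 0xf9, 0x7d, 0xcb, 0xbd, 0x86, 0xa1, 0x05,
    0xc3, 0x97, 0xfc, 0x12, 0x81, 0x70, 0x9d, 0x1e};

// Hypothetical helper: returns every offset at which the two buffers differ.
// Offsets that exist in only one buffer are reported as differences too.
std::vector<std::size_t> differing_offsets(const Bytes& a, const Bytes& b) {
  std::vector<std::size_t> offsets;
  const std::size_t longest = std::max(a.size(), b.size());
  for (std::size_t i = 0; i < longest; ++i) {
    const bool in_both = i < a.size() && i < b.size();
    if (!in_both || a[i] != b[i]) offsets.push_back(i);
  }
  return offsets;
}

// Hypothetical helper: reads 4 bytes at `offset` as a little-endian 32-bit
// word and reinterprets the bit pattern as a float.
float read_le_float(const Bytes& bytes, std::size_t offset) {
  static_assert(sizeof(float) == 4 && std::numeric_limits<float>::is_iec559,
                "this sketch assumes 32-bit IEEE 754 floats");
  std::uint32_t bits = 0;
  for (std::size_t i = 0; i < 4; ++i) {
    bits |= static_cast<std::uint32_t>(bytes.at(offset + i)) << (8 * i);
  }
  float value = 0.0f;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}
```

Calling `differing_offsets(kJavaBytes, kCppBytes)` reports only offsets 14 and 15, which fall inside the 4-byte region described above as unused padding.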
@AlexanderSaydakov Thanks. Yes, we are using version 5.0.1 on the Java side and the latest main branch on the CPP side.
Also, if I update the sketch with just a single value, like
update_sketch.update(1)
the outputs from Java and CPP are
Java:
01 03 03 00 00 3a cc 93 15 f9 7d cb bd 86 a1 05
and
CPP:
01 03 03 00 00 1a cc 93 15 f9 7d cb bd 86 a1 05
The flag byte is 3a in Java and 1a in CPP.
In this case Java sets one more flag "single item", but that can be derived from preamble_longs=1 and !is_empty.
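The flag difference can be checked with a little bit arithmetic. This is only a sketch: the 0x20 mask comes straight from 0x3a XOR 0x1a in the dumps above, while the position of the "empty" flag (0x04 here) is an assumption not stated in this thread, and the function names are hypothetical.

```cpp
#include <cstdint>

// The one bit that Java sets and CPP does not: 0x3a ^ 0x1a == 0x20.
constexpr std::uint8_t kSingleItemFlag = 0x3a ^ 0x1a;

// ASSUMPTION: the "empty" flag is bit 2 (0x04). The thread does not state
// this, so verify it against the library sources before relying on it.
constexpr std::uint8_t kEmptyFlag = 0x04;

// True when the explicit "single item" bit is present in the flags byte.
constexpr bool has_single_item_flag(std::uint8_t flags) {
  return (flags & kSingleItemFlag) != 0;
}

// Derives "single item" without the explicit bit, as described above:
// preamble_longs == 1 and the sketch is not empty.
constexpr bool is_single_item(std::uint8_t preamble_longs, std::uint8_t flags) {
  return preamble_longs == 1 && (flags & kEmptyFlag) == 0;
}
```

With the values from the dumps, `is_single_item(1, 0x3a)` and `is_single_item(1, 0x1a)` both hold, even though only the Java byte carries the explicit bit.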
On the one hand, these differences are harmless and should not affect anything unless you need a very strict test that compares bytes. On the other hand, we may want to align our implementations and make cross-platform compatibility checks more strict. Currently we test that we can read serialized sketches from other languages and assert basic things such as the same theta and the same hash values. We may want to serialize and compare the resulting bytes instead.
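Until the implementations are aligned, a byte-level test could mask the two known harmless differences before comparing. The sketch below is one possible way to do that, not something either library provides. `normalize_for_comparison` and `same_after_normalization` are hypothetical names, and the offsets (preamble_longs in byte 0, flags in byte 5, padding in bytes 12 to 15 when preamble_longs is at least 2) are assumptions read off the dumps in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: returns a copy of a serialized compact theta sketch
// with the two differences discussed in this thread masked out.
// ASSUMPTIONS (taken from the dumps above, not from a format spec):
//   byte 0       holds preamble_longs
//   byte 5       holds the flags, with 0x20 as the "single item" bit
//   bytes 12..15 are unused padding when preamble_longs >= 2
Bytes normalize_for_comparison(Bytes bytes) {
  constexpr std::size_t kFlagsOffset = 5;
  constexpr std::uint8_t kSingleItemFlag = 0x20;
  constexpr std::size_t kPaddingBegin = 12;
  constexpr std::size_t kPaddingEnd = 16;

  // Clear the optional "single item" bit.
  if (bytes.size() > kFlagsOffset) {
    bytes[kFlagsOffset] &= static_cast<std::uint8_t>(~kSingleItemFlag);
  }
  // Zero the padding bytes when the second preamble long is present.
  if (!bytes.empty() && bytes[0] >= 2 && bytes.size() >= kPaddingEnd) {
    for (std::size_t i = kPaddingBegin; i < kPaddingEnd; ++i) bytes[i] = 0;
  }
  return bytes;
}

// True when the two images are identical once the known differences are masked.
bool same_after_normalization(const Bytes& a, const Bytes& b) {
  return normalize_for_comparison(a) == normalize_for_comparison(b);
}
```

Any other difference, for example in a hash value, still makes the comparison fail, so the check stays stricter than comparing theta and hash values alone.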
Thank you @AlexanderSaydakov. This info helps.