Incorrect result?
Hi guys, consider the following program:
input relation Z(a:string)
relation PQRI(a:string)
relation PLEY(a:string)
relation NFUV(a:string)
output relation OUT(a:string)
PQRI(v) :- Z(v), Z(nbj).
PLEY(o) :- PQRI(x), Z(o), Z(x).
NFUV(q) :- Z(fym), PLEY(q).
OUT(t) :- NFUV(ssz), PLEY(arv), PLEY(t).
If I run the program (after compiling in release mode), I get:
$ ./rules_orig_cli < facts.dat
OUT{.a = "3u37ezDdBg"}
OUT{.a = "EbgMd2DV3K"}
OUT{.a = "uwsUc"}
OUT{.a = "zDdBgEbgMd"}
Now if I just add PQRI(z), PQRI(x) to the rule for PLEY, I get a new program:
input relation Z(a:string)
relation PQRI(a:string)
relation PLEY(a:string)
relation NFUV(a:string)
output relation OUT(a:string)
PQRI(v) :- Z(v), Z(nbj).
PLEY(o) :- PQRI(x), Z(o), Z(x), PQRI(z), PQRI(x). // modified rule
NFUV(q) :- Z(fym), PLEY(q).
OUT(t) :- NFUV(ssz), PLEY(arv), PLEY(t).
After recompiling and running it again, I now get an empty result.
$ ./rules_orig_cli < facts.dat
However, according to my understanding, this modification should not change the result of the computation. Right? Or am I misunderstanding something?
$ cat facts.dat
start;
insert Z("3u37ezDdBg"),
insert Z("uwsUc"),
insert Z("zDdBgEbgMd"),
insert Z("EbgMd2DV3K");
commit;
dump OUT;
I am running this on ddlog version: 5b1c9c1a70992032d271934f2652dcf6952b390d
Operating System: Debian GNU/Linux 10 (buster)
Kernel: Linux 4.19.75.1.amd64-smp
Architecture: x86-64
Yes, looks like a bug, thanks for reporting. I will investigate.
So the good news is, this is a known... feature, yes, definitely feature. But clearly we are not handling this situation well.
Internally, differential dataflow stores all intermediate relations as multisets, where the multiplicity of each element is the number of times it was derived. We currently use 32-bit weights to reduce the memory footprint of the program. This means that we cannot represent facts with multiplicities > 2^31. This has so far never caused issues in real programs. Although your example looks very simple, it repeatedly computes Cartesian products of relations containing records with increasing multiplicities: PQRI contains 4 derivations of each record, PLEY 1024, NFUV 4096, and OUT 2^36 (I think), at which point we get an integer overflow. Unfortunately, instead of crashing, DDlog returns an incorrect result.
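The derivation counts above can be reproduced with a bit of arithmetic. The sketch below is illustrative only (it mirrors the multiset semantics by hand, it is not how DDlog computes weights internally): the multiplicity of a derived record is the sum, over all variable bindings of the rule body, of the product of the body atoms' multiplicities.

```python
# Multiplicities for the *modified* program, with 4 facts in Z (each weight 1).
n = 4                     # |Z|
z = 1                     # weight of each Z fact

# PQRI(v) :- Z(v), Z(nbj): one derivation per binding of nbj
pqri = z * n * z          # 4

# PLEY(o) :- PQRI(x), Z(o), Z(x), PQRI(z), PQRI(x):
# sum over x of pqri^2 * z, times sum over z of pqri
pley = z * (n * pqri**2 * z) * (n * pqri)   # 64 * 16 = 1024

# NFUV(q) :- Z(fym), PLEY(q): sum over fym
nfuv = (n * z) * pley     # 4096

# OUT(t) :- NFUV(ssz), PLEY(arv), PLEY(t): sums over ssz and arv
out = (n * nfuv) * (n * pley) * pley        # 2^14 * 2^12 * 2^10 = 2^36

print(pqri, pley, nfuv, out)   # 4 1024 4096 68719476736
print(out > 2**31 - 1)         # True: does not fit in a 32-bit signed weight
```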
There are several things we could do to improve the handling of such situations:
- Better error reporting
- Use 64-bit weights, e.g., we may support this as a compiler switch, similar to how DDlog already allows the user to enable 32-bit instead of 16-bit nested timestamps.
- Convert multisets to sets internally using the `distinct` operator. This is how DDlog used to work in the past, but it is a very expensive solution in terms of speed and memory; therefore, at the moment we only apply `distinct` to output relations (e.g., try making one of the intermediate relations in your program an output relation and the program will work correctly), recursive relations, and antijoins (which is a requirement of differential dataflow).
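To see why `distinct` tames the weights, here is a rough sketch (plain arithmetic, not DDlog internals) of what happens if PLEY alone is collapsed to set semantics, i.e., every visible record gets weight 1 before feeding downstream rules. Numbers follow the modified program with 4 facts in Z.

```python
def distinct(m):
    """Collapse a multiset weight to set semantics: present (1) or absent (0)."""
    return 1 if m > 0 else 0

n = 4                                      # |Z|
pqri = n                                   # 4 derivations per PQRI record
pley = (n * pqri**2) * (n * pqri)          # 1024 without distinct
pley_d = distinct(pley)                    # 1 if PLEY is distinct-ed

nfuv = n * pley_d                          # 4: NFUV remains a multiset
out = (n * nfuv) * (n * pley_d) * pley_d   # 16 * 4 * 1

print(out)   # 64 -- comfortably within the 32-bit weight range
```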
Hope this makes sense.
Also I'm curious how you came up with this example. Is this a minimal failing example derived from a realistic program or are you fuzz testing DDlog (that'd be a cool project :) )?
Hi Leonid, this is very interesting, thanks for the detailed explanation. Yes, a crash, a warning, or better error reporting in such situations would indeed be much more helpful than returning an incorrect result. Should we keep this issue open for now or should we close it? Your hunch is right: the example was indeed generated by a random test case generator :)
Should we keep this issue open for now or should we close it?
Let's keep it open.
A crash is preferable to a silent incorrect result. The ideal solution would compute using multisets but detect overflow and switch to using distinct. I am not sure whether such a dataflow graph can be expressed in DD. Another possible workaround is some kind of annotation that the user could put on a relation to indicate that it should be implemented as a set (semantically equivalent to making it an output relation). But for that we should make it easier for the user to understand when overflow occurs.
I wonder whether some kind of static analysis could be used to estimate the number of derivations.
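One very coarse static estimate, sketched below under my own assumptions (not an existing DDlog analysis): a single head derivation can combine at most the full weight of each body relation, so the product of the body relations' total weights bounds the head multiplicity. Applied to the OUT rule of the modified program, the bound already exceeds the i32 range, so the analysis would flag that rule.

```python
def head_weight_bound(body_totals):
    """Upper bound on the multiplicity of any head record: the product of the
    total (summed) weights of the body relations. Coarse, but cheap."""
    bound = 1
    for total in body_totals:
        bound *= total
    return bound

# Modified program, 4 facts in Z:
pley_total = 4 * 1024    # 4 records x 1024 derivations each
nfuv_total = 4 * 4096

# OUT(t) :- NFUV(ssz), PLEY(arv), PLEY(t)
out_bound = head_weight_bound([nfuv_total, pley_total, pley_total])
print(out_bound > 2**31 - 1)   # True: this rule is at risk of overflow
```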
Resolved via #1080
Crashing is not really a solution. More work is needed.