Diffing array is very inefficient if deletion/addition occurs anywhere but at the end of the array
Often an array will have an element inserted or deleted from some index within the array. As it stands, deepDiff seems to treat such an event as an insertion or deletion of the final member of the array, and an edit of all the other members. Example:
When a "c" is inserted into the first array: var a = [h, a, p, p, y] var b = [c, h, a, p, p, y]
diff(a,b) will output a bunch of edited values for the array values 0-4 and say that the y has been inserted.
Ideally, for memory savings at least, we could use some sort of fuzzy alignment of the objects being held in the two arrays to find the best match-up of indices and then have a way of notating insertions and deletions.
Do you think this is a feature you could see building in the future? Would your diff system be amenable to something like this or would it break too many aspects of it?
Also note I would be interested in helping build any such system.
Thanks for you time!
The short answer is that I have been unsatisfied with my initial, simplistic approach to array difference. I would love for it to be more robust.
There are consequences to deeper analysis though. Namely, it will degrade performance somewhat due to a multitude of new comparisons on arrays and their items.
I would love to see some code nonetheless because the current implementation is too weak. In the end, if the performance is poor when figuring out array insertions, moves, etc. we can make the array analysis an optional feature.
You could use something like Levenshtein distance for this.
jsondiffpatch does this with an LCS algorithm
@tnrich While it might be memory inefficient, it does mean that the actual running time of diff is faster. The resulting output might make your program run slower, of course, if processing each change record is processing intensive. The alternative is to have a slower differencing algorithm that takes multiple paths and chooses the one that produces the fewest number of changes. This is potentially much slower, tho I have no idea how much.
For me tho, that's what I want, so I'll probably end up building it. If I do I'll report back here.
So I created a differencing module that solves this issue: https://github.com/Tixit/odiff
Hey @flitbit & @tnrich - I created a benchmark suite to test multiple JSON diff frameworks head-to-head: https://github.com/justsml/json-diff-performance
It's coverage is hardly complete, but it's pretty easy to extend - main goals were testing ops/sec and output string byte length.
@fresheneesz I also recently added support for your odiff library!
Hope this helps in performance tuning and comparing various library techniques.
@justsml Very cool!
Yo @justsml thanks for a motivation to finally optimize this thing! Time to turbocharge the MVP... I'll report back soon.
Any updates?
Very cool lib, great work, but this limitation with array comparisons is forcing me to look elsewhere. I don't have the technical prowess to offer a solution, but just adding my two cents that it'd be great to have.
@mikechabot I agree with you. I also have this issue now and need to look into other libraries - only because of this issue.
@pSchaub FWIW, this is working great for me: https://github.com/benjamine/jsondiffpatch. You will need to write your own algo to consume the delta report, but his format is well documented:
https://github.com/benjamine/jsondiffpatch/blob/master/docs/deltas.md