dictdiffer icon indicating copy to clipboard operation
dictdiffer copied to clipboard

Wrong diff in case of removing a non last index of an array

Open Glignos opened this issue 8 years ago • 4 comments

Suppose we have an array

old = ['a','b','c']
new = ['b','c']

# In this case I figured from the documentation that you should expect
# the diff to be ('remove','',[(0,'a')])

# but instead I'm getting:
[('change', [0], ('a', 'b')), ('change', [1], ('b', 'c')), ('remove', '', [(2, 'c')])]

Is that an error or did I misinterpret the usage of the library?

Glignos avatar Jan 08 '18 13:01 Glignos

@Glignos the current algorithm for comparing lists is very simple. In your case you would get a better result if you use sets instead of lists.

You can propose a change of following lines https://github.com/inveniosoftware/dictdiffer/blob/efc1b73742481b57a12833a2b0dfb3b32262c304/dictdiffer/init.py#L159-L161

jirikuncar avatar Jan 08 '18 14:01 jirikuncar

I would propose to use difflib.SequenceMatcher that could do the job quite well.

jirikuncar avatar Jan 08 '18 14:01 jirikuncar

I'm having the same issue, except with more complex data (dictionaries nested in dictionaries). I obviously have no idea what kind of fix this would be, but would love to see the functionality!

lukeolson13 avatar May 09 '18 21:05 lukeolson13

Hey everyone! I found this thread when looking for a solution to calculate diffs between JSON documents. My schema contains a large list of objects at the top of the tree, so the subject issue is a deal breaker. I've prototyped my solution based on some of dictdiffer's code and on difflib.SequenceMatcher, as proposed by @jirikuncar.

The main difficulty is in patch: I store all indices as they appear in the source array, and as I perform the patching I need to keep track of accumulating offset (e.g. if I delete a range, all subsequent indices need to be reduced by the length of deleted segment). Also, I don't need to invert diffs, so I didn't implement this functionality.

Here's the link to my code and here's a little demo. My code uses a different format (structured dicts instead of tuples) but the idea is the same:

a
['a', 'b', 'c', 'y', 'e', 'f', 'x']
b
['b', 'c', 'x', 'e', 'f']

diff
{'path': [], 'action': 'remove_range', 'start': 0, 'length': 1}
{'path': [3], 'action': 'change', 'new': 'x'}
{'path': [], 'action': 'remove_range', 'start': 6, 'length': 1}

Works with documents as well

a
['a', 'b', 'c', {'nested': [{'value': 'y', 'other': 123}]}, 'e', 'f']
b
['extra', 'a', 'b', 'c', {'nested': [{'value': 'x'}]}, 'e', 'f']

diff
{'path': [], 'action': 'add_range', 'start': 0, 'values': ['extra']}
{'path': [3, 'nested', 0, 'value'], 'action': 'change', 'new': 'x'}
{'path': [3, 'nested', 0], 'action': 'remove', 'keys': ['other']}

One difference is that my diffs are not reversible (but I don't think it'll be too hard to add this in case I need it).

nj-vs-vh avatar Jan 25 '24 17:01 nj-vs-vh