go-diff icon indicating copy to clipboard operation
go-diff copied to clipboard

Munged text leads to incorrect diffs

Open schroederc opened this issue 2 years ago • 5 comments

https://github.com/sergi/go-diff/commit/db1b095f5e7c905e196ff6bfd56189a41aa76309 introduces a bug in its change from diffLinesToRunesMunge to diffLinesToStringsMunge. Since each line is represented by 1 or more ascii characters, it's possible for the diffing algorithm to split hashed lines incorrectly whereas before the rune indexed lines were indivisible.

For instance, DiffLinesToChars could return hashed strings such as:

"1,2,3,4,5,6,7,8,9,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,9,27,28,9,29,30,31,32,33,9,34,35,36,37,38,39,40,41"
"42,9,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,9,27,28,9,29,30,31,32,33,9,34,43,36,37,38,39,44,45,46,47"

DiffMain may then split the leading 42 such as:

[{Delete 1,2,3,} {Equal 4} {Delete ,5,6,7,8} {Insert 2} {Equal ,9,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,9,27,28,9,29,30,31,32,33,9,34,} {Insert 4} {Equal 3} {Delete 5} {Equal ,36,37,38,39,4} {Delete 0} {Insert 4} {Equal ,4} {Delete 1} {Insert 5,46,47}] 

And the resulting diff after hydration is completely wrong.

This affects users of the DiffLinesTo* APIs as well as any user that passes true for checklines in DiffMain or DiffMainRunes.

schroederc avatar Mar 29 '23 21:03 schroederc

Here's a rough test case:

func TestLineDiff(t *testing.T) {
	t.Run("Chars", func(t *testing.T) {
		before := `1
2
3
4
5
6
7
8
9
`
		after := `10
`

		dmp := diffmatchpatch.New()
		txt1, txt2, lines := dmp.DiffLinesToChars(string(before), string(after))
		diff := dmp.DiffMain(txt1, txt2, false)
		diff = dmp.DiffCharsToLines(diff, lines)

		var foundBefore, foundAfter string
		for _, d := range diff {
			switch d.Type {
			case eq:
				foundBefore += d.Text
				foundAfter += d.Text
			case del:
				foundBefore += d.Text
			case ins:
				foundAfter += d.Text
			}
		}

		if foundBefore != before {
			t.Errorf("Expected before %q; found %q", before, foundBefore)
		}
		if foundAfter != after {
			t.Errorf("Expected after %q; found %q", after, foundAfter)
		}
	})

	t.Run("Runes", func(t *testing.T) {
		before := `1
2
3
4
5
6
7
8
9
`
		after := `10
`

		dmp := diffmatchpatch.New()
		txt1, txt2, lines := dmp.DiffLinesToRunes(string(before), string(after))
		diff := dmp.DiffMainRunes(txt1, txt2, false)
		diff = dmp.DiffCharsToLines(diff, lines)

		var foundBefore, foundAfter string
		for _, d := range diff {
			switch d.Type {
			case eq:
				foundBefore += d.Text
				foundAfter += d.Text
			case del:
				foundBefore += d.Text
			case ins:
				foundAfter += d.Text
			}
		}

		if foundBefore != before {
			t.Errorf("Expected before %q; found %q", before, foundBefore)
		}
		if foundAfter != after {
			t.Errorf("Expected after %q; found %q", after, foundAfter)
		}
	})
}

schroederc avatar Mar 30 '23 15:03 schroederc

I sent out https://github.com/sergi/go-diff/pull/141 to revert the implementation to the limited, but not incorrect, rune approach. Some more fundamental changes would probably be preferable if someone has more time to work on it.

schroederc avatar Mar 31 '23 19:03 schroederc

Ran into this as well. I don't need to diff large chunks (context: https://github.com/sergi/go-diff/issues/89#issuecomment-591376325) so I downgraded to v1.1.0, which seems to be, still to this day, widely used (i.e. go-git).

ernestrc avatar Nov 16 '23 22:11 ernestrc

Is there any progress on this issue?

fcharlie avatar Jan 04 '24 03:01 fcharlie