No linesToWords function?
I'm reading the line diff and word diff section of the wiki. It says to "make a copy of linesToChars" and call it "linesToWords". It would be great if this were built into the library already, but for now, is there a sample implementation of linesToWords in JavaScript?
Made some progress by literally doing what the docs said... copy the linesToChars function and change the indexOf. I made the following changes:
` diff_match_patch.prototype.diff_linesToWords_ = function(text1, text2) {
var lineArray = []; // e.g. lineArray[4] == 'Hello\n'
var lineHash = {}; // e.g. lineHash['Hello\n'] == 4
// '\x00' is a valid character, but various debuggers don't like it.
// So we'll insert a junk entry to avoid generating a null character.
lineArray[0] = '';
/* NEW function: like String.prototype.indexOf, but matches a regex. */
// Returns the index of the first match of re at or after position i, or -1.
function regexIndexOf(text, re, i) {
var indexInSuffix = text.slice(i).search(re);
return indexInSuffix < 0 ? indexInSuffix : indexInSuffix + i;
}
/**
* Split a text into an array of strings. Reduce the texts to a string of
* hashes where each Unicode character represents one word.
* Modifies linearray and linehash through being a closure.
* @param {string} text String to encode.
* @return {string} Encoded string.
* @private
*/
function diff_linesToWordsMunge_(text) {
var chars = '';
// Walk the text, pulling out a substring for each word.
// text.split('\n') would temporarily double our memory footprint.
// Modifying text would create many large strings to garbage collect.
var lineStart = 0;
var lineEnd = -1;
// Keeping our own length variable is faster than looking it up.
var lineArrayLength = lineArray.length;
while (lineEnd < text.length - 1) {
lineEnd = regexIndexOf(text, /\s/, lineStart); // NEW: split on whitespace instead of '\n'
if (lineEnd == -1) {
lineEnd = text.length - 1;
}
var line = text.substring(lineStart, lineEnd + 1);
if (lineHash.hasOwnProperty ? lineHash.hasOwnProperty(line) :
(lineHash[line] !== undefined)) {
chars += String.fromCharCode(lineHash[line]);
} else {
if (lineArrayLength == maxLines) {
// Bail out at 65535 because
// String.fromCharCode(65536) == String.fromCharCode(0)
line = text.substring(lineStart);
lineEnd = text.length;
}
chars += String.fromCharCode(lineArrayLength);
lineHash[line] = lineArrayLength;
lineArray[lineArrayLength++] = line;
}
lineStart = lineEnd + 1;
}
return chars;
}
// Allocate 2/3rds of the space for text1, the rest for text2.
var maxLines = 40000;
var chars1 = diff_linesToWordsMunge_(text1);
maxLines = 65535;
var chars2 = diff_linesToWordsMunge_(text2);
return {chars1: chars1, chars2: chars2, lineArray: lineArray};
};`
This correctly identifies new words. However, if two new words are side by side, they show up as ONE entry in the resulting diffs (after calling diff_main). I'd like each new word to show up as its own diff. The individual words do show up as individual elements in the lineArray... thoughts?
On further research, it looks like diff_main squashes edits of the same type together. E.g. two new words side by side become one diff. Any way of keeping them separate?
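One possible workaround, offered as a sketch rather than anything the library provides: as far as I can tell the merging happens inside diff_main itself (it runs diff_cleanupMerge before returning), so instead of calling diff_charsToLines_ on the merged result you can expand each diff yourself, one encoded character at a time. The helper name `expandDiffsPerToken` is made up for this example. It assumes each diff can be read as `diff[0]` (the operation) and `diff[1]` (the encoded text), and that `tokenArray` is the `lineArray` returned by diff_linesToWords_ above.

```javascript
// Hypothetical helper, not part of diff_match_patch.
// diffs:      result of diff_main(chars1, chars2, false), i.e. still encoded,
//             BEFORE diff_charsToLines_ has been applied.
// tokenArray: the lineArray returned by diff_linesToWords_.
// Returns a new array with one [op, word] pair per encoded character, so
// adjacent words with the same operation stay separate.
function expandDiffsPerToken(diffs, tokenArray) {
  var result = [];
  for (var i = 0; i < diffs.length; i++) {
    var op = diffs[i][0];
    var encoded = diffs[i][1];
    for (var j = 0; j < encoded.length; j++) {
      // Each character code is an index into tokenArray.
      result.push([op, tokenArray[encoded.charCodeAt(j)]]);
    }
  }
  return result;
}

// Hand-built example: 1 is an insertion, 0 is an equality.
var tokenArray = ['', 'the ', 'quick ', 'brown ', 'fox'];
var mergedDiffs = [[0, '\u0001'], [1, '\u0002\u0003'], [0, '\u0004']];
var expanded = expandDiffsPerToken(mergedDiffs, tokenArray);
// expanded is [[0,'the '],[1,'quick '],[1,'brown '],[0,'fox']]
```

Note that this also splits runs of equal words into one entry each. If you only want insertions and deletions split, check `op` and push equalities through unchanged after decoding them.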
I have the same issue in the Java code. I replaced the line of code to search by space, but it does not seem to work either.