Performance issue with regex in HtmlConverterCoreNodeRenderer
HtmlConverterCoreNodeRenderer.handleTableCell calls String.replaceAll("\\s*\n\\s*", " "), which can be quite slow on large inputs. The pattern is simple enough that it can be replaced with a hand-written loop that runs much faster.
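One plausible cause, stated as an assumption rather than something profiled against the attached file: on a long run of whitespace that contains no newline, the leading `\s*` greedily consumes the run, fails to find `\n`, backtracks, and then the whole attempt is repeated from every position in the run, so the work grows roughly quadratically with the run length. The sketch below times that case; the class name, method names, and input sizes are hypothetical.

```java
final class BacktrackingSketch {
    // The same expression the renderer passes to String.replaceAll.
    static String collapse(final String text) {
        return text.replaceAll("\\s*\n\\s*", " ");
    }

    // Times collapse() on a run of n spaces with no newline (assumed worst case).
    static long timeMillis(final int n) {
        final String input = "x" + " ".repeat(n) + "y";
        final long start = System.nanoTime();
        final String output = collapse(input);
        final long elapsed = (System.nanoTime() - start) / 1_000_000;
        if (!output.equals(input)) {
            throw new IllegalStateException("a run without a newline should be left unchanged");
        }
        return elapsed;
    }

    public static void main(final String[] args) {
        // Doubling n should roughly quadruple the time if the cost is quadratic.
        for (final int n : new int[] {2_000, 4_000, 8_000, 16_000}) {
            System.out.println(n + " spaces: " + timeMillis(n) + " ms");
        }
    }
}
```

Actual timings depend on the JVM and machine, so none are quoted here.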
To Reproduce
See attached file test.html.txt
public class LoadingTest {
    public static void main(final String[] args) throws Exception {
        final String STR = java.nio.file.Files.readString(java.nio.file.Path.of("test.html.txt"));
        final long tic = System.currentTimeMillis();
        com.diffbot.websearch.html.MarkdownNormalizer.markdown(STR);
        System.out.println("took: " + (System.currentTimeMillis() - tic));
    }
}
Expected behavior
The code takes >4000 ms to run on my laptop.
took: 4024
It should take much less time.
I replaced the regex call with an approximation that is probably acceptable for HTML:
private String replaceMultipleBlankSpace(String cellText) {
    StringBuilder result = new StringBuilder();
    boolean wasSpace = false;
    for (char c : cellText.toCharArray()) {
        if (Character.isWhitespace(c)) {
            if (!wasSpace) {
                result.append(' ');
                wasSpace = true;
            }
        } else {
            result.append(c);
            wasSpace = false;
        }
    }
    return result.toString();
}
String cellText = replaceMultipleBlankSpace(context.processTextNodes(element).trim());
With this change, the same file takes about 150 ms to process.
Here's a micro test showing the behaviour of the regular expression versus the non-regex solution:
public class T {
    private static final String[] INPUTS = {
        "Line one\nLine two",
        " Leading spaces\nTrailing spaces ",
        "Multiple spaces between words",
        "\n\nBlank lines\n\n",
        "NoSpaces"
    };
    private static final String[] OUTPUTS = {
        "Line one Line two",
        "Leading spaces Trailing spaces",
        "Multiple spaces between words",
        "Blank lines",
        "NoSpaces"
    };

    public static void main(final String[] args) {
        for (var i = 0; i < INPUTS.length; i++) {
            final var result = regex(INPUTS[i]);
            if (!result.equals(OUTPUTS[i])) {
                System.out.println("Test failed:");
                System.out.println("Input: " + INPUTS[i]);
                System.out.println("Expected: " + OUTPUTS[i]);
                System.out.println("Got: " + result);
                return;
            }
        }
        System.out.println("ORIGINAL tests passed.");
        for (var i = 0; i < INPUTS.length; i++) {
            final var result = revision(INPUTS[i]);
            if (!result.equals(OUTPUTS[i])) {
                System.out.println("REVISION test failed!");
                System.out.println("Input: " + INPUTS[i]);
                System.out.println("Expected: " + OUTPUTS[i]);
                System.out.println("Got: " + result);
                return;
            }
        }
        System.out.println("ALL tests passed.");
    }

    private static String regex(final String text) {
        return text.trim().replaceAll("\\s*\n\\s*", " ");
    }

    private static String revision(final String text) {
        final var result = new StringBuilder(text.length());
        boolean wasSpace = false;
        for (final var c : text.toCharArray()) {
            final var isSpace = Character.isWhitespace(c);
            final var toAppend = isSpace ? ' ' : c;
            if (!wasSpace || !isSpace) {
                result.append(toAppend);
            }
            wasSpace = isSpace;
        }
        return result.toString().trim();
    }
}
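I've simplified the algorithm and shown the edge case that fails.
The procedural implementation is not equivalent to the regular expression: the regex only replaces whitespace runs that contain a newline, while the loop collapses every whitespace run.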
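The mismatch can be reproduced with an input whose whitespace runs do not all contain a newline. The sketch below reuses the two methods from the micro test; the class name and the sample input are hypothetical and chosen only for illustration.

```java
final class EdgeCaseSketch {
    // Regex path from the micro test: only whitespace runs containing '\n' are replaced.
    static String regex(final String text) {
        return text.trim().replaceAll("\\s*\n\\s*", " ");
    }

    // Loop path from the micro test: every whitespace run becomes one space.
    static String revision(final String text) {
        final StringBuilder result = new StringBuilder(text.length());
        boolean wasSpace = false;
        for (final char c : text.toCharArray()) {
            final boolean isSpace = Character.isWhitespace(c);
            if (!wasSpace || !isSpace) {
                result.append(isSpace ? ' ' : c);
            }
            wasSpace = isSpace;
        }
        return result.toString().trim();
    }

    public static void main(final String[] args) {
        // Hypothetical input: two spaces, then a tab, then a newline.
        final String input = "a  b\tc\nd";
        System.out.println("regex:    [" + regex(input) + "]");
        System.out.println("revision: [" + revision(input) + "]");
    }
}
```

For this input the regex keeps the double space and the tab and only turns the newline into a space, whereas the loop rewrites all three runs.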
Note that instantiating the StringBuilder with text.length() gives a modest performance improvement for large strings.
Have you tried pre-compiling the regex as a Pattern constant instead, to see if there are any performance gains?
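For reference, a minimal sketch of what pre-compiling could look like. String.replaceAll compiles its pattern on every call, so a shared Pattern constant removes that per-call cost; it does not change how the matcher backtracks, so any gain on a single large input may be small. The class, constant, and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class PrecompiledSketch {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern NEWLINE_RUN = Pattern.compile("\\s*\n\\s*");

    // Same result as text.replaceAll("\\s*\n\\s*", " "), without recompiling the pattern.
    static String collapse(final String text) {
        return NEWLINE_RUN.matcher(text).replaceAll(" ");
    }
}
```

Usage would be a drop-in swap at the call site, for example PrecompiledSketch.collapse(cellText).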
That regex really does have the potential to perform badly (processing some Wikipedia pages was taking ~5 minutes).
I think that this variation does the same thing (collapse every run of whitespace that contains a newline, including any further newlines in the run, into a single space) much more quickly:
public static String trimWhitespaceContainingNewlines(String html) {
    final StringBuilder sb = new StringBuilder();
    boolean hasWhitespace = false;
    boolean hasNewline = false;
    int startIndex = -1;
    int stopIndex = -1;
    for (int i = 0; i < html.length(); i++) {
        final char c = html.charAt(i);
        if (c == '\n') {
            hasNewline = true;
        } else if (Character.isWhitespace(c)) {
            if (!hasWhitespace) {
                // entering a new sequence of whitespace
                hasWhitespace = true;
                startIndex = i;
                stopIndex = i + 1;
            } else {
                // already in whitespace, bump up the stop index
                stopIndex = i + 1;
            }
        } else {
            // possible end of whitespace sequence - if there was a newline replace
            // it all with a space, else we preserve whatever whitespace we were
            // considering skipping
            if (hasNewline) {
                sb.append(' ');
            } else if (hasWhitespace) {
                sb.append(html.substring(startIndex, stopIndex));
            }
            sb.append(c);
            hasWhitespace = false;
            hasNewline = false;
        }
    }
    // we might end on a whitespace character, so either trim
    // or output the collected whitespace
    if (hasNewline) {
        sb.append(' ');
    } else if (hasWhitespace) {
        sb.append(html.substring(startIndex, stopIndex));
    }
    return sb.toString();
}
Here's DeepSeek's take (there are still improvements that can be made):
public static String trimWhitespaceContainingNewlines(String html) {
    StringBuilder sb = new StringBuilder();
    boolean inWhitespace = false;
    boolean hasNewline = false;
    int whitespaceStart = 0;
    for (int i = 0; i < html.length(); i++) {
        char c = html.charAt(i);
        if (c == '\n') {
            hasNewline = true;
            if (!inWhitespace) {
                inWhitespace = true;
                whitespaceStart = i;
            }
        } else if (Character.isWhitespace(c)) {
            if (!inWhitespace) {
                inWhitespace = true;
                whitespaceStart = i;
            }
        } else {
            if (inWhitespace) {
                if (hasNewline) {
                    sb.append(' ');
                } else {
                    sb.append(html, whitespaceStart, i);
                }
                inWhitespace = false;
                hasNewline = false;
            }
            sb.append(c);
        }
    }
    // Handle trailing whitespace
    if (inWhitespace) {
        if (hasNewline) {
            sb.append(' ');
        } else {
            sb.append(html, whitespaceStart, html.length());
        }
    }
    return sb.toString();
}
Here's Claude's take:
public static String trimWhitespaceContainingNewlines(String html) {
    if (html == null || html.isEmpty()) {
        return html;
    }
    StringBuilder sb = new StringBuilder(html.length());
    int i = 0;
    while (i < html.length()) {
        char c = html.charAt(i);
        if (Character.isWhitespace(c)) {
            // Start of whitespace sequence
            int startIndex = i;
            boolean hasNewline = c == '\n';
            // Consume all consecutive whitespace
            i++;
            while (i < html.length() && Character.isWhitespace(html.charAt(i))) {
                if (html.charAt(i) == '\n') {
                    hasNewline = true;
                }
                i++;
            }
            // Replace sequence with space if it had a newline, otherwise preserve it
            if (hasNewline) {
                sb.append(' ');
            } else {
                sb.append(html, startIndex, i);
            }
        } else {
            // Non-whitespace character
            sb.append(c);
            i++;
        }
    }
    return sb.toString();
}
In both cases, there are other improvements:
- Extract the invariant html.length() into a constant.
- The value of html.charAt(i) can be made a constant.
- The String html param can be made final.
I asked it to make more optimizations and it spat out:
public static String trimWhitespaceContainingNewlines(String html) {
    if (html == null || html.isEmpty()) {
        return html;
    }
    final int length = html.length();
    StringBuilder sb = new StringBuilder(length);
    int i = 0;
    while (i < length) {
        final char c = html.charAt(i++);
        if (!Character.isWhitespace(c)) {
            // Non-whitespace character - fast path
            sb.append(c);
            continue;
        }
        // Start of whitespace sequence
        boolean hasNewline = c == '\n';
        final int startIndex = i - 1;
        // Consume all consecutive whitespace
        while (i < length) {
            final char nextChar = html.charAt(i);
            if (!Character.isWhitespace(nextChar)) {
                break;
            }
            hasNewline |= nextChar == '\n';
            i++;
        }
        // Replace sequence with space if it had a newline, otherwise preserve it
        if (hasNewline) {
            sb.append(' ');
        } else {
            sb.append(html, startIndex, i);
        }
    }
    return sb.toString();
}
That final version looks pretty good to my eye. I take minor issue with:
while (i < length) {
    final char c = html.charAt(i++);
It is more idiomatic to increment the index within the looping construct, IMO.
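As an illustration of that preference, here is one possible shape with the increment kept in the for header. It is only a sketch under that assumption, not a version posted in this thread; the class name is hypothetical, and the explicit index reset after each whitespace run is the price paid for moving the increment.

```java
final class ForLoopSketch {
    static String trimWhitespaceContainingNewlines(final String html) {
        if (html == null || html.isEmpty()) {
            return html;
        }
        final int length = html.length();
        final StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            final char c = html.charAt(i);
            if (!Character.isWhitespace(c)) {
                sb.append(c);
                continue;
            }
            // Scan to the end of this whitespace run, noting any newline.
            boolean hasNewline = c == '\n';
            int end = i + 1;
            for (; end < length; end++) {
                final char next = html.charAt(end);
                if (!Character.isWhitespace(next)) {
                    break;
                }
                hasNewline |= next == '\n';
            }
            if (hasNewline) {
                sb.append(' ');
            } else {
                sb.append(html, i, end);
            }
            // Step back one so the loop's own i++ lands on the first non-whitespace char.
            i = end - 1;
        }
        return sb.toString();
    }
}
```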
Here's one more version that looks pretty optimal:
public static String trimWhitespaceContainingNewlines(final String html) {
    if (html == null || html.isEmpty()) {
        return html;
    }
    final int length = html.length();
    final StringBuilder sb = new StringBuilder(length);
    int i = 0;
    while (i < length) {
        final char c = html.charAt(i++);
        if (Character.isWhitespace(c)) {
            boolean hasNewline = c == '\n';
            final int startIndex = i - 1;
            while (i < length) {
                final char nextChar = html.charAt(i);
                if (Character.isWhitespace(nextChar)) {
                    hasNewline |= nextChar == '\n';
                    i++;
                } else {
                    break;
                }
            }
            if (hasNewline) {
                sb.append(' ');
            } else {
                sb.append(html, startIndex, i);
            }
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}
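Since these variants are meant as drop-in replacements for the regex, a quick comparison against it is worth having. The sketch below wraps the last version in a hypothetical class and checks it against replaceAll on a few hand-picked inputs; it is not a full test suite. One caveat: by default `\s` in a Java regex matches only the ASCII whitespace characters, while Character.isWhitespace also accepts Unicode spaces such as U+2003 (em space), so the two can differ on non-ASCII whitespace.

```java
final class EquivalenceSketch {
    // Reference behaviour: the original regex, without the surrounding trim().
    static String viaRegex(final String text) {
        return text.replaceAll("\\s*\n\\s*", " ");
    }

    // The last loop version from the thread, unchanged apart from its name.
    static String viaLoop(final String html) {
        if (html == null || html.isEmpty()) {
            return html;
        }
        final int length = html.length();
        final StringBuilder sb = new StringBuilder(length);
        int i = 0;
        while (i < length) {
            final char c = html.charAt(i++);
            if (Character.isWhitespace(c)) {
                boolean hasNewline = c == '\n';
                final int startIndex = i - 1;
                while (i < length) {
                    final char nextChar = html.charAt(i);
                    if (Character.isWhitespace(nextChar)) {
                        hasNewline |= nextChar == '\n';
                        i++;
                    } else {
                        break;
                    }
                }
                if (hasNewline) {
                    sb.append(' ');
                } else {
                    sb.append(html, startIndex, i);
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        // Hand-picked samples; the last one contains an em space before the newline.
        final String[] samples = {
            "Line one\nLine two", "  a  \n\n  b  ", "a  b\tc", "\n\nBlank lines\n\n", "a\u2003\nb"
        };
        for (final String s : samples) {
            System.out.println(viaRegex(s).equals(viaLoop(s)) ? "same" : "differs");
        }
    }
}
```

Whether the em-space difference matters depends on the HTML being converted; if exact parity with the regex is required, the loop could test for the six characters `\s` matches instead of calling Character.isWhitespace.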