flexmark-java

Performance issue with regex in HtmlConverterCoreNodeRenderer

Open praveen-diffbot opened this issue 1 year ago • 5 comments

HtmlConverterCoreNodeRenderer.handleTableCell calls String.replaceAll("\\s*\n\\s*", " "), which can be quite slow. The pattern is simple enough that the call can be sped up by replacing the regex with a hand-written loop.

To Reproduce

See attached file test.html.txt

public class LoadingTest {
  public static void main(final String[] args) throws Exception {
    final String STR = java.nio.file.Files.readString(java.nio.file.Path.of("test.html.txt"));
    final long tic = System.currentTimeMillis();
    com.diffbot.websearch.html.MarkdownNormalizer.markdown(STR);
    System.out.println("took: " + (System.currentTimeMillis() - tic));
  }
}

Expected behavior

The code takes more than 4000 ms to run on my laptop:

took: 4024

It should take much less time.

praveen-diffbot • Sep 17 '24 01:09

I replaced the regex call with an approximation, which is probably acceptable for HTML:

private String replaceMultipleBlankSpace(String cellText) {
    StringBuilder result = new StringBuilder();
    boolean wasSpace = false;

    for (char c : cellText.toCharArray()) {
        if (Character.isWhitespace(c)) {
            // Emit one space for each run of whitespace, whatever it contains.
            if (!wasSpace) {
                result.append(' ');
                wasSpace = true;
            }
        } else {
            result.append(c);
            wasSpace = false;
        }
    }

    return result.toString();
}

// Usage, in place of the replaceAll call:
String cellText = replaceMultipleBlankSpace(context.processTextNodes(element).trim());

With this change, the same file takes about 150 ms to process.

praveen-diffbot • Sep 17 '24 01:09

Here's a micro test showing the behaviour of the regular expression versus the non-regex solution:

public class T {
  private static final String[] INPUTS = {
    "Line one\nLine two",
    "   Leading spaces\nTrailing spaces   ",
    "Multiple  spaces between words",
    "\n\nBlank lines\n\n",
    "NoSpaces"
  };

  private static final String[] OUTPUTS = {
    "Line one Line two",
    "Leading spaces Trailing spaces",
    "Multiple  spaces between words",
    "Blank lines",
    "NoSpaces"
  };

  public static void main(final String[] args) {
    for (var i = 0; i < INPUTS.length; i++) {
      final var result = regex(INPUTS[i]);

      if (!result.equals(OUTPUTS[i])) {
        System.out.println("Test failed:");
        System.out.println("Input: " + INPUTS[i]);
        System.out.println("Expected: " + OUTPUTS[i]);
        System.out.println("Got: " + result);
        return;
      }
    }

    System.out.println("ORIGINAL tests passed.");

    for (var i = 0; i < INPUTS.length; i++) {
      final var result = revision(INPUTS[i]);

      if (!result.equals(OUTPUTS[i])) {
        System.out.println("REVISION test failed!");
        System.out.println("Input: " + INPUTS[i]);
        System.out.println("Expected: " + OUTPUTS[i]);
        System.out.println("Got: " + result);
        return;
      }
    }

    System.out.println("ALL tests passed.");
  }

  private static String regex(final String text) {
    return text.trim().replaceAll("\\s*\n\\s*", " ");
  }

  private static String revision(final String text) {
    final var result = new StringBuilder(text.length());
    boolean wasSpace = false;

    for (final var c : text.toCharArray()) {
      final var isSpace = Character.isWhitespace(c);
      final var toAppend = isSpace ? ' ' : c;

      if (!wasSpace || !isSpace) {
        result.append(toAppend);
      }

      wasSpace = isSpace;
    }

    return result.toString().trim();
  }
}

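I've simplified the algorithm and shown the edge case that fails.

The procedural implementation is not equivalent to the regular expression: it collapses every run of whitespace, whereas the regex only replaces runs that contain a newline. That is why the "Multiple  spaces between words" input fails for the revision.

Note that presizing the StringBuilder with text.length() gives a modest performance improvement for large strings.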

Have you tried pre-compiling the regex as a Pattern constant, instead, to see if there are any performance gains?
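
For reference, a rough sketch of what the pre-compiled variant could look like. The class name PrecompiledPattern, the constant NEWLINE_RUN, and the method collapse are made up for illustration and are not part of flexmark-java. String.replaceAll compiles its pattern on every call, so hoisting the pattern into a constant removes that repeated compilation. It does not change how the matcher behaves on a given input, so whether it helps here would need to be measured.

```java
import java.util.regex.Pattern;

public class PrecompiledPattern {
  // Hypothetical constant: compiled once and reused across calls.
  private static final Pattern NEWLINE_RUN = Pattern.compile("\\s*\n\\s*");

  // Same result as text.trim().replaceAll("\\s*\n\\s*", " "),
  // but without recompiling the pattern on each invocation.
  public static String collapse(final String text) {
    return NEWLINE_RUN.matcher(text.trim()).replaceAll(" ");
  }

  public static void main(final String[] args) {
    // Whitespace around the newline becomes one space;
    // the double space without a newline is left alone.
    System.out.println(collapse("  Line one \n  Line two"));
    System.out.println(collapse("Multiple  spaces between words"));
  }
}
```

Timing this against the attached test.html.txt with the same harness as in the original report would show whether compilation cost is a meaningful part of the 4000 ms.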

DarkTyger • Oct 14 '24 23:10

That regex can perform very badly: processing some Wikipedia pages was taking ~5 minutes.

I think this variation does the same thing (collapsing any run of whitespace that contains a newline, including runs with several newlines, into a single space), only much more quickly:

    public static String trimWhitespaceContainingNewlines(String html) {
        final StringBuilder sb = new StringBuilder();

        boolean hasWhitespace = false;
        boolean hasNewline = false;

        int startIndex = -1;
        int stopIndex = -1;

        for (int i = 0; i < html.length(); i++) {
            final char c = html.charAt(i);

            if (c == '\n') {
                hasNewline = true;
            } else if (Character.isWhitespace(c)) {
                if (!hasWhitespace) {
                    // entering a new sequence of whitespace
                    hasWhitespace = true;
                    startIndex = i;
                    stopIndex = i + 1;
                } else {
                    // already in whitespace, bump up the stop index
                    stopIndex = i + 1;
                }
            } else {
                // possible end of whitespace sequence - if there was a newline replace
                // it all with a space, else we preserve whatever whitespace we were
                // considering skipping
                if (hasNewline) {
                    sb.append(' ');
                } else if (hasWhitespace) {
                    sb.append(html.substring(startIndex, stopIndex));
                }
                sb.append(c);
                hasWhitespace = false;
                hasNewline = false;
            }
        }

        // we might end on a whitespace character, so either trim
        // or output the collected whitespace
        if (hasNewline) {
            sb.append(' ');
        } else if (hasWhitespace) {
            sb.append(html.substring(startIndex, stopIndex));
        }

        return sb.toString();
    }

thefoy • Mar 19 '25 19:03

Here's DeepSeek's take (there are still improvements that can be made):

public static String trimWhitespaceContainingNewlines(String html) {
    StringBuilder sb = new StringBuilder();
    boolean inWhitespace = false;
    boolean hasNewline = false;
    int whitespaceStart = 0;

    for (int i = 0; i < html.length(); i++) {
        char c = html.charAt(i);

        if (c == '\n') {
            hasNewline = true;
            if (!inWhitespace) {
                inWhitespace = true;
                whitespaceStart = i;
            }
        } else if (Character.isWhitespace(c)) {
            if (!inWhitespace) {
                inWhitespace = true;
                whitespaceStart = i;
            }
        } else {
            if (inWhitespace) {
                if (hasNewline) {
                    sb.append(' ');
                } else {
                    sb.append(html, whitespaceStart, i);
                }
                inWhitespace = false;
                hasNewline = false;
            }
            sb.append(c);
        }
    }

    // Handle trailing whitespace
    if (inWhitespace) {
        if (hasNewline) {
            sb.append(' ');
        } else {
            sb.append(html, whitespaceStart, html.length());
        }
    }

    return sb.toString();
}

Here's Claude's take:

public static String trimWhitespaceContainingNewlines(String html) {
    if (html == null || html.isEmpty()) {
        return html;
    }
    
    StringBuilder sb = new StringBuilder(html.length());
    int i = 0;
    while (i < html.length()) {
        char c = html.charAt(i);
        
        if (Character.isWhitespace(c)) {
            // Start of whitespace sequence
            int startIndex = i;
            boolean hasNewline = c == '\n';
            
            // Consume all consecutive whitespace
            i++;
            while (i < html.length() && Character.isWhitespace(html.charAt(i))) {
                if (html.charAt(i) == '\n') {
                    hasNewline = true;
                }
                i++;
            }
            
            // Replace sequence with space if it had a newline, otherwise preserve it
            if (hasNewline) {
                sb.append(' ');
            } else {
                sb.append(html, startIndex, i);
            }
        } else {
            // Non-whitespace character
            sb.append(c);
            i++;
        }
    }
    
    return sb.toString();
}

In both cases, there are other improvements:

  • Extract the invariant html.length() into a constant.
  • The value of html.charAt(i) can be held in a final local variable.
  • The String html param can be made final.

I asked it to make more optimizations and it spat out:

public static String trimWhitespaceContainingNewlines(String html) {
    if (html == null || html.isEmpty()) {
        return html;
    }
    
    final int length = html.length();
    StringBuilder sb = new StringBuilder(length);
    int i = 0;
    
    while (i < length) {
        final char c = html.charAt(i++);
        
        if (!Character.isWhitespace(c)) {
            // Non-whitespace character - fast path
            sb.append(c);
            continue;
        }
        
        // Start of whitespace sequence
        boolean hasNewline = c == '\n';
        final int startIndex = i - 1;
        
        // Consume all consecutive whitespace
        while (i < length) {
            final char nextChar = html.charAt(i);
            if (!Character.isWhitespace(nextChar)) {
                break;
            }
            hasNewline |= nextChar == '\n';
            i++;
        }
        
        // Replace sequence with space if it had a newline, otherwise preserve it
        if (hasNewline) {
            sb.append(' ');
        } else {
            sb.append(html, startIndex, i);
        }
    }
    
    return sb.toString();
}

That final version looks pretty good to my eye. I take minor issue with:

    while (i < length) {
        final char c = html.charAt(i++);

It is more idiomatic to increment the index within the looping construct, IMO.
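
To illustrate what I mean, here is a sketch of the same logic with the increment kept in a for loop. The class name ForLoopVariant and the method name collapse are placeholders, and this is only one way to arrange it, not a recommendation over the version above.

```java
public class ForLoopVariant {
  public static String collapse(final String html) {
    final int length = html.length();
    final StringBuilder sb = new StringBuilder(length);

    for (int i = 0; i < length; i++) {
      final char c = html.charAt(i);

      if (!Character.isWhitespace(c)) {
        sb.append(c);
        continue;
      }

      // Scan to the end of this whitespace run, noting any newline.
      int end = i;
      boolean hasNewline = false;
      while (end < length && Character.isWhitespace(html.charAt(end))) {
        hasNewline |= html.charAt(end) == '\n';
        end++;
      }

      // A run with a newline becomes one space; any other run is kept verbatim.
      if (hasNewline) {
        sb.append(' ');
      } else {
        sb.append(html, i, end);
      }

      // Park i on the last whitespace character; the loop's i++ steps past it.
      i = end - 1;
    }

    return sb.toString();
  }
}
```

The trade-off is that the index is still reassigned inside the body (i = end - 1), so this only moves the unusual part rather than removing it.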

DarkTyger • Mar 20 '25 02:03

Here's one more version that looks pretty optimal:

public static String trimWhitespaceContainingNewlines(final String html) {
    if (html == null || html.isEmpty()) {
        return html;
    }
    
    final int length = html.length();
    final StringBuilder sb = new StringBuilder(length);
    int i = 0;
    
    while (i < length) {
        final char c = html.charAt(i++);
        
        if (Character.isWhitespace(c)) {
            boolean hasNewline = c == '\n';
            final int startIndex = i - 1;
            
            while (i < length) {
                final char nextChar = html.charAt(i);
                if (Character.isWhitespace(nextChar)) {
                    hasNewline |= nextChar == '\n';
                    i++;
                } else {
                    break;
                }
            }
            
            if (hasNewline) {
                sb.append(' ');
            } else {
                sb.append(html, startIndex, i);
            }
        } else {
            sb.append(c);
        }
    }
    
    return sb.toString();
}
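
A small harness for checking that this version agrees with the original regex, sketched on the assumption that the input contains only ASCII whitespace. The class name EquivalenceCheck, the helper regex, and the sample strings are illustrative; the method body is a copy of the version above.

```java
import java.util.List;

public class EquivalenceCheck {
  // Copy of the final hand-written version from above.
  public static String trimWhitespaceContainingNewlines(final String html) {
    if (html == null || html.isEmpty()) {
      return html;
    }
    final int length = html.length();
    final StringBuilder sb = new StringBuilder(length);
    int i = 0;
    while (i < length) {
      final char c = html.charAt(i++);
      if (Character.isWhitespace(c)) {
        boolean hasNewline = c == '\n';
        final int startIndex = i - 1;
        while (i < length && Character.isWhitespace(html.charAt(i))) {
          hasNewline |= html.charAt(i) == '\n';
          i++;
        }
        if (hasNewline) {
          sb.append(' ');
        } else {
          sb.append(html, startIndex, i);
        }
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  // The original call, without the trim() that precedes it at the call site.
  public static String regex(final String text) {
    return text.replaceAll("\\s*\n\\s*", " ");
  }

  public static void main(final String[] args) {
    final List<String> samples = List.of(
        "Line one\nLine two",
        "Multiple  spaces between words",
        " a \n\n b ",
        "\t\tx",
        "\n",
        "");

    for (final String s : samples) {
      final boolean same = regex(s).equals(trimWhitespaceContainingNewlines(s));
      // Escape control characters so each sample prints on one line.
      final String shown = s.replace("\n", "\\n").replace("\t", "\\t");
      System.out.println((same ? "same:      " : "DIFFERENT: ") + "\"" + shown + "\"");
    }
  }
}
```

One caveat: Character.isWhitespace accepts some characters that \s does not match by default, such as U+2028 (line separator), so the two can disagree when such a character sits next to a newline. For typical HTML table cells this probably does not matter, but it is worth knowing before calling the two strictly equivalent.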

DarkTyger • Mar 20 '25 02:03