
Is it possible to create a PDF with UTF-8 character encoding?

Open shaolinh84 opened this issue 8 years ago • 4 comments

This is my failing test in Kotlin:

    @Test
    fun test_parseToPdf_convertsMarkdownToPdfWithUTF8CharacterSet() {
        val markdown = "Общие"
        val inputStream = markdownParser.parseToPdf(markdown, "test").inputStream()
        val fileText = pdfText(inputStream)
        assertThat(fileText).contains("Общие")
    }

    private fun pdfText(input: InputStream): String? {
        return try {
            // Close the document once its text has been extracted.
            PDDocument.load(input).use { document ->
                PDFTextStripper().getText(document)
            }
        } catch (e: Exception) {
            e.printStackTrace()
            null
        }
    }

This is my parser class:

    class MarkdownParser(private val parser: Parser,
                         private val htmlRenderer: HtmlRenderer) {

        fun parseToHtml(markdownContent: String): String {
            val document = parser.parse(markdownContent)
            return htmlRenderer.render(document)
        }

        fun parseToPdf(markdownContent: String, path: String): ByteArray {
            val options = PegdownOptionsAdapter.flexmarkOptions(
                    Extensions.ALL and (Extensions.ANCHORLINKS or Extensions.EXTANCHORLINKS_WRAP).inv()
            ).toMutable()

            val html = parseToHtml(markdownContent)
            val out = ByteArrayOutputStream()

            PdfConverterExtension.exportToPdf(out, html, path, options)

            return out.toByteArray()
        }

    }

Parser is com.vladsch.flexmark.parser.Parser, HtmlRenderer is com.vladsch.flexmark.html.HtmlRenderer.

Since I only pass an OutputStream to the PdfConverterExtension, I have no control over how the data is written. Is it possible to create a PDF with UTF-8 characters? The HTML content still has the correct encoding.

shaolinh84 avatar Dec 08 '17 14:12 shaolinh84

@shaolinh84, I am looking into it. It seems that openhtmltopdf is not converting the characters in the HTML (taken from the String variable passed to openhtmltopdf):

    <html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body>
    <ul>
        <li>Test PDF with Unicode chars: Общие</li>
    </ul>

    </body></html>

The resulting PDF is:

[image of the resulting PDF]

It could be some configuration that is missing.

vsch avatar Dec 08 '17 18:12 vsch

@shaolinh84, it seems that the PDF conversion depends on which fonts are used and whether they contain the given Unicode characters.
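To decide which fonts are needed, it can help to see which scripts the text actually uses. The following is a minimal sketch using only the JDK; the class and method names are hypothetical and it is not part of flexmark-java or openhtmltopdf. It only reports scripts, it does not check whether a particular font covers them.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

public class ScriptScan {
    /**
     * Returns the Unicode scripts used in the text. Characters shared between
     * scripts (spaces, digits, punctuation) and unassigned code points are ignored.
     */
    static Set<UnicodeScript> scriptsIn(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    public static void main(String[] args) {
        // The escapes spell the Cyrillic word from the failing test.
        String sample = "Test PDF with Unicode chars: \u041E\u0431\u0449\u0438\u0435";
        System.out.println(scriptsIn(sample));
    }
}
```

If the result contains a script such as CYRILLIC or HAN, the fonts given to the converter need to include glyphs for it.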

You should skip the flexmark-java PDF converter and build your own PDF conversion based on the code used in the converter, adding fonts that contain the required characters to the PDF. I have not done this yet, so it is a theoretical solution.

The code in PDF converter extension is:

    public static void exportToPdf(final OutputStream os, final String html, final String url, final PdfRendererBuilder.TextDirection defaultTextDirection) {
        try {
            // There are more options on the builder than shown below.
            PdfRendererBuilder builder = new PdfRendererBuilder();

            if (defaultTextDirection != null) {
                builder.useUnicodeBidiSplitter(new ICUBidiSplitter.ICUBidiSplitterFactory());
                builder.useUnicodeBidiReorderer(new ICUBidiReorderer());
                builder.defaultTextDirection(defaultTextDirection); // OR RTL
            }

            org.jsoup.nodes.Document doc;
            doc = Jsoup.parse(html);

            Document dom = DOMBuilder.jsoup2DOM(doc);
            builder.withW3cDocument(dom, url);
            builder.toStream(os);
            builder.run();
        } catch (Exception e) {
            e.printStackTrace();
            // LOG exception
        } finally {
            try {
                os.close();
            } catch (IOException e) {
                // swallow
            }
        }
    }

From what I understand, PdfRendererBuilder has a method for adding a font to the PDF conversion:

    public PdfRendererBuilder useFont(FSSupplier<InputStream> supplier, String fontFamily, Integer fontWeight, PdfRendererBuilder.FontStyle fontStyle, boolean subset) {
        this._fonts.add(new PdfRendererBuilder.AddedFont(supplier, fontWeight, fontFamily, subset, fontStyle));
        return this;
    }

A better example is in the issues for openhtmltopdf: https://github.com/danfickle/openhtmltopdf/issues/129

vsch avatar Dec 08 '17 19:12 vsch

I have the same issue...

a-reznic avatar Dec 18 '17 13:12 a-reznic

A solution to the font problem is to define an embedded TrueType font in the style element or stylesheet and set the body tag to use this font. OpenHtmlToPDF will take each character from a font that has it defined.

For example, include the Noto Serif/Sans/Mono fonts and add the noto-serif, noto-sans and noto-mono families to the CSS so that the PDF can use them for rendering text.

However, the PDF converter requires TrueType fonts, and the Noto CJK fonts are OpenType (.otf) fonts, which cannot be used. The solution is to download a TrueType Unicode font that supports the CJK character set and add it to the custom rendering profile used for PDF.
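Before wiring a font into the stylesheet, the file format can be checked from its first four bytes. This is a minimal sketch using only the JDK, with hypothetical class and method names. It assumes the distinction that matters here is between TrueType outlines and CFF-based OpenType files, and it only reads the header tag; it does not verify that the converter will accept the font.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FontKind {
    /** Classifies a font by the big-endian 4-byte tag at the start of the file. */
    static String classify(int tag) {
        switch (tag) {
            case 0x00010000: return "TrueType outlines";
            case 0x74727565: return "TrueType outlines (Apple 'true' tag)";
            case 0x4F54544F: return "OpenType with CFF outlines ('OTTO' tag)";
            case 0x74746366: return "font collection ('ttcf' tag)";
            default:         return "unknown";
        }
    }

    /** Same as above, taking the first bytes of the file. */
    static String classify(byte[] header) {
        if (header == null || header.length < 4) {
            return "unknown";
        }
        return classify(ByteBuffer.wrap(header).getInt());
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java FontKind <font file>");
            return;
        }
        byte[] header = new byte[4];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new DataInputStream(in).readFully(header);
        }
        System.out.println(args[0] + ": " + classify(header));
    }
}
```

A file reported as having CFF outlines would be one to replace with a TrueType alternative, following the advice above.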

For my test I used arialuni.ttf from https://www.wfonts.com/font/arial-unicode-ms

If the fonts are installed in /usr/local/fonts/, add the following to the stylesheet:

    @font-face {
      font-family: 'noto-cjk';
      src: url('file:///usr/local/fonts/arialuni.ttf');
      font-weight: normal;
      font-style: normal;
    }

    @font-face {
      font-family: 'noto-serif';
      src: url('file:///usr/local/fonts/NotoSerif-Regular.ttf');
      font-weight: normal;
      font-style: normal;
    }

    @font-face {
      font-family: 'noto-serif';
      src: url('file:///usr/local/fonts/NotoSerif-Bold.ttf');
      font-weight: bold;
      font-style: normal;
    }

    @font-face {
      font-family: 'noto-serif';
      src: url('file:///usr/local/fonts/NotoSerif-BoldItalic.ttf');
      font-weight: bold;
      font-style: italic;
    }

    @font-face {
      font-family: 'noto-serif';
      src: url('file:///usr/local/fonts/NotoSerif-Italic.ttf');
      font-weight: normal;
      font-style: italic;
    }

    @font-face {
      font-family: 'noto-sans';
      src: url('file:///usr/local/fonts/NotoSans-Regular.ttf');
      font-weight: normal;
      font-style: normal;
    }

    @font-face {
      font-family: 'noto-sans';
      src: url('file:///usr/local/fonts/NotoSans-Bold.ttf');
      font-weight: bold;
      font-style: normal;
    }

    @font-face {
      font-family: 'noto-sans';
      src: url('file:///usr/local/fonts/NotoSans-BoldItalic.ttf');
      font-weight: bold;
      font-style: italic;
    }

    @font-face {
      font-family: 'noto-sans';
      src: url('file:///usr/local/fonts/NotoSans-Italic.ttf');
      font-weight: normal;
      font-style: italic;
    }

    @font-face {
      font-family: 'noto-mono';
      src: url('file:///usr/local/fonts/NotoMono-Regular.ttf');
      font-weight: normal;
      font-style: normal;
    }

    body {
        font-family: 'noto-sans', 'noto-cjk', sans-serif;
        overflow: hidden;
        word-wrap: break-word;
        font-size: 14px;
    }

    var,
    code,
    kbd,
    pre {
        font: 0.9em 'noto-mono', Consolas, "Liberation Mono", Menlo, Courier, monospace;
    }
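One way to apply rules like these without a separate stylesheet file is to inline them into the HTML string before it is passed to the PDF converter. The following is a minimal sketch using only the JDK. The class and method names are hypothetical, it generates only normal-weight, normal-style faces, and it assumes the HTML contains a plain lowercase `<head>` tag (otherwise the style element is simply prepended).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FontCss {
    /** Builds one @font-face rule (normal weight and style) per family-to-URL entry. */
    static String fontFaces(Map<String, String> familyToUrl) {
        StringBuilder css = new StringBuilder();
        for (Map.Entry<String, String> e : familyToUrl.entrySet()) {
            css.append("@font-face {\n")
               .append("  font-family: '").append(e.getKey()).append("';\n")
               .append("  src: url('").append(e.getValue()).append("');\n")
               .append("  font-weight: normal;\n")
               .append("  font-style: normal;\n")
               .append("}\n");
        }
        return css.toString();
    }

    /** Inserts a style element right after the first <head> tag, or prepends it if none is found. */
    static String withStyle(String html, String css) {
        String style = "<style>\n" + css + "</style>";
        int head = html.indexOf("<head>");
        if (head < 0) {
            return style + html;
        }
        int at = head + "<head>".length();
        return html.substring(0, at) + style + html.substring(at);
    }

    public static void main(String[] args) {
        Map<String, String> fonts = new LinkedHashMap<>();
        fonts.put("noto-sans", "file:///usr/local/fonts/NotoSans-Regular.ttf");
        fonts.put("noto-cjk", "file:///usr/local/fonts/arialuni.ttf");

        String css = fontFaces(fonts)
                + "body { font-family: 'noto-sans', 'noto-cjk', sans-serif; }\n";
        // The escapes spell the Cyrillic word from the original test.
        String html = "<html><head></head><body><p>\u041E\u0431\u0449\u0438\u0435</p></body></html>";

        System.out.println(withStyle(html, css));
    }
}
```

The returned string would then take the place of the html argument in the PdfConverterExtension.exportToPdf call shown earlier in the thread.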

The sample PdfConverter.java has been updated, and a wiki page with this information has been added: PDF-Renderer-Converter

vsch avatar Jan 24 '19 22:01 vsch