Is it possible to create a PDF with UTF-8 character encoding?
This is my failing test in Kotlin:
@Test
fun test_parseToPdf_convertsMarkdownToPdfWithUTF8CharacterSet() {
    val markdown = "Общие"
    val inputStream = markdownParser.parseToPdf(markdown, "test").inputStream()
    val fileText = pdfText(inputStream)
    assertThat(fileText).contains("Общие")
}

private fun pdfText(input: InputStream): String? {
    return try {
        // use {} closes the PDDocument even if text extraction throws
        PDDocument.load(input).use { document ->
            PDFTextStripper().getText(document)
        }
    } catch (e: Exception) {
        e.printStackTrace()
        null
    }
}
This is my parser class:
class MarkdownParser(private val parser: Parser,
                     private val htmlRenderer: HtmlRenderer) {

    fun parseToHtml(markdownContent: String): String {
        val document = parser.parse(markdownContent)
        return htmlRenderer.render(document)
    }

    fun parseToPdf(markdownContent: String, path: String): ByteArray {
        val options = PegdownOptionsAdapter.flexmarkOptions(
            Extensions.ALL and (Extensions.ANCHORLINKS or Extensions.EXTANCHORLINKS_WRAP).inv()
        ).toMutable()
        val html = parseToHtml(markdownContent)
        val out = ByteArrayOutputStream()
        PdfConverterExtension.exportToPdf(out, html, path, options)
        return out.toByteArray()
    }
}
Parser is com.vladsch.flexmark.parse.Parser, HtmlRenderer is com.vladsch.flexmark.html.HtmlRenderer.
Because I only pass an OutputStream to PdfConverterExtension, I have no control over how the data is written. Is it possible to create a PDF with UTF-8 characters? The HTML content still has the correct encoding.
@shaolinh84, I am looking into it because it seems that openhtmltopdf is not converting the characters in the HTML (taken from the String variable passed to openhtmltopdf):
<html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body>
<ul>
<li>Test PDF with Unicode chars: Общие</li>
</ul>
</body></html>
The resulting PDF is:

It could be some configuration that is missing.
@shaolinh84, it seems that the PDF conversion depends on the fonts which are used and whether they have the given Unicode characters.
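To decide which fonts need to be available, it can help to list the Unicode scripts a piece of text actually uses. The following is a minimal diagnostic sketch that uses only the JDK; the class name ScriptCheck and the method scriptsIn are hypothetical and are not part of flexmark-java or openhtmltopdf. It does not inspect any font, it only tells you which scripts your fonts would have to cover.

```java
import java.util.EnumSet;
import java.util.Set;

public class ScriptCheck {

    /**
     * Returns the Unicode scripts used in the text. COMMON and INHERITED
     * (spaces, punctuation, digits, combining marks) are skipped because
     * they do not identify a particular script.
     */
    public static Set<Character.UnicodeScript> scriptsIn(String text) {
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    public static void main(String[] args) {
        // The list item from the HTML above mixes Latin and Cyrillic text.
        System.out.println(scriptsIn("Test PDF with Unicode chars: Общие"));
    }
}
```

For the sample list item this reports both LATIN and CYRILLIC, so a font that only covers Latin glyphs would not be enough.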
You should skip the flexmark-java PDF converter and build your own PDF conversion from the code used in the converter, adding the fonts you need to the PDF. I have not done this yet, so it is a theoretical solution.
The code in the PDF converter extension is:
public static void exportToPdf(final OutputStream os, final String html, final String url, final PdfRendererBuilder.TextDirection defaultTextDirection) {
    try {
        // There are more options on the builder than shown below.
        PdfRendererBuilder builder = new PdfRendererBuilder();

        if (defaultTextDirection != null) {
            builder.useUnicodeBidiSplitter(new ICUBidiSplitter.ICUBidiSplitterFactory());
            builder.useUnicodeBidiReorderer(new ICUBidiReorderer());
            builder.defaultTextDirection(defaultTextDirection); // OR RTL
        }

        org.jsoup.nodes.Document doc;
        doc = Jsoup.parse(html);

        Document dom = DOMBuilder.jsoup2DOM(doc);
        builder.withW3cDocument(dom, url);
        builder.toStream(os);
        builder.run();
    } catch (Exception e) {
        e.printStackTrace();
        // LOG exception
    } finally {
        try {
            os.close();
        } catch (IOException e) {
            // swallow
        }
    }
}
From what I understand, PdfRendererBuilder has a method for adding a font to the PDF conversion:
public PdfRendererBuilder useFont(FSSupplier<InputStream> supplier, String fontFamily, Integer fontWeight, PdfRendererBuilder.FontStyle fontStyle, boolean subset) {
    this._fonts.add(new PdfRendererBuilder.AddedFont(supplier, fontWeight, fontFamily, subset, fontStyle));
    return this;
}
A better example is in the issues for openhtmltopdf: https://github.com/danfickle/openhtmltopdf/issues/129
I have some issue...
One solution to the font problem is to define an embedded TrueType font in the style element or stylesheet and set the body tag to use that font. OpenHtmlToPDF will take each character from the font that has it defined.
For example, include the Noto Serif/Sans/Mono fonts and add the noto-serif, noto-sans and noto-mono families to the CSS so that the PDF can use them for rendering text.
However, the PDF converter requires TrueType fonts, and the Noto CJK fonts are OpenType (.otf) fonts, which cannot be used. The solution is to download a TrueType Unicode font that supports the CJK character set and add it to the custom rendering profile used for PDF.
For my test I used arialuni.ttf from https://www.wfonts.com/font/arial-unicode-ms
If the fonts are installed in /usr/local/fonts/, add the following to the stylesheet:
@font-face {
    font-family: 'noto-cjk';
    src: url('file:///usr/local/fonts/arialuni.ttf');
    font-weight: normal;
    font-style: normal;
}

@font-face {
    font-family: 'noto-serif';
    src: url('file:///usr/local/fonts/NotoSerif-Regular.ttf');
    font-weight: normal;
    font-style: normal;
}

@font-face {
    font-family: 'noto-serif';
    src: url('file:///usr/local/fonts/NotoSerif-Bold.ttf');
    font-weight: bold;
    font-style: normal;
}

@font-face {
    font-family: 'noto-serif';
    src: url('file:///usr/local/fonts/NotoSerif-BoldItalic.ttf');
    font-weight: bold;
    font-style: italic;
}

@font-face {
    font-family: 'noto-serif';
    src: url('file:///usr/local/fonts/NotoSerif-Italic.ttf');
    font-weight: normal;
    font-style: italic;
}

@font-face {
    font-family: 'noto-sans';
    src: url('file:///usr/local/fonts/NotoSans-Regular.ttf');
    font-weight: normal;
    font-style: normal;
}

@font-face {
    font-family: 'noto-sans';
    src: url('file:///usr/local/fonts/NotoSans-Bold.ttf');
    font-weight: bold;
    font-style: normal;
}

@font-face {
    font-family: 'noto-sans';
    src: url('file:///usr/local/fonts/NotoSans-BoldItalic.ttf');
    font-weight: bold;
    font-style: italic;
}

@font-face {
    font-family: 'noto-sans';
    src: url('file:///usr/local/fonts/NotoSans-Italic.ttf');
    font-weight: normal;
    font-style: italic;
}

@font-face {
    font-family: 'noto-mono';
    src: url('file:///usr/local/fonts/NotoMono-Regular.ttf');
    font-weight: normal;
    font-style: normal;
}

body {
    font-family: 'noto-sans', 'noto-cjk', sans-serif;
    overflow: hidden;
    word-wrap: break-word;
    font-size: 14px;
}

var,
code,
kbd,
pre {
    font: 0.9em 'noto-mono', Consolas, "Liberation Mono", Menlo, Courier, monospace;
}
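If the HTML is produced in code, as in the question, the stylesheet can be placed in a style element before the string is handed to the converter. The sketch below uses only plain string handling from the JDK; the class FontFaceCss and its methods fontFace and injectStyle are hypothetical helpers, not part of flexmark-java or openhtmltopdf. It assumes Unix-style absolute font paths and an HTML string that contains a literal <head> tag without attributes.

```java
public class FontFaceCss {

    /** Builds one @font-face rule for a TrueType file at an absolute Unix-style path. */
    public static String fontFace(String family, String absolutePath, String weight, String style) {
        return "@font-face {\n"
                + "    font-family: '" + family + "';\n"
                + "    src: url('file://" + absolutePath + "');\n"
                + "    font-weight: " + weight + ";\n"
                + "    font-style: " + style + ";\n"
                + "}\n";
    }

    /**
     * Inserts a style element right after the opening head tag.
     * Returns the HTML unchanged if no literal "<head>" is found.
     */
    public static String injectStyle(String html, String css) {
        int headStart = html.indexOf("<head>");
        if (headStart < 0) {
            return html;
        }
        int insertAt = headStart + "<head>".length();
        return html.substring(0, insertAt)
                + "<style>\n" + css + "</style>"
                + html.substring(insertAt);
    }

    public static void main(String[] args) {
        // Font locations are the ones assumed in the stylesheet above.
        String css = fontFace("noto-sans", "/usr/local/fonts/NotoSans-Regular.ttf", "normal", "normal")
                + fontFace("noto-cjk", "/usr/local/fonts/arialuni.ttf", "normal", "normal")
                + "body { font-family: 'noto-sans', 'noto-cjk', sans-serif; }\n";
        String html = "<html><head></head><body><ul><li>Общие</li></ul></body></html>";
        System.out.println(injectStyle(html, css));
    }
}
```

The resulting string could then be passed as the html argument of exportToPdf in place of the unstyled HTML.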
The sample PdfConverter.java has been updated, and a wiki page with this information has been added: PDF-Renderer-Converter