page.drawText() inserts spaces when using Thai font

Open robin-dunn opened this issue 4 years ago • 11 comments

What were you trying to do?

I am trying to use the page.drawText() function to render text in the Thai language

Why were you trying to do this?

To build an application that creates PDF files containing text written in the Thai language

How did you attempt to do it?

The steps I followed are:

  • Download Google Noto Sans Thai font
  • Embed the font in the pdf-lib PDF document
  • Invoke the page.drawText() function passing in the text in Thai

See code example provided in reproduction steps section below.

What actually happened?

The PDF file was successfully created but it seems some large spaces have been inserted into the Thai text in the PDF.

I've copied the text from the PDF and pasted below, notice the strange block characters which have been inserted.

แห่งได้เป􏰀ดขึ􏰁นแล้วในการขยายรถไฟใต้ดินลอนดอนครั􏰁งใหญ่ครั􏰁งแรกในศตวรรษนี

Those strange characters appear visually as large blank spaces in the PDF, e.g. like this:

แห่งได้เป ดขึ นแล้วในการขยายรถไฟใต้ดินลอนดอนครั งใหญ่ครั งแรกในศตวรรษนี

What did you expect to happen?

I expected the Thai text to be rendered as one continuous string without any strange characters or spaces inserted:

แห่งได้เปดขึนแล้วในการขยายรถไฟใต้ดินลอนดอนครังใหญ่ครังแรกในศตวรรษนี

How can we reproduce the issue?

  • Create a Node JS project folder e.g. called 'pdf-test'
  • cd pdf-test
  • npm init -y
  • npm i pdf-lib
  • npm i @pdf-lib/fontkit
  • Download Noto Sans Thai font from https://fonts.google.com/download?family=Noto%20Sans%20Thai
  • Unzip the font and copy the TTF file from Noto_Sans_Thai/static/NotoSansThai/NotoSansThai-Regular.ttf into the project folder pdf-test so it can be loaded by the index.js script below
  • Create a file called index.js and paste the code from below
  • Run the index.js file using the command node index.js which will create the PDF file containing some Thai text
  • Use a PDF viewer/browser e.g. Google Chrome to view the rendered PDF
  • Notice the spacing between some of the Thai text
const fs = require('fs');
const path = require('path');
const { PDFDocument, rgb } = require('pdf-lib');
const fontkit = require('@pdf-lib/fontkit');

(async function run() {

    const pdfDoc = await PDFDocument.create()
    pdfDoc.registerFontkit(fontkit)
    
    // Font downloaded from https://fonts.google.com/download?family=Noto%20Sans%20Thai
    // See also https://fonts.google.com/noto/specimen/Noto+Sans+Thai?query=thai
    const thaiFontBytes = fs.readFileSync(path.join(__dirname, './NotoSansThai-Regular.ttf'))

    const thaiFont = await pdfDoc.embedFont(thaiFontBytes)
    const page = pdfDoc.addPage()
    const { width, height } = page.getSize()

    const fontSize = 11
    page.drawText('แห่งได้เปิดขึ้นแล้วในการขยายรถไฟใต้ดินลอนดอนครั้งใหญ่ครั้งแรกในศตวรรษนี้', {
        x: 50,
        y: height - 2 * fontSize,
        size: fontSize,
        font: thaiFont,
        color: rgb(0, 0.53, 0.71),
    })

    const pdfBytes = await pdfDoc.save()
    fs.writeFile('thai-test.pdf', pdfBytes, (err) => {
        // Surface write failures instead of always reporting success
        if (err) throw err
        console.log('PDF file saved.')
    })
})()

Version

1.16.0

What environment are you running pdf-lib in?

Node

robin-dunn avatar Oct 01 '21 10:10 robin-dunn

I am also facing this problem. My guess is that the bug is in the UnicodeLayoutEngine class of the @pdf-lib/fontkit library.

hlab-pawat avatar Oct 02 '21 17:10 hlab-pawat

Same for me, with many fonts.

chacal88 avatar Oct 11 '21 17:10 chacal88

Hey, I see the same issue here. When I write to a document using fonts from the Google Fonts API, spaces (" ") are sometimes added to my text, like this: (image)

I'm looking for light 💡

pfmartins avatar Oct 11 '21 21:10 pfmartins

@tudor-sandu, is this the issue you guys are experiencing?

cassilup avatar Oct 21 '21 12:10 cassilup

Same here with Helvetica Neue Roman and Helvetica Neue Condensed. It inserts spaces, for example after the sequence fi, but not after i or f by itself. For example, Backoffice becomes Backoffi ce and fifi becomes fi fi.

akomm avatar Nov 11 '21 13:11 akomm

(For Thai fonts) the issue can be resolved by using .embedFont(fontBytes, { subset: true }). I don't know why this helps.
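
A minimal sketch of this workaround, for anyone who wants to try it. The helper name embedSubsetFont is hypothetical and not part of pdf-lib; it assumes pdfDoc is a PDFDocument on which registerFontkit(fontkit) has already been called, as in the reproduction script above. The thread does not establish why subsetting helps, so treat this as something to test rather than a confirmed fix.

```javascript
// Hypothetical helper (not part of pdf-lib): embed a custom font with
// subsetting enabled, as suggested in this thread.
// Assumes `pdfDoc` is a pdf-lib PDFDocument with fontkit already registered.
function embedSubsetFont(pdfDoc, fontBytes) {
  // `subset: true` asks pdf-lib to embed a subset of the font
  // instead of the whole font file.
  return pdfDoc.embedFont(fontBytes, { subset: true });
}
```

In the reproduction script, this would replace the call on the embedFont line: const thaiFont = await embedSubsetFont(pdfDoc, thaiFontBytes)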

MetheeS avatar Nov 12 '21 00:11 MetheeS

The effect in the first post is that bytes are added to the text which fall outside the valid range of the character set. In a PDF, if there is no character for that byte sequence (UTF-8 is multi-byte with variable length), a reader renders it as a space. When you copy the text, the actual data including the added bytes is copied. When you then paste it into a program that renders invalid or non-printable characters as placeholder glyphs (the squares in the first post), it shows the data as hex (for example 10F0C1) instead of rendering a space.

Also, none of the examples, including my case, look like the font simply lacks a proper glyph for a character.

I also ruled out non-printable bytes being present in the source beforehand. They are being added when rendering the PDF.

https://unicode-table.com/en/search/?q=10F0C1

https://www.unicode.org/charts/PDF/U100000.pdf Quote:

The Supplementary Private Use Area-B block encompasses the entire range of Plane 16. The range U+100000..U+10FFFD is entirely designated for private use. The last two code points on the plane, U+10FFFE..U+10FFFF, are designated noncharacters. Consequently, no character code charts or names lists are provided for the majority of this block, except that a chart and names list are provided for the last 128 code points, to show the location of the noncharacters.
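
To check whether text copied out of a generated PDF contains such code points, a small script along these lines may help. It is only a sketch: the function name findPrivateUseCodePoints and the sample string are made up for illustration, and the range constants are taken from the Unicode chart quoted above.

```javascript
// Supplementary Private Use Area-B, per the Unicode chart quoted above.
const PUA_B_START = 0x100000;
const PUA_B_END = 0x10fffd;

// Hypothetical helper: list the code points in `text` that fall inside
// Supplementary Private Use Area-B, formatted as "U+XXXXXX".
function findPrivateUseCodePoints(text) {
  const found = [];
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    if (codePoint >= PUA_B_START && codePoint <= PUA_B_END) {
      found.push('U+' + codePoint.toString(16).toUpperCase());
    }
  }
  return found;
}

// Made-up sample: a Thai fragment with U+10F0C1 where a vowel mark should be.
const pasted = 'เป\u{10F0C1}ด';
console.log(findPrivateUseCodePoints(pasted)); // logs [ 'U+10F0C1' ]
```

An empty result for the source string and a non-empty result for the text copied from the PDF would support the observation that the extra code points are introduced while the PDF is produced.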

akomm avatar Nov 12 '21 08:11 akomm

(For Thai fonts) the issue can be resolved by using .embedFont(fontBytes, { subset: true }). I don't know why this helps.

This solution also works for Khmer fonts.

ponnreay avatar Dec 18 '21 07:12 ponnreay

@akomm

Same here with Helvetica Neue Roman and Helvetica Neue Condensed. It inserts spaces, for example after the sequence fi, but not after i or f by itself. For example, Backoffice becomes Backoffi ce and fifi becomes fi fi.

Try the following: await pdfDoc.embedFont(YOURFONT, { features: { liga: false } });

It definitely is a bug and in my opinion is an issue that should be fixed: https://github.com/Hopding/pdf-lib/issues/490
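
For reference, the suggestion above could be wrapped as follows. This is a sketch under assumptions: the helper name embedFontWithoutLigatures is hypothetical, pdfDoc is assumed to be a PDFDocument with fontkit registered, and the installed pdf-lib version is assumed to accept the features option on embedFont.

```javascript
// Hypothetical helper (not part of pdf-lib): embed a custom font with the
// OpenType "liga" (standard ligatures) feature switched off.
// Assumes `pdfDoc` is a pdf-lib PDFDocument with fontkit already registered
// and that this pdf-lib version supports the `features` option.
function embedFontWithoutLigatures(pdfDoc, fontBytes) {
  return pdfDoc.embedFont(fontBytes, { features: { liga: false } });
}
```

With ligatures disabled, sequences such as fi should be laid out as separate f and i glyphs, which is the case that reportedly does not trigger the extra space.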

AgileEduLabs avatar Jan 13 '22 09:01 AgileEduLabs

(For Thai fonts) the issue can be resolved by using .embedFont(fontBytes, { subset: true }). I don't know why this helps.

This solution also works for Calibri fonts

c-sanchez-fd avatar Jan 22 '24 16:01 c-sanchez-fd