
[Web] Text selection on web always starts at start of line

Open MarcVanDaele90 opened this issue 1 year ago • 11 comments

As mentioned in issue #4, text selection on Web has one remaining issue: it always selects complete lines. This can be reproduced by trying to select a couple of words in the demo application: https://espresso3389.github.io/pdfrx/

MarcVanDaele90 avatar Sep 18 '24 09:09 MarcVanDaele90

Some more observations (which might be obvious to you).

I noticed the following when opening the same (two-page) PDF:

  • on Linux, PdfPageTextPdfium._loadText(...) created 581/292 fragments
  • on Web, PdfPageTextWeb._loadText(...) created only 72/43 fragments

When printing out the text of the resulting PdfPageTextFragment objects, I noticed that Pdfium seems to create fragments at word level, while Web seems to create one fragment per line.

I guess this explains why a selection always starts at the beginning of the line.
Is there some way to get word-level fragments on web as well?
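
To illustrate the guess above, here is a minimal sketch that assumes a selection endpoint snaps to the start of the fragment containing it. The `{ text, start }` fragment shape and both helper functions are hypothetical and are not pdfrx's actual data model.

```javascript
// Hypothetical fragment shape: { text, start }, where start is the character
// offset of the fragment within the page text. Not pdfrx's real data model.
function fragmentsFrom(pieces) {
  let offset = 0;
  return pieces.map((text) => {
    const fragment = { text, start: offset };
    offset += text.length;
    return fragment;
  });
}

// Snap a character offset to the start of the fragment that contains it.
function snapToFragmentStart(fragments, charOffset) {
  let snapped = 0;
  for (const fragment of fragments) {
    if (fragment.start <= charOffset) snapped = fragment.start;
    else break;
  }
  return snapped;
}

const lineLevel = fragmentsFrom(["The quick brown fox "]);
const wordLevel = fragmentsFrom(["The ", "quick ", "brown ", "fox "]);

// The pointer goes down on the "o" of "brown" (character offset 12).
console.log(snapToFragmentStart(lineLevel, 12)); // 0  (start of the line)
console.log(snapToFragmentStart(wordLevel, 12)); // 10 (start of "brown")
```

With one fragment per line, every press inside the line maps to offset 0, which matches the behavior reported on Web.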

MarcVanDaele90 avatar Sep 19 '24 09:09 MarcVanDaele90

You're right. I don't know how to extract word-level coordinates with pdf.js. The pdf.js example viewer can handle word-level coordinates, but it relies on something provided by the HTML canvas or similar. I need to do more research on that...

espresso3389 avatar Sep 19 '24 11:09 espresso3389

Any updates on the text selection feature for the web? There also seems to be a consistency issue when selecting text: for example, it sometimes misses certain words or skips some parts.

StroeAndreX avatar Nov 16 '24 02:11 StroeAndreX

I've just googled this and found the issue.

It explains that the parts dedicated to extracting text positions are:

  • https://github.com/mozilla/pdf.js/blob/master/src/display/text_layer.js
  • https://github.com/mozilla/pdf.js/blob/master/src/display/display_utils.js
  • https://github.com/mozilla/pdf.js/blob/master/src/shared/util.js

I'll read the code to understand how pdf.js handles text coordinates.

espresso3389 avatar Nov 19 '24 16:11 espresso3389

This is great news! Thanks for the heads up!

MarcVanDaele90 avatar Nov 20 '24 06:11 MarcVanDaele90

I've talked with ChatGPT o3-mini-high, and it arrived at an approach almost identical to my previous conclusion.

The following is a rough implementation it suggested:

// Assume pdfPage is a PDF.js page object and targetChar is the character you want.
pdfPage.getTextContent().then(function(textContent) {
  textContent.items.forEach(function(item) {
    // Check if the current text item contains the character.
    if (item.str.indexOf(targetChar) !== -1) {
      // For simplicity, let’s work with the first occurrence.
      const charIndex = item.str.indexOf(targetChar);
      
      // The text item’s transformation matrix:
      // [ a, b, c, d, e, f ] where (e,f) is the translation (origin) and
      // (a,d) roughly correspond to scaling (and b,c to rotation/skew).
      const [a, b, c, d, e, f] = item.transform;
      
      // Estimate the font size (this is a rough approximation):
      // The length of the (c, d) column, sqrt(c*c + d*d), approximates the font height.
      const fontSize = Math.sqrt(c * c + d * d);
      
      // The width property of the item is for the entire text string.
      // To get an approximate width for the individual character, you have two options:
      // (1) If you know the font metrics, you could compute the width proportionally.
      // (2) A simpler (but not always perfect) approach is to use a canvas to measure the text.
      // Here we use a canvas to measure the width of the text up to (and including) the character.
      
      // Create a temporary canvas.
      const canvas = document.createElement('canvas');
      const ctx = canvas.getContext('2d');
      
      // NOTE: You need to use the correct font. PDF.js items include a fontName,
      // but you must resolve that to a CSS font string.
      // For demonstration purposes, we assume a default font:
      ctx.font = fontSize + "px sans-serif";
      
      // Measure the width of the text before the target character.
      const textBefore = item.str.substring(0, charIndex);
      const textForChar = item.str.substring(charIndex, charIndex + 1);
      
      const beforeWidth = ctx.measureText(textBefore).width;
      const charWidth = ctx.measureText(textForChar).width;
      
      // Now, calculate the rectangle.
      // The starting point in PDF coordinate space (roughly) is given by (e, f).
      // The canvas font is only a stand-in for the real one, so rescale the measured
      // widths so that the whole string spans item.width before offsetting.
      // (In real cases with rotation you would have to apply the full matrix to all corners.)
      const fullWidth = ctx.measureText(item.str).width;
      const scale = fullWidth > 0 ? item.width / fullWidth : 0;
      const x = e + beforeWidth * scale;
      const y = f; // baseline; PDF y points up, so the box spans f .. f + fontSize

      const width = charWidth * scale;
      const height = fontSize;
      
      console.log("Bounding rectangle for character:", {
        x: x,
        y: y,
        width: width,
        height: height
      });
      
      // If you need to further process or highlight this rectangle, do it here.
    }
  });
});
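
Option (1) from the comments above, computing widths proportionally, can also be sketched without a canvas. The sketch below splits one line-level text item into word-level boxes by distributing `item.width` evenly over the characters. This is a crude assumption (real glyph widths differ) and it ignores rotation. `splitItemIntoWords` and `sampleItem` are hypothetical names, and `sampleItem` is a hand-written stand-in for an entry of `textContent.items`.

```javascript
// Split one line-level text item into word-level boxes by distributing the
// item's width evenly over its characters. Assumes unrotated text and
// ignores real glyph metrics, so the boxes are only approximate.
function splitItemIntoWords(item) {
  const [, , c, d, e, f] = item.transform;
  const height = Math.hypot(c, d); // approximate font height
  const perChar = item.str.length > 0 ? item.width / item.str.length : 0;
  const words = [];
  const wordPattern = /\S+/g;
  let match;
  while ((match = wordPattern.exec(item.str)) !== null) {
    words.push({
      text: match[0],
      x: e + match.index * perChar, // offset from the item's origin
      y: f, // baseline of the text item
      width: match[0].length * perChar,
      height,
    });
  }
  return words;
}

// Hand-written stand-in for one entry of textContent.items.
const sampleItem = {
  str: "Hello PDF world",
  width: 150,
  transform: [12, 0, 0, 12, 100, 700],
};

console.log(splitItemIntoWords(sampleItem));
```

For `sampleItem`, each character gets 10 units, so "PDF" starts at x = 160 with a width of 30. Replacing the even split with measured or font-metric widths should tighten the boxes.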

espresso3389 avatar Feb 06 '25 08:02 espresso3389

Another approach is to introduce a WASM version of pdfium (related: #109), though it poses several technical challenges...

espresso3389 avatar Feb 06 '25 08:02 espresso3389

#310 is a discussion about introducing a WASM version of pdfium.

espresso3389 avatar Feb 06 '25 09:02 espresso3389

@espresso3389 With the new version, text selection has improved dramatically. But there is still one important issue that needs to be addressed.

When you start selecting text from the middle of a line and drag downward to the lines below, it doesn’t recognize that it should select the entire line up to the cursor point. As a result, there is a breaking point in the selection, as shown in the attached image:

Image

A note that might help: when you drag from bottom to top, it works as intended, except for the bottom line (the one where you start the selection), which still breaks.

Image

In short, the bottom line breaks.
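
The expected behavior described above can be sketched as a reading-order selection: the starting line is selected from the anchor to its end, the lines in between are selected whole, and the last line is selected up to the cursor. This is only an illustration under a single-column assumption. The `{ line, offset }` position model and `selectInReadingOrder` are hypothetical, not part of pdfrx.

```javascript
// Lines are plain strings; anchor and focus are { line, offset } positions.
// Hypothetical model for illustration only, not pdfrx's API.
function selectInReadingOrder(lines, anchor, focus) {
  // Normalize so that start precedes end, whichever way the user dragged.
  const forward =
    anchor.line < focus.line ||
    (anchor.line === focus.line && anchor.offset <= focus.offset);
  const [start, end] = forward ? [anchor, focus] : [focus, anchor];

  const ranges = [];
  for (let line = start.line; line <= end.line; line++) {
    const from = line === start.line ? start.offset : 0;
    const to = line === end.line ? end.offset : lines[line].length;
    ranges.push({ line, from, to, text: lines[line].slice(from, to) });
  }
  return ranges;
}

const lines = ["first line of text", "second line", "third line here"];
const topDown = selectInReadingOrder(
  lines,
  { line: 0, offset: 6 },
  { line: 2, offset: 5 }
);
console.log(topDown.map((range) => range.text));
// [ 'line of text', 'second line', 'third' ]
```

Dragging from bottom to top with the same two positions yields the same ranges, because the endpoints are normalized first. A page with several columns would need a different notion of "next line".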

StroeAndreX avatar Mar 16 '25 18:03 StroeAndreX

@StroeAndreX, at least currently, this is the intended behavior.

Some documents have multiple columns, and the selection behavior works well with them:

https://github.com/user-attachments/assets/2622f295-f719-4510-81f1-3d54fd8c16f0

And even with this behavior, you can of course select to the end of the paragraph relatively easily:

https://github.com/user-attachments/assets/f9f91d9a-24a7-45b3-aa46-641e803223bb

espresso3389 avatar Mar 17 '25 17:03 espresso3389

Frankly speaking, text selection is a complex system, and there were many decisions and challenges in implementing the current behavior.

The current behavior is what I finally arrived at after more than a year of investigation and trials. It's not easy to change this kind of UI behavior.

But I also want to improve the behavior if possible; if you or someone else has a better implementation, I'd be happy to accept such PRs.

espresso3389 avatar Mar 17 '25 17:03 espresso3389

Finally, recent pdfrx on web supports PDFium, and pdfrx 2.0.0 has a completely rewritten text selection mechanism. Please check it out!

espresso3389 avatar Jul 18 '25 09:07 espresso3389