Annoying warning "Syntax Warning: Could not parse ligature component" and a possible way to suppress these messages

Open g-rd opened this issue 3 years ago • 1 comments

I am getting the following warnings on some PDFs. They originate from Poppler and are written to stderr by pdftotext, which has a "-q" flag to suppress them.

Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName
Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName

Please add an option to suppress Poppler warnings, or at least handle stderr in the pdftotext wrapper. Currently the warnings are sent to stderr and can't be caught (to my understanding).

Below I added an option where the subprocess's stderr output is sent to the logger, so it's easy to suppress the messages when they are not needed.


def to_text(path: str, area_details: dict = None):
    """Wrapper around Poppler pdftotext.

    Parameters
    ----------
    path : str
        path of electronic invoice in PDF
    area_details : dictionary
        of the format {f: int, l: int, r: int, x: int, y: int, W: int, H: int}
        used when extracting an area of the pdf rather than the whole document

    Returns
    -------
    out : str
        returns extracted text from pdf

    Raises
    ------
    EnvironmentError:
        If pdftotext library is not found
    """
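    import logging
    import shutil  # replaces distutils.spawn, which was removed in Python 3.12
    import subprocess

    # Logger that receives the stderr output captured below
    logger = logging.getLogger(__name__)

    if shutil.which("pdftotext"):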
        cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
        if area_details is not None:
            # An area was specified
            # Validate the required keys were provided
            assert 'f' in area_details, 'Area f details missing'
            assert 'l' in area_details, 'Area l details missing'
            assert 'r' in area_details, 'Area r details missing'
            assert 'x' in area_details, 'Area x details missing'
            assert 'y' in area_details, 'Area y details missing'
            assert 'W' in area_details, 'Area W details missing'
            assert 'H' in area_details, 'Area H details missing'
            # Convert all of the values to strings
            for key in area_details.keys():
                area_details[key] = str(area_details[key])
            cmd += [
                '-f', area_details['f'],
                '-l', area_details['l'],
                '-r', area_details['r'],
                '-x', area_details['x'],
                '-y', area_details['y'],
                '-W', area_details['W'],
                '-H', area_details['H'],
            ]
        cmd += [path, "-"]
        # Run the extraction
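        # stderr is captured so Poppler warnings do not leak to the console
        out, err = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        ).communicate()
        if err:
            # Forward each non-empty stderr line to the logger
            for line in err.decode(errors="replace").splitlines():
                if line:
                    logger.debug(line)
        return out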
    else:
        raise EnvironmentError(
            "pdftotext not installed. Can be downloaded from https://poppler.freedesktop.org/"
        )
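
To illustrate why routing stderr through the logger makes the messages easy to silence, here is a minimal, self-contained sketch. It does not call pdftotext; it replays two of the warning lines quoted above through a helper. The names `log_stderr`, `ListHandler` and the logger name `pdftotext_demo` are made up for this example and are not part of the wrapper.

```python
import logging

logger = logging.getLogger("pdftotext_demo")  # hypothetical logger name


def log_stderr(err: bytes) -> int:
    """Forward each non-empty stderr line to the logger; return the line count."""
    lines = [ln for ln in err.decode(errors="replace").splitlines() if ln.strip()]
    for ln in lines:
        logger.debug(ln)
    return len(lines)


class ListHandler(logging.Handler):
    """Collects log messages in memory so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

# Two of the warning lines from the report, as pdftotext would write them to stderr
sample = (
    b'Syntax Warning: Could not parse ligature component "no" of "no_break_space" in parseCharName\n'
    b'Syntax Warning: Could not parse ligature component "break" of "no_break_space" in parseCharName\n'
)

logger.setLevel(logging.WARNING)  # debug-level messages are dropped
log_stderr(sample)
suppressed_count = len(handler.messages)

logger.setLevel(logging.DEBUG)  # debug-level messages are kept
log_stderr(sample)
shown_count = len(handler.messages)
```

With the logger level at WARNING nothing reaches the handler; lowering it to DEBUG lets both lines through, so the caller decides whether the Poppler warnings are visible.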

Feb 01 '23 21:02 g-rd

I've been thinking about adding a -q option for some time now (ever since I added some info prints, actually). It may be a good idea to combine it with pdftotext. I'm planning to work on that.
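
As a rough sketch of how such an option might be passed through to pdftotext (an assumption about one possible design, not the planned implementation): the `quiet` parameter and the helper names `build_cmd` and `to_text_quiet` are hypothetical, while `-q` is the existing pdftotext flag mentioned in the report.

```python
import shutil
import subprocess


def build_cmd(path: str, quiet: bool = False) -> list:
    """Build a pdftotext command line; `quiet` is a hypothetical option name."""
    cmd = ["pdftotext", "-layout", "-enc", "UTF-8"]
    if quiet:
        cmd.append("-q")  # pdftotext flag: don't print any messages or errors
    cmd += [path, "-"]  # "-" sends the extracted text to stdout
    return cmd


def to_text_quiet(path: str, quiet: bool = True) -> bytes:
    """Run pdftotext and return its raw stdout."""
    if shutil.which("pdftotext") is None:
        raise EnvironmentError("pdftotext not installed")
    result = subprocess.run(build_cmd(path, quiet), stdout=subprocess.PIPE, check=False)
    return result.stdout
```

Keeping the command construction in its own function means the flag handling can be checked without Poppler installed.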

Feb 03 '23 22:02 rmilecki