PyMuPDF Improvements to the logging logic

This PR makes the following improvements to the logging logic of the fitz module:

Removes some commented-out print() / log() call, which are no longer necessary.
Logs by default to stderr.
Replaces any print() / sys.stdout.write() / sys.stderr.write() / PySys_Write* calls with just our log() helper.
Fixes a bug that unexpectedly made sys.stdout point to sys.stderr.

Fixes #3135

Feb 06 '24 21:02 apyrgio

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

Feb 06 '24 21:02 github-actions[bot]

I have read the CLA Document and I hereby sign the CLA

Feb 06 '24 21:02 apyrgio

I see that the last error that the CI test fails with is:

  *** Error in `/tmp/tmp.mJ3LG94vml/venv/bin/python': corrupted double-linked list: 0x0000000003d911e0 ***

This looks unrelated to my PR, but it's worth checking it locally as well. Thanks for your patience, I'll see that the tests pass locally and I'll update this PR.

In the meantime, it would help if you could validate the original assumption of this PR, that logging/warnings should be logged only to stderr. I'm asking because there may be historic reasons why this is not the case, and it's good to know before diving in.

Feb 07 '24 14:02 apyrgio

Thanks for this.

I see that the last error that the CI test fails with is:
  *** Error in `/tmp/tmp.mJ3LG94vml/venv/bin/python': corrupted double-linked list: 0x0000000003d911e0 ***
This looks unrelated to my PR, but it's worth checking it locally as well. Thanks for your patience, I'll see that the tests pass locally and I'll update this PR.

Apologies for this, it looks like there's a recent regression in MuPDF master that is breaking PyMuPDF tests. We're investigating things at the moment.

In the meantime, it would help if you could validate the original assumption of this PR, that logging/warnings should be logged only to stderr. I'm asking because there may be historic reasons why this is not the case, and it's good to know before diving in.

There are some subtleties in the way we output diagnostics with print() and log(). Basically print() is for user messages, and log() is for debugging and internal development. This is why log() always prefixes with file:line:fnname, which is useful for development, but generally not appropriate for user messages. [Having said that, removing the handful of calls to PySys_Write*() would make sense.]

So i think we should retain the current distinction between print() and log().

Some other comments:

Regarding white space in blank lines, personally i like to retain them, but i suspect i'm in a tiny minority here!
Commented-out log() invocations and print() calls can be useful when developing and investigating bugs, as they are already set up ready to show useful info. However we should probably not have commented-out print() calls - if they could be useful, they should be changed to commented-out calls to log().
PyMuPDF is not used in pipelines so i think the stdout/stderr distinction doesn't actually matter. Personally in this case i prefer to use stdout for everything, to simplify sending all output to a file (directly or with tee). And maybe more importantly it avoids buffering/flushing problems that can result in unexpected ordering in the output; this can be really confusing when debugging.

Feb 07 '24 22:02 julian-smith-artifex-com

Apologies for this, it looks like there's a recent regression in MuPDF master that is breaking PyMuPDF tests. We're investigating things at the moment.

Cool, thanks for letting me know :slightly_smiling_face:

There are some subtleties in the way we output diagnostics with print() and log(). Basically print() is for user messages, and log() is for debugging and internal development. [...] So i think we should retain the current distinction between print() and log().

Got it, I had the same suspicion as well.

PyMuPDF is not used in pipelines so i think the stdout/stderr distinction doesn't actually matter.

Oh, is this a conscious decision of the project, or is it because there was never a use case for it? In our case (Dangerzone), we do use PyMuPDF as part of a pipeline. We convert a document to pixels in one container, print those pixels to stdout, and then we reconstruct the document in a second container from these pixels. We do this for PDF sanitization.

Personally in this case i prefer to use stdout for everything, to simplify sending all output to a file (directly or with tee). And maybe more importantly it avoids buffering/flushing problems that can result in unexpected ordering in the output; this can be really confusing when debugging.

I definitely agree with having a single stream for these messages, and I've experienced the buffering problems you describe when this is not the case. That's why in this PR, I tried to switch everything to one stream, stderr, which was used already in some cases.

However, I'm not particularly hung up on the way maintainers prefer to print debug logs. The actual reason I'm sending this PR is because a certain print statement is currently interleaved with our own messages:

https://github.com/pymupdf/PyMuPDF/blob/d20c4e76b6f404d0692155006ccf810709f71aa5/src/init.py#L9636

So, I can close this PR and send a less opinionated one, that simply:

Comments out that particular log line
Fixes the unexpected stdout -> stderr redirection (70b069be07b2b84f4e1b7d7c073c588b03c232ce)

Would this be more preferable?

Feb 08 '24 20:02 apyrgio

So, I can close this PR and send a less opinionated one, that simply:

Comments out that particular log line

Fixes the unexpected stdout -> stderr redirection (https://github.com/pymupdf/PyMuPDF/commit/70b069be07b2b84f4e1b7d7c073c588b03c232ce)

Would this be more preferable?

Kind reminder about the above. Would you prefer to close this PR and send a smaller one, with the above points fixed?

Mar 04 '24 13:03 apyrgio

Apologies for the delay in replying, and thanks for the reminder and your understanding.

I think your suggestion of a simplified PR would be great at this stage.

I'll discuss internally about your use of PyMuPDF in a pipeline. If you're doing it, it needs to be supported. Perhaps we could control where logs/diagnostics go using environment variables, as i can't see a single solution that will address the different concerns that we've been discussing here.

Thanks.

Mar 04 '24 14:03 julian-smith-artifex-com