Compressed pdf larger than original?
$ pdfly compress in.pdf out.pdf
Original Size : 1,996,123
Compressed Size: 2,014,972 (100.9% of original)
How is this possible?
Please complete this report with test code, an input file, and an output file. As it stands, we cannot do any review.
I cannot provide the input and output files as they contain sensitive personal information. Just try it out with some PDFs on your computer and you'll see that the compress command is broken.
I am having the same issue with multiple pdf files.
$ pdfly compress Lockhart_2002_-_A_Mathematician\'s_Lament.pdf Lockhart_compressed.pdf
Ignoring wrong pointing object 0 0 (offset 0)
Ignoring wrong pointing object 91 0 (offset 0)
Ignoring wrong pointing object 93 0 (offset 0)
Original Size : 400,277
Compressed Size: 418,320 (104.5% of original)
Lockhart_2002_-_A_Mathematician's_Lament.pdf Lockhart_compressed.pdf
Another example:
$ pdfly compress Example_form.pdf Output.pdf
Original Size : 95,569
Compressed Size: 103,325 (108.1% of original)
Strangely, compressing the output of this form a second time reduces the size, although it is still larger than the original:
$ pdfly compress Output.pdf Out2.pdf
Original Size : 103,325
Compressed Size: 98,634 (95.5% of original)
These cases are possible. The command applies lossless compression to streams, but other techniques, such as building object streams, could reduce the size too. However, pypdf currently has no capability to build such streams or to define a strategy for compressing them. The only easy solution I can currently imagine would be to write the output into a stream, compare sizes, and, if the result is larger than the original, just return the original file. If this sounds good to you, do not hesitate to propose a PR.
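To illustrate why a lossless pass can make a file grow, here is a minimal sketch using Python's standard `zlib` module. It is not pdfly or pypdf code. The random bytes are only a stand-in for a stream that is already compressed (for example, embedded JPEG data), and the repeated text is a stand-in for an uncompressed content stream.

```python
import os
import zlib

# Stand-in for an already-compressed stream: random bytes have no redundancy
# left for DEFLATE to exploit.
incompressible = os.urandom(100_000)

# Stand-in for an uncompressed, highly repetitive content stream.
repetitive = b"BT /F1 12 Tf (Hello) Tj ET\n" * 4_000

packed_incompressible = zlib.compress(incompressible, 9)
packed_repetitive = zlib.compress(repetitive, 9)


def percent_of_original(before: bytes, after: bytes) -> float:
    """Return the size of `after` as a percentage of `before`."""
    return 100.0 * len(after) / len(before)


# The first ratio ends up slightly above 100% (container overhead with no gain),
# the second far below 100%.
print(f"incompressible: {percent_of_original(incompressible, packed_incompressible):.2f}%")
print(f"repetitive:     {percent_of_original(repetitive, packed_repetitive):.2f}%")
```

A PDF whose size is dominated by streams of the first kind has little to gain from another lossless pass, so any overhead added when the file is rewritten can push the total above the original.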
> The only easy solution I could currently imagine would be to write the output into a stream, compare sizes, and if greater than the original just return the original file. If this sounds good to you, do not hesitate to propose a PR
I think that's a good idea 🙂
We could also indicate the final file size and the compression ratio (80%? 10%?) in the `pdfly compress` output.
A PR implementing this would be welcome 🙂
This may be scope creep, but I was hoping there was an option for lossy compression (JPEG?) of any embedded images in addition to the lossless PDF compression. That would help get the file size down for online forms with strict size requirements, especially if there are a lot of pictures.
Hey @Lucas-C!
I've implemented a fix for the compression issue where files could end up larger than the original. #173
Approach: Added size comparison logic that writes to memory first, compares compressed vs original size, and keeps the smaller version. Also fixed metadata preservation during compression.
Improvements:
- ✅ Prevents file size increases (addresses the 104.5% issue)
- ✅ Better user feedback with compression metrics
- ✅ Preserves PDF metadata (title, author, etc.)
- ✅ Comprehensive test coverage
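
For readers following along, the approach described above (write to memory first, compare sizes, keep the smaller version) can be sketched roughly as follows. This is not the code from the PR: `compress_keep_smaller`, `report`, and `zlib_writer` are hypothetical names, and `zlib` is only a stand-in for the real PDF compression step. The report layout imitates the output quoted earlier in this thread.

```python
import io
import os
import zlib
from typing import Callable, Tuple


def compress_keep_smaller(
    original: bytes,
    compress: Callable[[bytes, io.BytesIO], None],
) -> Tuple[bytes, bool]:
    """Run `compress` into an in-memory buffer and keep the smaller result.

    Returns the bytes to write and a flag telling whether the compressed
    version was actually used.
    """
    buffer = io.BytesIO()
    compress(original, buffer)
    candidate = buffer.getvalue()
    if len(candidate) < len(original):
        return candidate, True
    # Compression did not help: fall back to the untouched original.
    return original, False


def report(original_size: int, final_size: int) -> str:
    """Format the sizes in the style of the output quoted in this thread."""
    ratio = 100.0 * final_size / original_size
    return (
        f"Original Size  : {original_size:,}\n"
        f"Final Size     : {final_size:,} ({ratio:.1f}% of original)"
    )


def zlib_writer(data: bytes, out: io.BytesIO) -> None:
    """Hypothetical stand-in for the real compression step."""
    out.write(zlib.compress(data, 9))


if __name__ == "__main__":
    for sample in (b"abc" * 10_000, os.urandom(50_000)):
        result, used = compress_keep_smaller(sample, zlib_writer)
        print(report(len(sample), len(result)), "| compressed kept:", used)
```

With this guard the final size can never exceed the original, at the cost of holding the candidate output in memory before anything is written to disk.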
Could you please assign a Hacktoberfest label to my PR? Thanks!