Ipe 7.2.24 produces significantly larger files than 7.1.7

Open ByteHamster opened this issue 4 years ago • 5 comments

I have some pdf files that were created with Ipe 7.1.7. When I open these files in Ipe 7.2.24 and save them without modifications, their file size gets significantly larger. ~This file~ (removed, see other file below), for example, grows by a factor of more than 5: from 115 KB to 630 KB. The behavior is the same on Arch Linux and MacOS.

When then including that file in a LaTeX beamer presentation, the increased size becomes a rather big problem. I noticed this because my presentation suddenly went from 3 MB to 70 MB - just by saving one Ipe figure (different slides of the beamer presentation show different pdf pages of the Ipe figure).

My workaround is to run ghostscript on the presentation after compiling: gs -q -sDEVICE=pdfwrite -o presentation-size-fix.pdf presentation.pdf. With that 70 MB file, running the ghostscript command takes about 1-2 minutes (compared to about 2 seconds with the old Ipe image), making it rather hard to work with the presentation.
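For anyone who wants to script this step, here is a minimal sketch that wraps the same Ghostscript call and reports the size before and after. The function names are made up for illustration, and it assumes `gs` is on the PATH; it uses only the flags quoted above.

```python
import os
import subprocess


def build_gs_command(src: str, dst: str) -> list:
    """Build the Ghostscript call from the workaround for the given paths."""
    # Same flags as in the command above: quiet mode, pdfwrite device, output file.
    return ["gs", "-q", "-sDEVICE=pdfwrite", "-o", dst, src]


def shrink_pdf(src: str, dst: str) -> tuple:
    """Run Ghostscript on src and return (size before, size after) in bytes."""
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(src, dst), check=True)
    return os.path.getsize(src), os.path.getsize(dst)
```

Calling `shrink_pdf("presentation.pdf", "presentation-size-fix.pdf")` after compiling would then do the same as the manual command.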

Do you have an idea why saving the Ipe file with a more recent version increases the file size that much?

ByteHamster avatar Jan 07 '22 18:01 ByteHamster

Between 7.1.7 and 7.2.24, Ipe switched to a much more general method for including the PDF resources from the pdflatex output. And for some reason, pdflatex already produces a rather large file (278 kB, larger than the Ipe 7.1.7 version of the entire document).

How do you include the pages in your beamer document? pdflatex should be smart enough that when you include various pages using \includegraphics[page=xx], then it should not make duplicates of the PDF resources from the included file for each page. On the other hand, if you export individual pages and then include those, you very quickly blow up the file size (related to issue #193).

otfried avatar Jan 07 '22 19:01 otfried

Thank you for your reply. I do use \includegraphics<xx>[page=xx]{filename.pdf}, all from a single pdf file. Apparently, pdflatex sometimes does produce duplicates of pdf resources. I have another file for you where the effect is even more extreme.

Steps to reproduce using this zip: https://drive.google.com/file/d/1XT9WGbGXYAelK3188hV7Mk1k0kS--yW9/view?usp=sharing

  • Note that image.pdf is 200 kB
  • Run pdflatex demo.tex
  • demo.pdf is 215 kB

Clean up the files LaTeX generated. Now, open image.pdf in Ipe and save it without modification.

  • Note that image.pdf is now 4 MB (that's 20 times larger)
  • Run pdflatex demo.tex
  • demo.pdf is 41 MB (that's 190 times larger)

$ pdflatex --version
pdfTeX 3.141592653-2.6-1.40.22 (TeX Live 2021/Arch Linux)
kpathsea version 6.3.3
Copyright 2021 Han The Thanh (pdfTeX) et al.
There is NO warranty.  Redistribution of this software is
covered by the terms of both the pdfTeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX source.
Primary author of pdfTeX: Han The Thanh (pdfTeX) et al.
Compiled with libpng 1.6.37; using libpng 1.6.37
Compiled with zlib 1.2.11; using zlib 1.2.11
Compiled with xpdf version 4.03

ByteHamster avatar Jan 07 '22 20:01 ByteHamster

Would it be possible to have Ipe execute that ghostscript command¹ after building the document, but before embedding its own data? Then the "more general method for including the PDF resources from the pdflatex output" can be kept while still producing output files with a more reasonable size.

¹ gs -q -sDEVICE=pdfwrite -o presentation-size-fix.pdf presentation.pdf

ByteHamster avatar Jan 19 '22 10:01 ByteHamster

I was convinced that pdflatex is smart enough not to duplicate resources when you include multiple pages from the same document, the reason being that I wrote that code for pdflatex in 2001. It turns out that this does not work anymore, at least not when used the standard way through \includegraphics. That explains why demo.pdf is so gigantic: it has all the fonts and all the XForm objects from image.pdf 28 times.

However, there is a simple trick:

If you modify your file demo.tex to start like this:

\documentclass{beamer}
\pdfximage{image.pdf}   %% this is the new line
\begin{document}
\begin{frame}{Test}
\includegraphics<1>[page=1,width=0.9\textwidth]{image.pdf}%
\includegraphics<2>[page=2,width=0.9\textwidth]{image.pdf}%
\includegraphics<3>[page=3,width=0.9\textwidth]{image.pdf}%
...

then the duplication of resources does not happen. You can easily check that the result has each font only once.
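One possible way to do that check, sketched under assumptions that are not from this thread: the `pdffonts` tool (shipped with Poppler and Xpdf) prints a two-line header followed by one row per font object, so a font that was copied several times should show up as several rows with the same name. The helper names below are hypothetical.

```python
import subprocess
from collections import Counter


def count_font_names(pdffonts_output: str) -> Counter:
    """Count how often each font name appears in the text printed by pdffonts."""
    counts = Counter()
    # Assumption: the first two lines are the column header and a dashed separator.
    for line in pdffonts_output.splitlines()[2:]:
        fields = line.split()
        if fields:
            # The font name is the first whitespace-separated column.
            counts[fields[0]] += 1
    return counts


def duplicated_fonts(pdffonts_output: str) -> dict:
    """Return only the font names that are listed more than once."""
    return {name: n for name, n in count_font_names(pdffonts_output).items() if n > 1}


def check_pdf(path: str) -> dict:
    """Run pdffonts on a PDF file (must be installed) and report duplicated fonts."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return duplicated_fonts(result.stdout)
```

With this, `check_pdf("demo.pdf")` should come back empty once the \pdfximage line is in place, and list the repeated fonts without it.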

This doesn't change that image.pdf is very large. You have used about 4600 separate text objects, most of them containing only a single letter - and Ipe makes a PDF XForm object for each of these, which is a lot of overhead for what should be a single letter. The reason the file was much smaller in earlier versions of Ipe is that Ipe then simply included the PDF stream for the XForm inside the page stream. That works only for simple text; it would make it impossible to use TikZ, \includegraphics, or other interesting stuff inside text objects. I will have to take a closer look at what I can do to improve this. Perhaps Ipe can detect when it is safe to include the form literally and optimize the output then. It should also detect when text objects are identical and reuse the XForm then.
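To illustrate the last idea only: a rough sketch of how identical text objects could share one XForm by hashing their content. This is not Ipe's code or data model; `TextObject`, its fields, and `assign_xforms` are invented for the example, and a real implementation would have to hash everything that influences the rendered result.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TextObject:
    """Hypothetical stand-in for a text object; not Ipe's real data model."""
    source: str      # LaTeX source of the label
    style: str = ""  # anything else that affects rendering (size, color, ...)


def assign_xforms(objects):
    """Map each text object to an XForm index, one XForm per distinct content."""
    key_to_index = {}
    assignment = []
    for obj in objects:
        # Identical style and source give the same key and thus the same XForm.
        key = hashlib.sha256(f"{obj.style}\x00{obj.source}".encode("utf-8")).hexdigest()
        index = key_to_index.setdefault(key, len(key_to_index))
        assignment.append(index)
    return assignment, len(key_to_index)
```

For a figure with thousands of single-letter labels, the number of distinct keys would be far smaller than the number of text objects, which is where the saving would come from.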

Running ghostscript basically parses the entire document and renders it into a PDF writer. In this case, that eliminates the overhead of the PDF XForms - but it's not always the appropriate thing to do (in other cases you would duplicate the contents of XForms, leading to files that are actually larger), and I'm not sure if it can actually handle documents with links and named objects.

otfried avatar Jan 24 '22 22:01 otfried

Thank you for looking into this!

If you modify your file demo.tex to start like this [...] then the duplication of resources does not happen.

I can confirm that this reduces the file size of my "production" files (not test file) from 70MB to 8MB. The GhostScript command from above still brings it down to about 3MB, so I will probably leave that one in the makefile. (Also, I don't think my colleagues will remember to add the pdfximage "import" for every changed image). Unfortunately, the GhostScript command still takes a very long time even with the pdfximage workaround - so it would still be great if both workarounds would not be necessary :)

Perhaps Ipe can detect when it is safe to include the form literally and optimize the output then. It should also detect when text objects are identical and reuse the XForm then.

That sounds awesome! Thank you!

ByteHamster avatar Jan 25 '22 21:01 ByteHamster