
Direct export into more formats

Open danijar opened this issue 6 years ago • 12 comments

The output format is currently HTML, which can then be printed as PDF from the browser. It could be nice to directly export to PDF, ipynb and other formats.

danijar avatar Aug 05 '19 18:08 danijar

As for PDF I think we need a little roadmap here as there are tradeoffs. I tried weasyprint, xhtml2pdf and wkhtmltopdf (or pdfkit, which is the wrapper for the same engine).

  • weasyprint and xhtml2pdf are easily installable via pip, so they can be a dependency of the Python project, but they garble the output with the current CSS. Some dark magic needs to be applied to the HTML styling for the current HTML to render well in these engines.

  • wkhtmltopdf/pdfkit needs a binary to be installed (or an apt-get call), but the pdf looks decent:

[image]

  • There are different routes with pandoc, but it is also a hard dependency.

  • There may be other options I did not try.

We can also put a little section "what you can do next with html" in README or docs and let the pdf export wait until a better choice emerges.

Pagination in the export may be a barrier; the Handout class might need a method for inserting a page break.
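A page-break hook could be very small on the HTML side. The sketch below is hypothetical: `page_break_html` is a name invented here, not part of handout. It relies on the CSS `break-after: page` property together with its legacy alias `page-break-after: always`, which the various print engines honour to varying degrees.

```python
def page_break_html():
    """Return an empty div that asks print engines to start a new page."""
    # Both the legacy and the current CSS property are set, since
    # support differs between PDF engines.
    return '<div style="page-break-after: always; break-after: page;"></div>'


snippet = page_break_html()
```

A Handout method for page breaks could append such a snippet as an HTML block; other output formats would need their own equivalent.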

For discussion of tools I referred to manubot. They have a fallback procedure that generates the PDF with whichever tools are available on the system (athena is there in Docker).
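A fallback of this kind could look roughly like the sketch below. It is not manubot's code: the function names and the preference order (wkhtmltopdf first, since its output looked decent, then weasyprint) are assumptions made for illustration. Only two engines are probed to keep it short.

```python
import importlib.util
import shutil
import subprocess


def _module_available(name):
    # True if the Python package can be imported in this environment.
    return importlib.util.find_spec(name) is not None


def _binary_available(name):
    # True if an executable with this name is on PATH.
    return shutil.which(name) is not None


def find_pdf_engine(has_module=_module_available,
                    has_binary=_binary_available):
    """Return the name of the first usable PDF engine, or None."""
    if has_binary('wkhtmltopdf'):
        return 'wkhtmltopdf'
    if has_module('weasyprint'):
        return 'weasyprint'
    return None


def html_to_pdf(html_path, pdf_path):
    """Convert an HTML file to PDF with whichever engine is installed."""
    engine = find_pdf_engine()
    if engine == 'wkhtmltopdf':
        subprocess.run(['wkhtmltopdf', html_path, pdf_path], check=True)
    elif engine == 'weasyprint':
        from weasyprint import HTML  # Imported lazily: optional dependency.
        HTML(filename=html_path).write_pdf(pdf_path)
    else:
        raise RuntimeError('No PDF engine found; print the HTML from a browser.')
    return engine
```

The probe functions are passed as parameters only so that the selection logic can be tested without installing either tool.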

The relation of tools to engines is as below:

  • weasyprint - own?
  • xhtml2pdf - ReportLab
  • wkhtmltopdf/pdfkit - WebKit

epogrebnyak avatar Aug 06 '19 00:08 epogrebnyak

I thought a bit about exporting into more formats. There will be some formats that HTML can easily be converted to and some that can't.

Converting from HTML won't work if the user is interested in the exported source rather than just the rendered result. This is the case for e.g. LaTeX which you may want to include in a larger LaTeX document and keep editing. Another example could be exporting to Jupyter notebooks.

For this, we need to support multiple exporters. HTML, LaTeX, and ipynb come to mind for which this would be useful. Display formats like PDF are easy to generate from any of the three. To convert to LaTeX, we'll need a Markdown to LaTeX converter to format multi-line comments.
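For the ipynb case, a notebook is a JSON document, so a first exporter may not need any extra dependency. The sketch below assumes the input is already a list of (kind, source) pairs; `to_notebook` and that input shape are hypothetical and not part of handout. The keys follow the nbformat 4 layout.

```python
import json


def to_notebook(cells):
    """Build nbformat 4 JSON from (kind, source) pairs.

    kind is either 'markdown' or 'code'.
    """
    nb_cells = []
    for kind, source in cells:
        cell = {'cell_type': kind, 'metadata': {}, 'source': source}
        if kind == 'code':
            # Code cells additionally require these two keys.
            cell['execution_count'] = None
            cell['outputs'] = []
        nb_cells.append(cell)
    notebook = {
        'cells': nb_cells,
        'metadata': {},
        'nbformat': 4,
        'nbformat_minor': 2,
    }
    return json.dumps(notebook, indent=1)


text = to_notebook([
    ('markdown', '# Handout'),
    ('code', 'print("hello")'),
])
```

Multi-line comments would map to markdown cells and the code between them to code cells.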

I would like to keep all base features available for all output formats, e.g. we should robustly embed videos into LaTeX. Besides this, there will be doc.add_html(), doc.add_latex(), etc., which will just be no-ops when exporting to a different format.

It will take a little bit of planning to decide how to structure the code for this. For example, we may want to make the blocks (blocks.py) independent of the output format and then have one class per output type with methods that "visit" the blocks in the document.

class Exporter:
  def __init__(self, directory): ...
  def visit_comment(self, block): ...  # e.g. block.text
  def visit_text(self, block): ...  # e.g. block.text
  def visit_image(self, block): ...  # e.g. block.filename, block.width
  def visit_video(self, block): ...  # e.g. block.filename, block.width
  def save(self): ...

What do you think? Do you have other ideas how the code should be structured?

danijar avatar Aug 09 '19 19:08 danijar

I agree with the sequence of formats: there are 'immediate output' formats like html, ipynb and latex, and display formats like pdf, which are produced by processing either html or latex.

As I do not fully understand the Exporter class above yet, let me elaborate in prose on the program structure:

  1. we have some source of the report, which consists of the .py script body and calls to add_something() functions
  2. currently the add_something() calls create html on the fly, as specified by the blocks.py classes
  3. we want to support different export formats (html, latex, ipynb, maybe varieties of markdown)
  4. we may want to allow 'scriptless'/'interactive' use of the Handout class as in #25
  5. source parts may need different preprocessing depending on export format (a lot of that happens in jupyter).

Source script processing and add_x() calls should result in a list of blocks holding Message, Text, Image, Video, Code instances. These blocks would hold just the content such as text, filename, display parameters.

Then there is a render_html(), render_latex(), render_markdown() function or method that converts each block type into a new format.

Finally there is functionality that assembles the converted blocks into an html, latex, ipynb or markdown document.


# Handout class is exposed to the user: 
# - the user inits a handout in a script to display script comments and code in output
# - the user adds elements as images and video to the output inside a script
# - alternatively, the user plays with instance in interactive session, 
#   just using the add_x() methods 

class Handout:
    def __init__(self, directory, title, interactive=False):
        pass

# blocks represent report contents units
# they hold values and display configurations
# maybe blocks can be dataclasses, to make the constructors cleaner
 
class Block:
    pass

class Message(Block):
    pass

class Text(Block):  # this is for multiline comments
    pass

class Image(Block):
    pass

class Video(Block):
    pass

class Code(Block):
    pass

# something is done to produce internal representation of the document
# as a list of blocks. this is what Handout class does now, but it is tightly 
# bundled with html output

class Document:
    title: str
    blocks: list  # list of Block instances

# document can be exported to different formats

def to_html(doc: Document) -> str:
    pass

def to_latex(doc: Document) -> str:
    pass

def to_notebook(doc: Document) -> str:
    pass

def to_markdown(doc: Document) -> str:
    pass
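To make the outline above concrete, here is a minimal runnable sketch of one of the `to_x()` functions. It assumes dataclass blocks and uses `functools.singledispatch` to pick a renderer per block type; the field names, the helper `block_to_markdown` and the output layout are illustrative assumptions, not handout's actual code.

```python
from dataclasses import dataclass, field
from functools import singledispatch
from typing import List


class Block:
    pass


@dataclass
class Text(Block):
    string: str


@dataclass
class Image(Block):
    filename: str


@dataclass
class Document:
    title: str
    blocks: List[Block] = field(default_factory=list)


@singledispatch
def block_to_markdown(block):
    # Fallback for block types without a registered renderer.
    raise TypeError(f'No Markdown rendering for {type(block).__name__}')


@block_to_markdown.register(Text)
def _(block):
    return block.string


@block_to_markdown.register(Image)
def _(block):
    return f'![]({block.filename})'


def to_markdown(doc: Document) -> str:
    # Title first, then one paragraph per block.
    parts = [f'# {doc.title}']
    parts.extend(block_to_markdown(block) for block in doc.blocks)
    return '\n\n'.join(parts) + '\n'


doc = Document('Report', [Text('Some text.'), Image('plot.png')])
markdown = to_markdown(doc)
```

With this layout the blocks stay free of any format-specific code, and each new format adds one dispatch function plus one assembling function.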

epogrebnyak avatar Aug 10 '19 13:08 epogrebnyak

As a sidenote the role of markdown is still to be discussed:

  • in some workflows people might want the markdown output to be embedded in larger markdown documents, e.g. to make it part of a README file
  • in some cases, exporting to Sweave/Pweave/jupytext markdown reports may be desired (raised in https://github.com/danijar/handout/issues/12, for example).

We can start with the simplest type of markdown.

epogrebnyak avatar Aug 10 '19 14:08 epogrebnyak

Thanks for your example. What I had in mind is the visitor pattern, which seems like a better solution to me. What do you think?

# The user API gets a new constructor argument:
class Handout:
  def __init__(self, directory, title='Handout', format='html', source=True):
    self._blocks = []  # Blocks added so far.
  def add_text(self, string): ...  # Add Text block.
  def add_image(self, tensor, width=None, format='png'): ...  # Save to disk and add Image block.
  def add_html(self, string): ...  # Add HTML block.
  def show(self): ...  # Iterate over blocks and call the corresponding exporter methods.
  def _find_source(self): ...  # Find user's Python source; can be extracted out some day.

# Blocks are independent of output format:
from collections import namedtuple
Text = namedtuple('Text', 'string')
Image = namedtuple('Image', 'filename, width')
HTML = namedtuple('HTML', 'string')

# Exporters are visitors:
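class HTMLExporter:
  def __init__(self, directory, source, title):
    self.lines = []  # Output lines collected by the visit methods.
  def visit_text(self, text): ...  # Add lines to self.lines.
  def visit_image(self, image): ...
  def visit_html(self, html): ...
  def save(self): ...  # Save lines to index.html.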

class LaTeXExporter:
  def __init__(self, directory, source, title): ...
  def visit_text(self, text): ...
  def visit_image(self, image): ...
  def visit_html(self, html): ...  # No-op.
  def save(self): ...
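A small runnable version of this visitor layout might look as follows. The shared `Exporter` base class, the `visit()` dispatch by lower-cased type name, the `render()` method and the exact output strings are assumptions added for illustration; the thread only specifies the `visit_*` methods and the namedtuple blocks.

```python
import html
from collections import namedtuple

Text = namedtuple('Text', 'string')
Image = namedtuple('Image', 'filename, width')
HTML = namedtuple('HTML', 'string')


class Exporter:
    def __init__(self):
        self.lines = []

    def visit(self, block):
        # Dispatch on the block type, e.g. Text -> visit_text.
        method = getattr(self, 'visit_' + type(block).__name__.lower())
        method(block)

    def render(self, blocks):
        for block in blocks:
            self.visit(block)
        return '\n'.join(self.lines)


class HTMLExporter(Exporter):
    def visit_text(self, text):
        self.lines.append('<p>' + html.escape(text.string) + '</p>')

    def visit_image(self, image):
        self.lines.append(
            f'<img src="{image.filename}" width="{image.width}">')

    def visit_html(self, block):
        self.lines.append(block.string)


class LaTeXExporter(Exporter):
    def visit_text(self, text):
        self.lines.append(text.string)  # No LaTeX escaping in this sketch.

    def visit_image(self, image):
        self.lines.append(r'\includegraphics{' + image.filename + '}')

    def visit_html(self, block):
        pass  # Raw HTML has no LaTeX counterpart, so it is skipped.


blocks = [Text('Hello'), Image('plot.png', 400), HTML('<hr>')]
html_out = HTMLExporter().render(blocks)
latex_out = LaTeXExporter().render(blocks)
```

The same list of blocks feeds both exporters, and the raw HTML block simply disappears from the LaTeX output.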

One question is where to specify the export type. It shouldn't be in show() since that is often called many times. It could be in the Handout constructor, but that means you have to run the script multiple times to export into multiple formats. However, I think this might be fine. The constructor could accept a list of output formats if this is really a use case.

By the way, do you have a preference for how to name the exporters? I can think of exporter, output, target, backend

danijar avatar Aug 10 '19 17:08 danijar

The visit_something() methods seem quite redundant to me. For testability, one would apparently need to do quite a lot of setup in this design just to check that the program converts a given block type correctly.

from dataclasses import dataclass

class Block:
    pass

@dataclass
class Message(Block):
    string: str

    def html(self):
        return '<pre class="message">' + self.string + '</pre>'

    def markdown(self):
        return self.string

    def latex(self):
        pass

assert Message('Some text').html() == '<pre class="message">Some text</pre>'

This way we keep the data and the block conversion functions close together, which makes testing much easier.

Later in the code you can have a visitor class that assembles the full html document or a body of latex or an ipynb file.

class LaTeXExporter:
  def __init__(self, directory, blocks, title): ...
  def render(self): ...
  def save(self): ...
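Filled in, this variant could look like the sketch below: each block renders itself and the exporter only assembles and writes the document. The LaTeX preamble, the `verbatim` rendering of messages and the `index.tex` filename are assumptions for illustration, and no LaTeX escaping is attempted.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class Text:
    string: str

    def latex(self):
        return self.string  # Unescaped; a real exporter would escape this.


@dataclass
class Message:
    string: str

    def latex(self):
        return (r'\begin{verbatim}' + '\n' + self.string + '\n'
                + r'\end{verbatim}')


class LaTeXExporter:
    def __init__(self, directory, blocks, title):
        self.directory = directory
        self.blocks = blocks
        self.title = title

    def render(self):
        # Assemble the preamble, the rendered blocks and the closing line.
        parts = [
            r'\documentclass{article}',
            r'\title{' + self.title + '}',
            r'\begin{document}',
            r'\maketitle',
        ]
        parts.extend(block.latex() for block in self.blocks)
        parts.append(r'\end{document}')
        return '\n'.join(parts) + '\n'

    def save(self):
        path = os.path.join(self.directory, 'index.tex')
        with open(path, 'w') as f:
            f.write(self.render())
        return path


with tempfile.TemporaryDirectory() as directory:
    exporter = LaTeXExporter(
        directory, [Text('Intro.'), Message('loss: 0.1')], 'Report')
    path = exporter.save()
    with open(path) as f:
        saved = f.read()
```

Each block method can then be checked in isolation, as in the `Message(...).html()` assertion, while the exporter is tested separately for assembly and file output.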

epogrebnyak avatar Aug 10 '19 21:08 epogrebnyak

class Handout:
  __init__(directory, title='Handout', format='html', source=True)

What does source=True mean? A more descriptive flag name would be better.

epogrebnyak avatar Aug 10 '19 21:08 epogrebnyak

As for show(): we are considering this a fixed part of the API, right? I remember there was a discussion or a change about .save() vs .show(). Once show() is fixed, format="html" is OK for the constructor.

As an extra feature we can add save_html(), save_latex(), etc. methods to the Handout class.

epogrebnyak avatar Aug 10 '19 21:08 epogrebnyak

The constructor could accept a list of output formats if this is really a use case.

If we provide a save_x() family of methods, the same Handout instance can be used several times, if that fits the workflow, when the user wants both an html and a latex file, for example.
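A rough sketch of such a save_x() family is below. It is a stripped-down stand-in for the real Handout class: it holds text blocks only, and the method bodies, the `_write` helper and the output filenames are assumptions. Markdown is used instead of LaTeX as the second format just to keep the example short.

```python
import html
import os
import tempfile


class Handout:
    def __init__(self, directory, title='Handout'):
        self._directory = directory
        self._title = title
        self._blocks = []  # Text strings only, to keep the sketch short.

    def add_text(self, string):
        self._blocks.append(string)

    def save_html(self):
        body = '\n'.join(
            '<p>' + html.escape(block) + '</p>' for block in self._blocks)
        content = '<h1>' + html.escape(self._title) + '</h1>\n' + body + '\n'
        return self._write('index.html', content)

    def save_markdown(self):
        content = '\n\n'.join(['# ' + self._title] + self._blocks) + '\n'
        return self._write('index.md', content)

    def _write(self, filename, content):
        path = os.path.join(self._directory, filename)
        with open(path, 'w') as f:
            f.write(content)
        return path


with tempfile.TemporaryDirectory() as directory:
    doc = Handout(directory, title='Report')
    doc.add_text('Accuracy improved.')
    html_path = doc.save_html()
    markdown_path = doc.save_markdown()
    with open(markdown_path) as f:
        markdown = f.read()
    saved = sorted(os.listdir(directory))
```

One instance collects the blocks once and writes both files, so the script does not have to run again for the second format.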

epogrebnyak avatar Aug 10 '19 22:08 epogrebnyak

By the way, do you have a preference for how to name the exporters? I can think of exporter, output, target, backend

'Exporter' seems quite natural; I think it stresses that we are doing a one-directional conversion. 'Backend' is closer to a render-only option without saving a file. 'output' and 'target' are too generic, I think.

epogrebnyak avatar Aug 10 '19 22:08 epogrebnyak

In addition to LaTeX export, it would make sense to export to Markdown. This might also be easier for users to further convert into other formats downstream.

danijar avatar Aug 24 '19 01:08 danijar

@danijar, what is inside the visit_html method? Does it have an advantage over using a block's own html() method?

epogrebnyak avatar Aug 24 '19 08:08 epogrebnyak