Feature request: add WARC output option
Hi @ll.
I have been using monolith more and more for webpage capture but couldn't find a way to make downloads in WARC format (as documented at https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/).
I believe such an option would greatly enhance the reach of monolith as a general purpose utility.
Anyway, thanks for your great work as it is. 😎
Hello, in the meantime you can maybe use https://github.com/steffenfritz/html2warc ?
Hi @midaspt,
I'm very glad to learn that you're finding use for monolith! WARC output should be simple to add, and I'll likely implement it around the same time as MHTML. Long story short, I'll make monolith first crawl the target document, download all assets into a store of sorts (a cache), and then build either a monolithic HTML, MHTML, or WARC file from that store. This way it won't require much redundant code, and the process will be essentially the same for every output format. The first step right now is to revamp the caching mechanism; I'll work on it ASAP.
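To make the plan a bit more concrete, here is a rough sketch of the crawl-then-build flow, written in Python for brevity and not taken from monolith's actual code. All names (`Asset`, `AssetStore`, `crawl`, `build_html`, `build_warc`) are hypothetical, `fetch` and `find_refs` are placeholders for real downloading and link extraction, and the WARC writer emits simplified `resource` records that have not been validated against the spec.

```python
import base64
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator


@dataclass(frozen=True)
class Asset:
    """One downloaded resource (hypothetical type for this sketch)."""
    url: str
    media_type: str
    body: bytes


class AssetStore:
    """In-memory cache keyed by URL, shared by every output builder."""

    def __init__(self) -> None:
        self._assets: Dict[str, Asset] = {}

    def __contains__(self, url: str) -> bool:
        return url in self._assets

    def __iter__(self) -> Iterator[Asset]:
        return iter(self._assets.values())

    def add(self, asset: Asset) -> None:
        self._assets[asset.url] = asset

    def get(self, url: str) -> Asset:
        return self._assets[url]


def crawl(
    root_url: str,
    fetch: Callable[[str], Asset],
    find_refs: Callable[[Asset], Iterable[str]],
) -> AssetStore:
    """Download the root document and everything it references, once each."""
    store = AssetStore()
    pending = [root_url]
    while pending:
        url = pending.pop()
        if url in store:
            continue  # already cached, do not fetch again
        asset = fetch(url)
        store.add(asset)
        pending.extend(find_refs(asset))
    return store


def build_html(store: AssetStore, root_url: str) -> str:
    """Inline every cached asset into the root document as a data URL."""
    html = store.get(root_url).body.decode("utf-8")
    for asset in store:
        if asset.url == root_url:
            continue
        data = base64.b64encode(asset.body).decode("ascii")
        html = html.replace(asset.url, f"data:{asset.media_type};base64,{data}")
    return html


def build_warc(store: AssetStore, date: str) -> bytes:
    """Write one simplified WARC 'resource' record per cached asset."""
    out = bytearray()
    for asset in store:
        headers = [
            "WARC/1.0",
            "WARC-Type: resource",
            f"WARC-Target-URI: {asset.url}",
            f"WARC-Date: {date}",
            f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
            f"Content-Type: {asset.media_type}",
            f"Content-Length: {len(asset.body)}",
        ]
        # Header block, blank line, payload, then two CRLFs to end the record.
        out += "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
        out += asset.body + b"\r\n\r\n"
    return bytes(out)
```

The point of the split is that `crawl` runs once and each builder only reads from the store, so adding another format (MHTML, for instance) would mean adding one more `build_*` function rather than another download path.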
Hi there @hugo-akaora, thank you for the link! It's in Python, but I'll use it as a reference; it seems like a straightforward format.
Cheers, Sunshine
Hello @snshn, nice if it can be implemented directly in monolith! <3
It would be really great to be able to output multiple formats at the same time :) I'll definitely use that feature!
+1 on MHTML. Been waiting for this feature for almost a full year.
BTW I have a very strong working understanding of MIME and am happy to provide advice or perspective. Feel free to reach out about this. For example, I wrote https://github.com/jchook/mime-php