How to download mbox archives directly over HTTP for lists not available on Gmane?

Open carlosparadis opened this issue 9 years ago • 3 comments

I am trying to find something that can download lists that are not on Gmane but are available as mbox; for instance, the Apache Software Foundation provides these for virtually all projects. Although I am finding dozens of parsers for an mbox folder, I have yet to find a script that downloads from HTTP to an mbox folder, as codeface/R/ml/download.r does for Gmane through nntp-pull.

@wolfgangmauerer do you have any ideas on this? I will post here if I find something in the meantime.

carlosparadis avatar Apr 16 '16 23:04 carlosparadis

On 17/04/2016 at 01:42, Carlos Andrade wrote:

> I am trying to find something that can download lists that are not on Gmane but are available as mbox; for instance, the Apache Software Foundation provides these for virtually all projects. Although I am finding dozens of parsers for an mbox folder, I have yet to find a script that downloads from HTTP to an mbox folder, as codeface/R/ml/download.r does for Gmane through nntp-pull.
>
> @wolfgangmauerer https://github.com/wolfgangmauerer do you have any ideas on this? I will post here if I find something in the meantime.

This problem cannot really be solved in general: there are many web frontends that expose mailing list archives, and a tool that downloads from all of them would need a scraper for every website. The transport protocol (HTTP) would be the only thing they share.

As for Storm, the project seems to be using mod_mbox, an Apache httpd module that provides a web frontend for mbox files. One option would be to use a web scraping framework to obtain a list of messages, and then use mod_mbox's ability to serve the raw text of single messages. The better alternative in this case, I guess, is to just ask the maintainers whether they can provide the mbox files directly.
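If an archive happens to expose whole mbox files at predictable URLs, the download step itself is small. The following is a minimal Python sketch, assuming (this is not confirmed anywhere in this thread) that one file per month is served at `<base>/<YYYYMM>.mbox`. The URL layout, the base URL, and all function names are hypothetical and would need to be adapted to the actual archive.

```python
import shutil
import urllib.request
from pathlib import Path


def month_range(start, end):
    """Yield 'YYYYMM' strings from start to end inclusive.

    Both arguments are (year, month) tuples.
    """
    year, month = start
    while (year, month) <= end:
        yield f"{year:04d}{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def mbox_urls(base_url, start, end):
    """Build one URL per month.

    ASSUMPTION: the archive serves files as '<base>/<YYYYMM>.mbox'.
    """
    base = base_url.rstrip("/")
    return [f"{base}/{ym}.mbox" for ym in month_range(start, end)]


def download_mboxes(base_url, start, end, dest_dir):
    """Fetch each monthly file into dest_dir and return the saved paths.

    Months that cannot be fetched are reported and skipped. A transfer
    that fails midway may leave a partial file behind.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in mbox_urls(base_url, start, end):
        target = dest / url.rsplit("/", 1)[1]
        try:
            with urllib.request.urlopen(url, timeout=30) as resp, \
                    open(target, "wb") as out:
                shutil.copyfileobj(resp, out)
        except OSError as err:  # URLError and HTTPError derive from OSError
            print(f"skipped {url}: {err}")
            continue
        saved.append(target)
    return saved
```

A call such as `download_mboxes("https://example.org/archive/some-list", (2016, 1), (2016, 4), "mbox")` (placeholder URL) would leave one file per month in the `mbox` folder, which is the kind of input the existing mbox parsers expect.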

Best regards, Wolfgang

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/siemens/codeface/issues/47

wolfgangmauerer avatar Apr 17 '16 10:04 wolfgangmauerer

Interesting. I think the only quick workaround available for Apache, then, is the MetricsGrimoire tool, which takes a URL to fetch from. Sadly, it populates a database instead of dumping the messages to a folder.

Thanks for clarifying!

carlosparadis avatar Apr 17 '16 12:04 carlosparadis

On 17/04/2016 at 14:01, Carlos Andrade wrote:

> Interesting. I think the only quick workaround available for Apache, then, is the MetricsGrimoire https://github.com/MetricsGrimoire/MailingListStats tool, which takes a URL to fetch from. Sadly, it populates a database instead of dumping the messages to a folder.

MetricsGrimoire supports (judging from a cursory glance) a very simple web scraper that can download all files that are linked from one main page; I would be astonished if this suffices for Storm. You could use wget -r for the same purpose.
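To illustrate what "download all files linked from one main page" amounts to, here is a rough Python sketch of the link-collection step only. It is not how MetricsGrimoire or wget implement it; the function and class names are made up for this example, and the `.mbox` suffix filter is an assumption about how the archive names its files.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def linked_files(page_url, html, suffix=".mbox"):
    """Return absolute URLs of links on the page that end in `suffix`.

    Relative links are resolved against page_url; duplicates are dropped
    while keeping the order in which links first appear.
    """
    collector = LinkCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        if absolute.endswith(suffix) and absolute not in urls:
            urls.append(absolute)
    return urls
```

The returned URLs could then be fetched one by one into a folder. This only covers a single index page; an archive that spreads its files over several levels of pages would need the recursion that `wget -r` provides.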

wolfgangmauerer avatar Apr 17 '16 16:04 wolfgangmauerer