
Investigate alternatives to saving page content as individual gzipped files

englehardt opened this issue 7 years ago • 4 comments

Basically, the cost of reading a bunch of small files from S3 is much too high (see: http://garrens.com/blog/2017/11/04/big-data-spark-and-its-small-files-problem/).

We should think about how we can batch content saves during the crawl, or batch the individual files together post-crawl. Ideally, we'd preserve the ability to retrieve the content of random individual files relatively easily.

englehardt • Dec 15 '18 00:12
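A minimal sketch of the post-crawl batching idea, assuming the crawled content sits locally as individual `<hash>.gz` files and that a single Parquet file keyed by content hash is an acceptable batch format; the directory layout, column names, and use of pyarrow here are illustrative assumptions rather than existing OpenWPM tooling:

```python
# Sketch only: combine many small gzipped content files into one Parquet
# file keyed by content hash, so analysis jobs open a single large object
# while individual documents remain retrievable by key.
# The "<hash>.gz" layout and column names are assumptions for illustration.
import gzip
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq


def batch_content(content_dir: str, output_file: str) -> None:
    keys, blobs = [], []
    for path in sorted(Path(content_dir).glob("*.gz")):
        keys.append(path.stem)                            # content hash used as the lookup key
        blobs.append(gzip.decompress(path.read_bytes()))  # raw page content
    table = pa.table({"content_hash": keys, "content": blobs})
    pq.write_table(table, output_file, compression="snappy")


if __name__ == "__main__":
    batch_content("crawl_content/", "content.parquet")
```

An analysis job can then read one Parquet object per batch instead of thousands of tiny files, and a single document is still recoverable by filtering on `content_hash`.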

I've updated the title to reflect that our current way of saving these files causes issues both while crawling and while analyzing the data. See https://github.com/mozilla/openwpm-crawler/issues/33 for a description of the issues while crawling.

We should consider both of these constraints when we investigate an alternative way to handle saved files.

englehardt • Oct 04 '19 23:10

Single document (html, js, jpg, ...) saved and looked up randomly by key.

vringar • Nov 12 '19 11:11

We may be able to use a key-value store like DynamoDB from Amazon (or whatever the equivalent is on GCP)?

englehardt • Nov 12 '19 11:11
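A hedged sketch of that key-value idea using DynamoDB through boto3; the `crawl_content` table name and `content_hash` partition key are hypothetical. One caveat worth noting: DynamoDB caps items at 400 KB, so very large documents would still need to live in object storage with only a pointer kept in the table.

```python
# Sketch only: store and fetch page content by content hash in DynamoDB.
# Table name and key schema are hypothetical; content is gzipped to keep
# items small (DynamoDB rejects items larger than 400 KB).
import gzip
from typing import Optional

import boto3

table = boto3.resource("dynamodb").Table("crawl_content")


def put_content(content_hash: str, content: bytes) -> None:
    table.put_item(Item={
        "content_hash": content_hash,
        "content": gzip.compress(content),
    })


def get_content(content_hash: str) -> Optional[bytes]:
    item = table.get_item(Key={"content_hash": content_hash}).get("Item")
    # boto3 returns binary attributes wrapped in a Binary object; .value is the raw bytes
    return gzip.decompress(item["content"].value) if item else None
```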

The GCP equivalents seem to be Bigtable and Firestore.

vringar • Nov 13 '19 11:11