Investigate alternatives to saving page content as individual gzipped files
Basically, the cost of reading a bunch of small files from S3 is much too high (see: http://garrens.com/blog/2017/11/04/big-data-spark-and-its-small-files-problem/).
We should think about how we can batch content at save time, or batch the individual files in a post-crawl step. Ideally, we'd preserve our ability to retrieve the content of an arbitrary individual file relatively easily.
I've updated the title to reflect that our current way of saving these files leads to issues both while crawling and while analyzing the data. See https://github.com/mozilla/openwpm-crawler/issues/33 for a description of the issues while crawling.
We should consider both of these constraints when we investigate an alternative way to handle saved files.
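One option that could satisfy both constraints is to concatenate the individually gzipped blobs into a small number of large batch files and keep an offset index per batch, so that any single document can still be recovered with one S3 byte-range GET plus a decompress. The sketch below is only an illustration of that idea, not an existing format in the project: the file names (`batch.gz`, `batch.index.json`), the SHA-256 keying, and the JSON index layout are all assumptions.

```python
"""Sketch: batch individually gzipped documents, keep random access by key.

Gzip members can be concatenated into one file and decompressed
independently, so an index of content_hash -> (offset, length) is enough
to retrieve a single document via an S3 ranged GET.
"""
import gzip
import hashlib
import json


def write_batch(documents, batch_path, index_path):
    """documents: iterable of raw bytes. Writes one batch file and its index."""
    index = {}
    offset = 0
    with open(batch_path, "wb") as batch:
        for content in documents:
            key = hashlib.sha256(content).hexdigest()
            member = gzip.compress(content)  # one self-contained gzip member
            batch.write(member)
            index[key] = {"offset": offset, "length": len(member)}
            offset += len(member)
    with open(index_path, "w") as f:
        json.dump(index, f)


def read_document(s3_client, bucket, batch_key, index, content_hash):
    """Fetch one document with a byte-range GET (boto3 S3 client assumed)."""
    entry = index[content_hash]
    start = entry["offset"]
    end = start + entry["length"] - 1  # HTTP Range header is inclusive
    resp = s3_client.get_object(
        Bucket=bucket,
        Key=batch_key,
        Range=f"bytes={start}-{end}",
    )
    return gzip.decompress(resp["Body"].read())
```

A post-crawl compaction job could build these batches from the existing per-file objects, or the crawler could write them directly; either way the index is what preserves cheap random lookups.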
Single documents (HTML, JS, JPG, ...) are saved and looked up randomly by key.
We may be able to use a key-value store like Amazon DynamoDB (or whatever the equivalent is on GCP)?
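For comparison, a rough sketch of the key-value-store idea is below. It assumes a hypothetical DynamoDB table named `page_content` with a string partition key `content_hash`; note that DynamoDB items are capped at 400 KB, so larger documents would still need to live in S3 (e.g. storing a pointer instead of the body).

```python
"""Sketch: store gzipped page content in DynamoDB, keyed by content hash."""
import gzip
import hashlib

import boto3

# Hypothetical table: partition key "content_hash" (string).
table = boto3.resource("dynamodb").Table("page_content")


def put_content(raw_bytes):
    """Compress and store one document; returns its content hash."""
    key = hashlib.sha256(raw_bytes).hexdigest()
    table.put_item(Item={
        "content_hash": key,
        "body": gzip.compress(raw_bytes),  # stored as a DynamoDB Binary
    })
    return key


def get_content(content_hash):
    """Look up one document by hash, or return None if absent."""
    item = table.get_item(Key={"content_hash": content_hash}).get("Item")
    if item is None:
        return None
    return gzip.decompress(item["body"].value)
```

This gives cheap random lookups, but per-request pricing and the item-size limit would need to be weighed against the batched-S3 approach above.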