cc-pyspark
Provide classes to use FastWARC to read WARC/WAT/WET files
(address #37)
- implemented
  - base class CCFastWarcSparkJob
  - examples/applications
    - ServerCountFastWarcJob
    - ExtractHostLinksFastWarcJob
- tested using FastWARC 0.12.2
  - performance comparison warcio vs. FastWARC (local mode, small test data)
    - 23% faster - ServerCountFastWarcJob (63s -> 48s)
    - 8% faster - ExtractHostLinksFastWarcJob (72s -> 66s)
  - successfully ran ExtractHostLinksFastWarcJob on a cluster (Spark on YARN) to prepare the May, June/July and August 2022 web graphs
- to do
  - iterate_records(): how to access the WARC record offset and length
  - more encapsulation: call warcio/FastWARC methods only indirectly, so that some example classes need nothing more than a change of base class (CCSparkJob -> CCFastWarcSparkJob)
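
A rough sketch of the "more encapsulation" to-do item, not the actual implementation. The job logic only calls accessor methods on its base class, so the warcio and FastWARC variants differ only in which base class they inherit from. All names here (SketchSparkJob, SketchFastWarcSparkJob, ServerCountLogic, the two record stand-ins) are hypothetical and use only the standard library. The stand-in records are only loosely modeled on how the two libraries expose the record type and HTTP headers.

```python
from collections import Counter


class WarcioLikeRecord:
    """Hypothetical stand-in: record type as a string, headers via a method."""

    def __init__(self, rec_type, http_headers):
        self.rec_type = rec_type
        self._http_headers = http_headers

    def get_http_header(self, name):
        return self._http_headers.get(name)


class FastWarcLikeRecord:
    """Hypothetical stand-in: record type as a flag, headers as a mapping."""

    def __init__(self, is_response, http_headers):
        self.is_response = is_response
        self.http_headers = http_headers


class SketchSparkJob:
    """Hypothetical base class: accessors for warcio-like records."""

    @staticmethod
    def is_response_record(record):
        return record.rec_type == 'response'

    @staticmethod
    def get_http_header(record, name):
        return record.get_http_header(name)


class SketchFastWarcSparkJob(SketchSparkJob):
    """Hypothetical base class: same accessors for FastWARC-like records."""

    @staticmethod
    def is_response_record(record):
        return record.is_response

    @staticmethod
    def get_http_header(record, name):
        return record.http_headers.get(name)


class ServerCountLogic:
    """Job logic written once; it never touches a record attribute directly."""

    fallback_server_name = '(no server in HTTP header)'

    def count_servers(self, records):
        counts = Counter()
        for record in records:
            if not self.is_response_record(record):
                continue
            server = self.get_http_header(record, 'Server')
            counts[server or self.fallback_server_name] += 1
        return counts


# The two job variants differ only in the base class they name.
class ServerCountSketchJob(ServerCountLogic, SketchSparkJob):
    pass


class ServerCountFastWarcSketchJob(ServerCountLogic, SketchFastWarcSparkJob):
    pass


headers = [{'Server': 'nginx'}, {'Server': 'Apache'}, {}, {'Server': 'nginx'}]
warcio_counts = ServerCountSketchJob().count_servers(
    [WarcioLikeRecord('response', h) for h in headers]
    + [WarcioLikeRecord('request', {})])
fastwarc_counts = ServerCountFastWarcSketchJob().count_servers(
    [FastWarcLikeRecord(True, h) for h in headers]
    + [FastWarcLikeRecord(False, {})])
```

With this layout, adding another example job means writing its logic once against the accessors and then declaring two thin subclasses, one per base class.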