cc-pyspark

Provide classes to use FastWARC to read WARC/WAT/WET files

Open sebastian-nagel opened this issue 3 years ago • 0 comments

(addresses #37)

  • implemented
    • base class CCFastWarcSparkJob
    • examples/applications
      • ServerCountFastWarcJob
      • ExtractHostLinksFastWarcJob
  • tested using FastWARC 0.12.2
  • performance comparison warcio vs. FastWARC (local mode, small test data)
    • 23% faster - ServerCountFastWarcJob (63s -> 48s)
    • 8% faster - ExtractHostLinksFastWarcJob (72s -> 66s)
  • successfully ran ExtractHostLinksFastWarcJob on a cluster (Spark on Yarn) to prepare the May, June/July, and August 2022 web graphs
  • to do
    • iterate_records(): how to access WARC record offset and length
    • more encapsulation: call warcio/fastwarc methods indirectly, so that some example classes only need a change of base class (CCSparkJob -> CCFastWarcSparkJob)
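
A minimal sketch of the encapsulation idea from the last to-do item, not the code in this PR. The class names `BaseJob`, `FastBaseJob`, `UriCountJob`, `UriCountFastJob` and the accessor `get_warc_header()` are hypothetical, and the records are plain dicts standing in for warcio and FastWARC record objects, so the snippet runs without either library or Spark. The point it illustrates: if job code reaches record data only through accessor methods defined on the base class, the job body can stay identical and only the base class changes.

```python
class BaseJob:
    """Hypothetical stand-in for a warcio-backed base class."""

    @staticmethod
    def get_warc_header(record, name, default=None):
        # Assumed record layout for this sketch: headers stored as a list
        # of (name, value) pairs.
        for key, value in record["header_pairs"]:
            if key == name:
                return value
        return default

    def process_record(self, record):
        raise NotImplementedError


class FastBaseJob(BaseJob):
    """Hypothetical stand-in for a FastWARC-backed base class."""

    @staticmethod
    def get_warc_header(record, name, default=None):
        # Assumed record layout for this sketch: headers stored as a mapping.
        return record["header_map"].get(name, default)


class UriCountJob(BaseJob):
    """Job logic written only against the accessor, never the record type."""

    def process_record(self, record):
        uri = self.get_warc_header(record, "WARC-Target-URI")
        if uri is not None:
            yield uri, 1


class UriCountFastJob(FastBaseJob):
    # Same job body, different base class: the method is reused unchanged.
    process_record = UriCountJob.process_record


if __name__ == "__main__":
    slow = {"header_pairs": [("WARC-Target-URI", "http://example.com/")]}
    fast = {"header_map": {"WARC-Target-URI": "http://example.com/"}}
    print(list(UriCountJob().process_record(slow)))
    print(list(UriCountFastJob().process_record(fast)))
```

In a real job the accessors would wrap the respective library calls; which accessors are needed (headers, payload, record type, offset and length) depends on the example classes being ported.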

sebastian-nagel, Sep 21 '22 12:09