cc-pyspark

Provide classes to use FastWARC to read WARC/WAT/WET files

Open sebastian-nagel opened this issue 3 years ago • 0 comments

FastWARC (see also FastWARC API docs) is a Python WARC parsing library:

  • written in C++ for high performance
  • inspired by warcio, but not API-compatible with it
  • without less frequently used features, e.g. reading ARC files or (as of now) chunked transfer encoding

Ideally, API differences between FastWARC and warcio should be hidden in methods of CCSparkJob or a derived class, so that users do not need to care about them except in very specific cases. Because of these differences, and because FastWARC's C++ components must be compiled, using FastWARC should be optional.
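As a rough illustration of the idea, the sketch below wraps the record accessors of both libraries behind the same two static methods. The class names `WarcioAccess` and `FastWarcAccess`, the helper `select_access`, and the flag `HAVE_FASTWARC` are hypothetical and chosen only for this example; they are not existing cc-pyspark code. The sketch assumes warcio's `record.rec_headers.get_header()` and `record.content_stream()`, and FastWARC's `record.headers.get()` and `record.reader`.

```python
import importlib.util

# FastWARC is optional: only check whether it is installed,
# without importing it (hypothetical flag name).
HAVE_FASTWARC = importlib.util.find_spec("fastwarc") is not None


class WarcioAccess:
    """Accessors for records yielded by warcio's ArchiveIterator."""

    @staticmethod
    def get_warc_header(record, name, default=None):
        # warcio keeps WARC headers in a StatusAndHeaders object
        return record.rec_headers.get_header(name, default)

    @staticmethod
    def get_payload_stream(record):
        return record.content_stream()


class FastWarcAccess:
    """Accessors for records yielded by FastWARC's ArchiveIterator."""

    @staticmethod
    def get_warc_header(record, name, default=None):
        # FastWARC exposes WARC headers as a dict-like header map
        return record.headers.get(name, default)

    @staticmethod
    def get_payload_stream(record):
        return record.reader


def select_access(use_fastwarc=False):
    """Return the accessor class; fail early if FastWARC is requested but missing."""
    if use_fastwarc:
        if not HAVE_FASTWARC:
            raise ImportError("FastWARC requested but not installed")
        return FastWarcAccess
    return WarcioAccess
```

A job would then call `access.get_warc_header(record, 'WARC-Target-URI')` regardless of which parser produced the record, and only jobs that depend on parser-specific behavior would need to know which library is in use.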

sebastian-nagel, Sep 20 '22 14:09