cc-pyspark
Provide classes to use FastWARC to read WARC/WAT/WET files
FastWARC (see also FastWARC API docs) is a Python WARC parsing library which is
- written in C++ for high performance
- inspired by warcio, but not API-compatible with it
- missing some less frequently used features, e.g. reading ARC files or (as of now) chunked transfer encoding
Ideally, the API differences between FastWARC and warcio should be encapsulated in methods of CCSparkJob or a derived class, so that users only need to care about them in very specific cases. Because of these differences, and because FastWARC requires compiling C++ components, its usage should be optional.
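One way this could look is sketched below. It is only an illustration, not the cc-pyspark implementation: the names `HAVE_FASTWARC`, `iter_records`, `get_warc_header` and `get_payload` are hypothetical, and the accessors tell the two record types apart by duck typing (warcio records expose `rec_headers` and `content_stream()`, FastWARC records expose `headers` and `reader`). FastWARC is only imported if it is installed, so it stays optional.

```python
import importlib.util

# True only if the optional FastWARC package is installed (assumption:
# availability of the top-level "fastwarc" package is a sufficient check).
HAVE_FASTWARC = importlib.util.find_spec("fastwarc") is not None


def iter_records(stream, use_fastwarc=HAVE_FASTWARC):
    """Hypothetical helper: iterate over WARC records with either parser."""
    if use_fastwarc:
        from fastwarc.warc import ArchiveIterator
    else:
        from warcio.archiveiterator import ArchiveIterator
    return ArchiveIterator(stream)


def get_warc_header(record, name, default=None):
    """Hypothetical helper: read a WARC header from either record type."""
    if hasattr(record, "rec_headers"):
        # warcio: headers are a StatusAndHeaders object
        return record.rec_headers.get_header(name, default)
    # FastWARC: headers are a dict-like header map
    return record.headers.get(name, default)


def get_payload(record):
    """Hypothetical helper: read the record payload as bytes."""
    if hasattr(record, "content_stream"):
        # warcio: payload is exposed through content_stream()
        return record.content_stream().read()
    # FastWARC: payload is exposed through the record's reader
    return record.reader.read()
```

In a job class, such helpers would presumably live as methods of CCSparkJob, so that user code calls, say, `get_warc_header(record, 'WARC-Target-URI')` without knowing which parser produced the record.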