cc-pyspark
Process Common Crawl data with Python and Spark
I am using a 2021 iMac with the Apple M1 chip and macOS Monterey 12.4. So far, to set up PySpark, I have run `pip3 install pyspark` and cloned this repo...
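The setup steps described above can be sketched as shell commands. This is a hedged sketch, not a verified recipe: the repository URL and the presence of a `requirements.txt` are assumptions based on the cc-pyspark project; note that PySpark also needs a Java runtime to be available on the machine.

```shell
# Install PySpark from PyPI (a Java runtime must be installed separately)
pip3 install pyspark

# Clone this repository and install its Python dependencies
# (URL and requirements file assumed from the cc-pyspark project layout)
git clone https://github.com/commoncrawl/cc-pyspark.git
cd cc-pyspark
pip3 install -r requirements.txt
```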
- Created a Spark job subclassing CCSparkJob to retrieve HTML text data. This job works when passing an input file with...
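A job like the one described might look roughly as follows. This is a hedged sketch: `CCSparkJob` and its `process_record()` hook come from this repo's `sparkcc` module (the import guard only lets the snippet load standalone), the class name `HtmlTextJob` is hypothetical, and the stdlib-based text extraction is a stand-in for a real HTML-to-text library.

```python
# Guarded import: sparkcc comes from the cc-pyspark repo; the fallback
# lets this sketch be inspected without the repo on the path.
try:
    from sparkcc import CCSparkJob
except ImportError:
    CCSparkJob = object

from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Very rough plain-text extraction using only the stdlib."""
    parser = _TextExtractor()
    parser.feed(html)
    return ' '.join(parser.parts)


class HtmlTextJob(CCSparkJob):
    """Hypothetical job: yield (URL, plain text) per HTML response record."""
    name = 'HtmlTextJob'

    def process_record(self, record):
        # record is a warcio record; only response records carry HTML payloads
        if record.rec_type != 'response':
            return
        content = record.content_stream().read().decode('utf-8', errors='replace')
        yield record.rec_headers.get_header('WARC-Target-URI'), html_to_text(content)
```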
I can query WARC files and find the positions of Persian-language content, but I want just the text of it, specifically the WET files for them. Is there any option for language...
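There is no language filter shown in the snippet above, but a rough one can be sketched with the stdlib alone: check what fraction of a text's letters fall in the Arabic script block (U+0600 to U+06FF) that Persian is written in. This is a hedged sketch; the function name and the 0.5 threshold are arbitrary choices, and a real job would plug a proper language detector into `process_record()` over WET records instead.

```python
def looks_persian(text, threshold=0.5):
    """Heuristic: True if at least `threshold` of the letters are in the
    Arabic script block (U+0600-U+06FF), which Persian text uses.
    Note this cannot distinguish Persian from Arabic; use a real language
    detection library for that."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    arabic_script = sum(1 for c in letters if '\u0600' <= c <= '\u06ff')
    return arabic_script / len(letters) >= threshold
```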
[warcio](https://github.com/webrecorder/warcio#arc-files) is able to read ARC files as well, so it should be possible to run all examples designed to work on WARC files also on ARC files from the...
See [Plan for dropping Python 2 support](https://spark.apache.org/news/plan-for-dropping-python-2-support.html) - but there is little to do: only the `import` statements for `urlparse`/`urljoin` need to be removed.
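Once Python 2 support is dropped, the two functions come straight from the stdlib, with no compatibility branch. A minimal sketch of the Python-3-only form:

```python
# Python 3: both functions live in urllib.parse; no Python-2 fallback needed.
from urllib.parse import urlparse, urljoin

base = urlparse('https://commoncrawl.org/the-data/get-started/')
print(base.netloc)  # commoncrawl.org

# Resolve a relative link against a base URL
print(urljoin('https://commoncrawl.org/the-data/', 'examples'))
# https://commoncrawl.org/the-data/examples
```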
(addresses #37)
- implemented: base class `CCFastWarcSparkJob`
- examples/applications: `ServerCountFastWarcJob`, `ExtractHostLinksFastWarcJob`
- tested using FastWARC 0.12.2
- performance comparison warcio vs. FastWARC (local mode, small test data) -...
[FastWARC](https://github.com/chatnoir-eu/chatnoir-resiliparse/tree/develop/fastwarc) (see also [FastWARC API docs](https://resiliparse.chatnoir.eu/en/latest/api/fastwarc.html)) is a Python WARC parsing library
- written in C++ for high performance
- although inspired by [warcio](https://github.com/webrecorder/warcio), not API compatible
- without less-frequently...
With #37 it might be a good idea to provide an example which uses [Resiliparse's text extractors](https://resiliparse.chatnoir.eu/en/latest/man/extract/html2text.html) or simply its performant HTML parser.
[Simdjson](https://simdjson.org/) ([pysimdjson](https://pysimdjson.tkte.ch/)) should be faster than [ujson](https://pypi.org/project/ujson/) when parsing WAT payloads. It could be worth using as a drop-in replacement if installed (cf. #34 regarding ujson replacing the built-in...
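The drop-in idea could be sketched as a tiered import: prefer pysimdjson, then ujson, then the built-in `json`, whichever is installed. This is a hedged sketch and `parse_wat_payload` is a hypothetical helper name; all three libraries expose a `loads()` call with a compatible signature for this use, so callers need not change.

```python
# Pick the fastest available JSON parser; all expose a compatible loads().
try:
    import simdjson as json_impl        # pysimdjson, if installed
except ImportError:
    try:
        import ujson as json_impl       # ujson, if installed
    except ImportError:
        import json as json_impl        # stdlib fallback


def parse_wat_payload(payload):
    """Parse the JSON payload of a WAT record with whichever parser was found."""
    return json_impl.loads(payload)
```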
For some Spark jobs, we want to process an entire file at once. I copied and simplified sparkcc to do this; it is used in the upcoming integrity process.