cc-pyspark

Process Common Crawl data with Python and Spark

Results 14 cc-pyspark issues

I am using a 2021 iMac with the Apple M1 chip and macOS Monterey 12.4. So far, to set up PySpark, I have run `pip3 install pyspark` and cloned this repo...

- Created a Spark job subclassing CCSparkJob to retrieve HTML text data. The job works when passing an input file with

I can query WARC files and find the positions of Persian-language records, but I want only their text, specifically from the corresponding WET files. Is there any option for language...

[warcio](https://github.com/webrecorder/warcio#arc-files) is able to read ARC files as well, so it should be possible to run all examples designed to work on WARC files also on ARC files from the...

enhancement

See [Plan for dropping Python 2 support](https://spark.apache.org/news/plan-for-dropping-python-2-support.html) - but there is little to do, only the `import` statements for `urlparse`/`urljoin` need to be removed.

enhancement
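
For the Python 2 cleanup described above, here is a minimal sketch of what the change could look like. It assumes the existing code wraps the imports in a `try`/`except ImportError` fallback, which may not match the actual statements in the repository. `resolve_host` is a hypothetical helper used only for illustration. On Python 3, both functions live in `urllib.parse`:

```python
# Assumed shape of the existing Python 2/3 compatible imports:
#   try:
#       from urlparse import urljoin, urlparse      # Python 2
#   except ImportError:
#       from urllib.parse import urljoin, urlparse  # Python 3

# Python 3 only: a single import is enough.
from urllib.parse import urljoin, urlparse


def resolve_host(base_url, link):
    """Resolve a (possibly relative) link against the page URL
    and return the host name of the resulting target URL."""
    return urlparse(urljoin(base_url, link)).hostname


# A protocol-relative link inherits the scheme of the page it appears on.
host = resolve_host("https://example.com/a/b.html", "//cdn.example.org/x.js")
```

Dropping the fallback leaves the call sites untouched, since the function names are the same in both modules.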

(address #37)
- implemented
  - base class CCFastWarcSparkJob
  - examples/applications
    - ServerCountFastWarcJob
    - ExtractHostLinksFastWarcJob
- tested using FastWARC 0.12.2
- performance comparison warcio FastWARC (local mode, small test data)
-...

enhancement

[FastWARC](https://github.com/chatnoir-eu/chatnoir-resiliparse/tree/develop/fastwarc) (see also [FastWARC API docs](https://resiliparse.chatnoir.eu/en/latest/api/fastwarc.html)) is a Python WARC parsing library
- written in C++ for high performance
- although inspired by [warcio](https://github.com/webrecorder/warcio), not API compatible
- without less-frequently...

enhancement

With #37 it might be a good idea to provide an example which uses [Resiliparse's text extractors](https://resiliparse.chatnoir.eu/en/latest/man/extract/html2text.html) or simply the performant HTML parser.

enhancement

[Simdjson](https://simdjson.org/) ([pysimdjson](https://pysimdjson.tkte.ch/)) should be faster than [ujson](https://pypi.org/project/ujson/) when parsing WAT payloads. It could be worth using it as a drop-in replacement if installed (cf. #34 regarding ujson replacing the built-in...

enhancement
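
The "drop-in replacement if installed" idea from the Simdjson issue above could be sketched as a chain of import fallbacks. This is an illustration, not the project's actual code: the payload below is a made-up, heavily shortened WAT-style record, and `target_uri` is a hypothetical helper.

```python
# Prefer the fastest JSON parser that happens to be installed.
# All three modules expose a compatible loads() function.
try:
    import simdjson as json      # pysimdjson
except ImportError:
    try:
        import ujson as json
    except ImportError:
        import json              # standard library fallback

# Hypothetical, shortened WAT-style payload (not real crawl data).
wat_payload = (
    '{"Envelope": {"WARC-Header-Metadata": '
    '{"WARC-Target-URI": "https://example.com/"}}}'
)


def target_uri(payload):
    """Parse a WAT-style JSON payload and return the target URI."""
    record = json.loads(payload)
    return record["Envelope"]["WARC-Header-Metadata"]["WARC-Target-URI"]
```

Because the selected module is bound to the name `json`, the rest of the job code does not need to know which parser is in use.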

For some Spark jobs, we want to process an entire file at once. I copied and simplified sparkcc to do this. This is used in the upcoming integrity process.