cc-pyspark
Process Common Crawl data with Python and Spark
I am using a 2021 iMac with the Apple M1 chip and macOS Monterey 12.4. So far, to set up PySpark, I have run `pip3 install pyspark` and cloned this repo...
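The setup steps described above can be sketched as shell commands. This is a hedged sketch, not a verified recipe: the repository URL and the presence of a `requirements.txt` are assumptions based on the cc-pyspark project; note that PySpark also needs a Java runtime to be available on the machine.

```shell
# Install PySpark from PyPI (a Java runtime must be installed separately)
pip3 install pyspark

# Clone this repository and install its Python dependencies
# (URL and requirements file assumed from the cc-pyspark project layout)
git clone https://github.com/commoncrawl/cc-pyspark.git
cd cc-pyspark
pip3 install -r requirements.txt
```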
- Created a Spark job subclassing CCSparkJob to retrieve HTML text data. This job works when passing an input file with...
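A job like the one described might look roughly as follows. This is a hedged sketch: `CCSparkJob` and its `process_record()` hook come from this repo's `sparkcc` module (the import guard only lets the snippet load standalone), the class name `HtmlTextJob` is hypothetical, and the stdlib-based text extraction is a stand-in for a real HTML-to-text library.

```python
# Guarded import: sparkcc comes from the cc-pyspark repo; the fallback
# lets this sketch be inspected without the repo on the path.
try:
    from sparkcc import CCSparkJob
except ImportError:
    CCSparkJob = object

from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Very rough plain-text extraction using only the stdlib."""
    parser = _TextExtractor()
    parser.feed(html)
    return ' '.join(parser.parts)


class HtmlTextJob(CCSparkJob):
    """Hypothetical job: yield (URL, plain text) per HTML response record."""
    name = 'HtmlTextJob'

    def process_record(self, record):
        # record is a warcio record; only response records carry HTML payloads
        if record.rec_type != 'response':
            return
        content = record.content_stream().read().decode('utf-8', errors='replace')
        yield record.rec_headers.get_header('WARC-Target-URI'), html_to_text(content)
```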
I can query WARC files and find the positions of Persian-language content, but I want just the text of it, specifically the WET files for them. Is there any option for language...
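There is no language filter shown in the snippet above, but a rough one can be sketched with the stdlib alone: check what fraction of a text's letters fall in the Arabic script block (U+0600 to U+06FF) that Persian is written in. This is a hedged sketch; the function name and the 0.5 threshold are arbitrary choices, and a real job would plug a proper language detector into `process_record()` over WET records instead.

```python
def looks_persian(text, threshold=0.5):
    """Heuristic: True if at least `threshold` of the letters are in the
    Arabic script block (U+0600-U+06FF), which Persian text uses.
    Note this cannot distinguish Persian from Arabic; use a real language
    detection library for that."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    arabic_script = sum(1 for c in letters if '\u0600' <= c <= '\u06ff')
    return arabic_script / len(letters) >= threshold
```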
[warcio](https://github.com/webrecorder/warcio#arc-files) is able to read ARC files as well, so it should be possible to run all examples designed to work on WARC files also on ARC files from the...
See [Plan for dropping Python 2 support](https://spark.apache.org/news/plan-for-dropping-python-2-support.html) - but there is little to do: only the `import` statements for `urlparse`/`urljoin` need to be removed.
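Once Python 2 support is dropped, the two functions come straight from the stdlib, with no compatibility branch. A minimal sketch of the Python-3-only form:

```python
# Python 3: both functions live in urllib.parse; no Python-2 fallback needed.
from urllib.parse import urlparse, urljoin

base = urlparse('https://commoncrawl.org/the-data/get-started/')
print(base.netloc)  # commoncrawl.org

# Resolve a relative link against a base URL
print(urljoin('https://commoncrawl.org/the-data/', 'examples'))
# https://commoncrawl.org/the-data/examples
```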
(addresses #37)
- implemented: base class `CCFastWarcSparkJob`
- examples/applications: `ServerCountFastWarcJob`, `ExtractHostLinksFastWarcJob`
- tested using FastWARC 0.12.2
- performance comparison warcio vs. FastWARC (local mode, small test data) -...
[FastWARC](https://github.com/chatnoir-eu/chatnoir-resiliparse/tree/develop/fastwarc) (see also [FastWARC API docs](https://resiliparse.chatnoir.eu/en/latest/api/fastwarc.html)) is a Python WARC parsing library
- written in C++ for high performance
- although inspired by [warcio](https://github.com/webrecorder/warcio), not API compatible
- without less-frequently...
With #37 it might be a good idea to provide an example which uses [Resiliparse's text extractors](https://resiliparse.chatnoir.eu/en/latest/man/extract/html2text.html) or simply its performant HTML parser.
[Simdjson](https://simdjson.org/) ([pysimdjson](https://pysimdjson.tkte.ch/)) should be faster than [ujson](https://pypi.org/project/ujson/) when parsing WAT payloads. It could be worth using as a drop-in replacement if installed (cf. #34 regarding ujson replacing the built-in...
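The drop-in idea could be sketched as a tiered import: prefer pysimdjson, then ujson, then the built-in `json`, whichever is installed. This is a hedged sketch and `parse_wat_payload` is a hypothetical helper name; all three libraries expose a `loads()` call with a compatible signature for this use, so callers need not change.

```python
# Pick the fastest available JSON parser; all expose a compatible loads().
try:
    import simdjson as json_impl        # pysimdjson, if installed
except ImportError:
    try:
        import ujson as json_impl       # ujson, if installed
    except ImportError:
        import json as json_impl        # stdlib fallback


def parse_wat_payload(payload):
    """Parse the JSON payload of a WAT record with whichever parser was found."""
    return json_impl.loads(payload)
```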
For some Spark jobs, we want to process an entire file at once. I copied and simplified sparkcc to do this; it is used in the upcoming integrity process.