aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Users may desire outputs in WARC format after filtering their RDD[ArchiveRecord].
**Describe the bug** On a number of FDLP and Stanford collections, we run into this heap space error, and it kills the Spark job. Upon investigation, this does not seem...
Bumps [jsoup](https://github.com/jhy/jsoup) from 1.14.2 to 1.15.3. Release notes Sourced from jsoup's releases. jsoup 1.15.3 jsoup 1.15.3 is out now, and includes a security fix for potential XSS attacks, along with...
EDIT: this helped; the docs may need to be updated: `sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")` **Describe the bug** According to the docs, `aut` should be able to read data from `s3a`...
Bumps [org.apache.tika:tika-core](https://github.com/apache/tika) from 1.23 to 3.2.2. Changelog Sourced from org.apache.tika:tika-core's changelog. Release 4.0.0-BETA1 - ??? BREAKING CHANGES Moved towards default json based configuration (TIKA-4544 and many others). tika-pipes implementation modules...
**Problem** The class `ExtractBoilerpipeText` doesn't fully do what it purports to; i.e., it sometimes leaves (often large) portions of, e.g., header and comment thread text in the output. [Boilerpipe](https://github.com/kohlschutter/boilerpipe) was last...