aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Results: 6 aut issues

Users may desire outputs in WARC format after filtering their RDD[ArchiveRecord].

enhancement

**Describe the bug** On a number of FDLP and Stanford collections, we run into this heap space error, and it kills the Spark job. Upon investigation, this does not seem...

bug

Bumps [jsoup](https://github.com/jhy/jsoup) from 1.14.2 to 1.15.3. Release notes Sourced from jsoup's releases. jsoup 1.15.3 jsoup 1.15.3 is out now, and includes a security fix for potential XSS attacks, along with...

dependencies

EDIT: this helped; the docs may need to be updated: ``` sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/") ``` **Describe the bug** According to the docs, `aut` should be able to read data from `s3a`...

bug

Bumps [org.apache.tika:tika-core](https://github.com/apache/tika) from 1.23 to 3.2.2. Changelog Sourced from org.apache.tika:tika-core's changelog. Release 4.0.0-BETA1 - ??? BREAKING CHANGES Moved towards default json based configuration (TIKA-4544 and many others). tika-pipes implementation modules...

Java
dependencies

**Problem** The class `ExtractBoilerpipeText` doesn't fully do what it purports to; i.e., it sometimes leaves (often large) portions of, e.g., header and comment thread text in the output. [Boilerpipe](https://github.com/kohlschutter/boilerpipe) was last...

enhancement
dependencies