warcbase
Warcbase is an open-source platform for managing and analyzing web archives.
When I try to run the "mvn clean package appassembler:assemble -DskipTests" command on an Ubuntu 16.04 server, I get a build error: PluginParameterException. After searching online and reading the appassembler docs, I...
#255 Fix for NoSuchFieldException in org.warcbase.data.HBaseTableManager
Currently, in order to use warcbase, users need to clone the repo and build it using Maven. This requires users to have a JDK and Maven installed on their machines. Should we...
We need to process a WARC archive, filter it based on keywords, and create a WARC archive. Something like this:

```
RecordLoader.loadArchives(in, sc)
  .keepValidPages()
  .filter(r => r.getContentString.contains("my keyword"))
  .saveAsWarcArchive("/path/out.warc.gz")
```
...
I try to get rid of duplicate pages as follows:

```
val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
  .keepValidPages()
  .groupBy(_.getUrl).values.map(_.head) // remove duplicates
  .map(r => r.getUrl)
  .take(10)
```

but I get this exception:...
When ingesting WARC/ARC files into HBase via the IngestFiles script (appassemble), a NoSuchFieldException is thrown by HBaseTableManager because its constructor tries to access the non-existent field maxKeyValueSize on the HTable object via reflection. As...
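The failure mode described in the HBaseTableManager report above (reflective access to a field that the target class no longer declares) can be illustrated in isolation. What follows is a minimal sketch in plain Java, not the fix from #255: `LegacyTable` and `NewerTable` are hypothetical stand-ins for two versions of a class such as HTable, and `readIntField` is an illustrative helper that returns an empty `Optional` instead of letting `NoSuchFieldException` propagate.

```java
import java.lang.reflect.Field;
import java.util.Optional;

public class ReflectiveFieldAccess {

    // Hypothetical stand-in for an older class version that still declares the field.
    static class LegacyTable {
        private int maxKeyValueSize = 1024;
    }

    // Hypothetical stand-in for a newer class version where the field was removed.
    static class NewerTable {
    }

    // Reads a private int field by name; returns Optional.empty() if the field
    // is not declared on the target's class or cannot be accessed.
    static Optional<Integer> readIntField(Object target, String fieldName) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            return Optional.of(field.getInt(target));
        } catch (NoSuchFieldException | IllegalAccessException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Field present: prints 1024
        System.out.println(readIntField(new LegacyTable(), "maxKeyValueSize").orElse(-1));
        // Field absent: prints -1 (the fallback) instead of throwing
        System.out.println(readIntField(new NewerTable(), "maxKeyValueSize").orElse(-1));
    }
}
```

Returning an `Optional` makes the caller decide what to do when the field is missing, which is one way to keep code working across library versions whose internals differ; whether that suits warcbase's ingestion code is an assumption here.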
I had memory problems running my program, and I see that I cannot even run this very simple example:

```
package application

import org.apache.spark._
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.rdd.RecordRDD._

object Test...
```
I've been tinkering around with @dportabella's #246 issue, as we also have some very large WARCs in a collection (e.g. some of 7 GB, others of 40, 50, or 60 GB). We do run into...