Ian Milligan
I've been tinkering around with @dportabella's #246 issue, as we also have some very large WARCs in a collection (e.g. some of 7 GB, others of 40, 50, or 60 GB). We do run into...
As discussed, we're interested in incorporating K-Means clustering into warcbase. Can we take a collection (part of GeoCities, for example, or a smaller Archive-It collection) and separate it into k...
Right now, our script for URL extraction is as follows:

```
import org.warcbase.spark.matchbox._
import org.warcbase.spark.matchbox.TweetUtils._
import org.warcbase.spark.rdd.RecordRDD._

val tweets = RecordLoader.loadTweets("/mnt/vol1/data_sets/elxn42/ruest-white/elxn42-tweets-combined-deduplicated-unshortened-fixed.json", sc)
val r = tweets.flatMap(tweet => {"""http://[^ ]+""".r.findAllIn(tweet.text).toList})
  .countItems()...
```
Right now we've got URL extraction, language extraction, hashtag extraction, and image extraction. We should have a few more features documented. I think this could begin with: - plain text...
Some images have been sneaking into the extracted plain text, perhaps because (as per @anjackson) we are trusting server Content-Type. The binary data throws off/breaks text analysis workflows. See figure...
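One way to guard against a wrong server Content-Type is to check the first few bytes of the payload against known image signatures before treating a record as text. The following is a minimal sketch of that idea, not warcbase's actual implementation: `ContentSniffer`, `sniffImage`, and `looksLikeText` are hypothetical names, and only the PNG, JPEG, and GIF signatures are covered.

```scala
// Hypothetical helper, not part of warcbase: sniffs a few image formats
// from their leading "magic" bytes instead of trusting Content-Type alone.
object ContentSniffer {
  // Leading byte signatures for three common image formats.
  private val signatures: Seq[(String, Array[Byte])] = Seq(
    "image/png"  -> Array(0x89, 0x50, 0x4E, 0x47).map(_.toByte),
    "image/jpeg" -> Array(0xFF, 0xD8, 0xFF).map(_.toByte),
    "image/gif"  -> Array(0x47, 0x49, 0x46, 0x38).map(_.toByte)
  )

  /** Returns the sniffed image MIME type if the payload starts with a known signature. */
  def sniffImage(payload: Array[Byte]): Option[String] =
    signatures.collectFirst {
      case (mime, magic) if payload.startsWith(magic) => mime
    }

  /** True only if the declared type is text/* AND the bytes do not look like an image. */
  def looksLikeText(declaredType: String, payload: Array[Byte]): Boolean =
    declaredType.toLowerCase.startsWith("text/") && sniffImage(payload).isEmpty
}
```

A filter along these lines could run before plain-text extraction, so that records declared as `text/html` but carrying JPEG bytes are dropped. A fuller solution would probably delegate detection to a dedicated library rather than a hand-written signature table.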