Ian Milligan issues

Results: 5 issues of Ian Milligan

Memory Issues on Large WARC Files

I've been tinkering with @dportabella's issue #246, as we also have some very large WARCs in a collection (i.e. some of 7 GB, others of 40, 50, or 60 GB). We do run into...

bug

feature

K-Means Clustering

As discussed, we're interested in incorporating K-Means clustering into warcbase. Can we take a collection (part of GeoCities, for example, or a smaller Archive-It collection) and separate it into k...

feature

Tweet URL Extraction: All Twitter Shortlinks

Right now, our script for URL extraction is as follows:

```
import org.warcbase.spark.matchbox._
import org.warcbase.spark.matchbox.TweetUtils._
import org.warcbase.spark.rdd.RecordRDD._

val tweets = RecordLoader.loadTweets("/mnt/vol1/data_sets/elxn42/ruest-white/elxn42-tweets-combined-deduplicated-unshortened-fixed.json", sc)
val r = tweets.flatMap(tweet => {"""http://[^ ]+""".r.findAllIn(tweet.text).toList})
  .countItems()...
```

feature
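The pattern in the script above, `http://[^ ]+`, matches only links that start with `http://`. The following is a minimal sketch of a broader pattern that also accepts `https://` and stops at any whitespace. It runs on plain strings with standard Scala collections, so it needs neither Spark nor warcbase. The names `TweetUrlSketch`, `extractUrls`, and `countUrls` are hypothetical, and `countUrls` is only a local stand-in for the counting step, not warcbase's `countItems()`.

```scala
object TweetUrlSketch {
  // Assumption: a link is http:// or https:// followed by non-whitespace characters.
  private val UrlPattern = """https?://\S+""".r

  // Returns every URL-like substring found in one tweet's text, in order.
  def extractUrls(text: String): List[String] =
    UrlPattern.findAllIn(text).toList

  // Counts each extracted URL across all texts, most frequent first
  // (ties broken alphabetically so the result is deterministic).
  def countUrls(texts: Seq[String]): Seq[(String, Int)] =
    texts.flatMap(extractUrls)
      .groupBy(identity)
      .map { case (url, hits) => (url, hits.size) }
      .toSeq
      .sortBy { case (url, n) => (-n, url) }
}
```

In a Spark job, the same `extractUrls` function could presumably be passed to `flatMap` over the tweet text in place of the inline regex.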

New Twitter Features: Few Suggestions, Request for Further Suggestions

Right now we've got URL extraction, language extraction, hashtag extraction, and image extraction. We should have a few more features documented. I think this could begin with: - plain text...

feature

Image Data Creeping into Plain Text

Some images have been sneaking into the extracted plain text, perhaps because (as per @anjackson) we are trusting server Content-Type. The binary data throws off/breaks text analysis workflows. See figure...

bug
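If the server-supplied Content-Type cannot be trusted, one possible safeguard is to inspect the first bytes of a record before treating it as text. The sketch below checks for the well-known leading signatures of PNG, JPEG, and GIF files. It is an illustration under stated assumptions, not warcbase's actual fix: the names `BinarySniffSketch` and `looksLikeImage` are hypothetical, and only these three formats are covered.

```scala
object BinarySniffSketch {
  // Leading byte signatures ("magic numbers") of three common image formats.
  private val Signatures: Seq[Array[Byte]] = Seq(
    Array(0x89, 0x50, 0x4E, 0x47).map(_.toByte), // PNG: 0x89 'P' 'N' 'G'
    Array(0xFF, 0xD8, 0xFF).map(_.toByte),       // JPEG: FF D8 FF
    "GIF8".getBytes("US-ASCII")                  // GIF87a and GIF89a
  )

  // True when the content begins with one of the known image signatures.
  def looksLikeImage(content: Array[Byte]): Boolean =
    Signatures.exists { sig =>
      java.util.Arrays.equals(content.take(sig.length), sig)
    }
}
```

Records for which `looksLikeImage` returns true could then be filtered out before plain-text extraction, whatever Content-Type the server reported.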