Ian Milligan

Results: 28 comments by Ian Milligan

This looks great - I think this'd be useful, and these topics look like they'd complement some of our collections really well: the CPP collection we've got at http://webarchives.ca, and...

Does your application work if you launch it via `spark-shell`, along the lines of the following on a local machine:

```bash
/home/i2millig/spark-1.5.1/bin/spark-shell --driver-memory 6G --jars ~/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
```

To launch, and then `:paste:`...

Yes, my understanding is that `loadArchives` loads everything into memory – down the road, I think we'd like to explore using CDX files to be a bit more selective (i.e....

We do have a problem with large WARC files, which I'll continue in #254. This is weird, @dportabella – when you rebuild the WARC archive, does it work? Thanks!

Just re-pinging this to keep it alive. I think I have a good use case for this too. Now to find time...

Just re-ran the plain text extractor and this is still an issue. These are images where I think the MIME type is erroneously set to HTML. Related to #163, is...

Just keeping this alive – was playing around with an `ExtractEntities` call on another test collection and crashed on: ``` Unparseable header line: [?slÑ???r???]QQoGâ?XyÚ 6?YÛ¤¶i·J­Ö¤Rö ?âØÌ6M·_¿Ï?©Òd ÙçóÝ}Çábw?Ó Á?vxÿg? ¹©¥ju?\iå r3Ρ5wR«Øsp+Eߨùí;à$...

I'm not quite sure how to grab the record name, as I've got limited errors thrown. I'll put the gist here and maybe we can quickly chat about it today...

Just re-opening this. Did we reach any agreement here?

Great! Will test this.