polipus
Polipus: distributed and scalable web-crawler framework
I was able to crawl Unicode pages in 0.4.0, but after upgrading to 0.5.0 only some English characters appear in a crawled page. Please let me know if there is any...
I found that URLs containing anchors like "#sku:123" (i.e. a fragment containing a colon) were not cleaned up when passed to the `to_absolute` method. As a consequence, they were escaped and added...
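A minimal sketch of the idea behind the fix, assuming plain stdlib `URI` (the method name below is hypothetical, not Polipus's actual `to_absolute`): strip the fragment before resolving, so a fragment like `#sku:123` is never percent-escaped into the URL.

```ruby
require 'uri'

# Hypothetical helper, not Polipus's real code: resolve a link against its
# base page and drop the fragment, so "#sku:123" never reaches the queue.
def to_absolute_without_fragment(link, base)
  uri = URI.join(base, link)
  uri.fragment = nil          # drop "#sku:123", "#foo", etc.
  uri.to_s
end

puts to_absolute_without_fragment("abc.html#sku:123", "http://www.example.com/")
# => http://www.example.com/abc.html
```

Dropping the fragment is safe for a crawler because the server never sees it anyway; two URLs differing only in their fragment fetch the same resource.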
https://github.com/taganaka/polipus
Bundler could not find compatible versions for gem "bson": In snapshot (Gemfile.lock): bson (1.9.2) In Gemfile: mongoid (~> 4.0.0) ruby depends on moped (~> 2.0.0) ruby depends on bson (~>...
Does it make sense to have support for headless crawling built-in to the framework? A lot of the websites these days are Single Page apps and crawling that using the...
When anchor links are found during the crawl (e.g. http://www.example.com/abc.html#foo), they are encoded: the anchor character is replaced with its escape %23, which causes the page to respond...
For some reason I'm not able to install the polipus gem on JRuby 1.7.13. I tried both Windows 8 and Ubuntu 12.04 and got the same error message: Gem::Installer::ExtensionBuildError: ERROR: Failed to build...
Hi! It seems one of our threads is stuck in an HTTP call. I think the function is: https://github.com/taganaka/polipus/blob/master/lib/polipus/http.rb#L170 It looks like the connection is never closed. Any idea what...
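A sketch of one way to guard against a call hanging forever, using plain stdlib `Net::HTTP` rather than Polipus's actual HTTP layer: set explicit connect and read timeouts, and use the block form of `start` so the socket is always closed, even when an exception is raised mid-request.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper, not Polipus's real HTTP code: build a client with
# explicit timeouts so a stalled server cannot pin a crawler thread.
def build_http(uri)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl      = uri.scheme == 'https'
  http.open_timeout = 10   # seconds allowed for the TCP connect
  http.read_timeout = 30   # seconds allowed for each socket read
  http
end

def fetch(url)
  uri = URI(url)
  # The block form guarantees Net::HTTP#finish runs, closing the socket
  # even if the GET raises (timeout, reset, etc.).
  build_http(uri).start { |h| h.get(uri.request_uri) }
end
```

The timeout values here are illustrative; the important part is that neither is left at a value that allows an unbounded wait.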
As a result of #33, I reconsidered the current structure of `PolipusCrawler`. In particular, `PolipusCrawler#takeover` is a very long method in which a lot is going on at the same time. `PolipusCrawler` itself...
When you try to resolve a domain which does not exist, Polipus creates an error page with `SocketError`. In effect, though, the page does not exist anymore, so it's like a 404...
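A hypothetical sketch of the suggestion, mapping low-level fetch errors to HTTP-like status codes (`status_for` is an illustrative name, not a Polipus API):

```ruby
require 'socket'   # defines SocketError
require 'timeout'  # defines Timeout::Error

# Hypothetical helper: translate low-level errors into HTTP-style codes so
# a dead domain is stored as a missing page rather than a crawler error.
def status_for(error)
  case error
  when SocketError    then 404  # DNS resolution failed: treat like "not found"
  when Timeout::Error then 408  # request timed out
  else 500                      # anything else stays a generic error
  end
end

puts status_for(SocketError.new)  # => 404
```

Storing a 404-style status keeps dead domains from being retried or treated as transient failures on later crawls.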
The current S3 implementation is partially broken and doesn't work well under heavy load. Providing a Fog adapter looks like a better way to me: http://fog.io/storage/