Noah Levitt
You can set the WARC prefix using warcprox-meta, as shown here: https://github.com/internetarchive/brozzler/blob/master/job-conf.rst#using-warcprox-meta. If you don't, captures from all your jobs and sites will be mixed together in the same WARCs.
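As a rough illustration of what that setting amounts to, here is a minimal Python sketch that builds a `Warcprox-Meta` request header carrying a `warc-prefix` value. The helper name `warcprox_meta_header` and the prefix `job1-seed1-warcs` are made up for this example; in a brozzler job the same key would normally go under `warcprox_meta` in the job configuration rather than being built by hand.

```python
import json


def warcprox_meta_header(warc_prefix):
    """Return a headers dict whose Warcprox-Meta value is a JSON object.

    Hypothetical helper for illustration only; it is not part of brozzler
    or warcprox.
    """
    return {"Warcprox-Meta": json.dumps({"warc-prefix": warc_prefix})}


# Example prefix (an assumption): one prefix per job or site keeps the
# captures from different crawls in separate WARC files.
headers = warcprox_meta_header("job1-seed1-warcs")
print(headers["Warcprox-Meta"])
```

Choosing a distinct prefix per job or per seed is what keeps captures from ending up in the same files.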
Most people have Google Chrome installed, and brozzler should find it automatically. I don't want to make people think it's more trouble to install brozzler than it actually is.
Hello @sepastian, what @galgeek said, and the code that loads behaviors is here: https://github.com/internetarchive/brozzler/blob/master/brozzler/__init__.py#L97. Maybe you could add support for an environment variable or command line option to point brozzler...
@sepastian something like what you propose should be fine, though I'm not sure about the details at this moment. I think it's best to focus on identifying the issue with...
This is a cool idea. But it's one of those things that I probably won't have time to implement myself (unless it happens to be needed to solve some issue...
If you wait an hour, it should start crawling again. See https://github.com/internetarchive/brozzler/blob/e23fa68d6/brozzler/frontier.py#L117. If you can't wait, you could set `claimed=false` in rethinkdb.
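The idea behind the one-hour wait can be sketched as follows. This is not the code in `frontier.py`; it is a simplified model that assumes each site record has a boolean `claimed` field and a `last_claimed` timestamp, and that a claim older than one hour is treated as stale. The function name `is_claimable` and the constant `CLAIM_TIMEOUT` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the timeout is taken from the "wait an hour" advice above.
CLAIM_TIMEOUT = timedelta(hours=1)


def is_claimable(site, now):
    """Return True if a worker may claim this site record.

    A site is claimable when it is not claimed at all, or when its claim
    is older than CLAIM_TIMEOUT (the claiming worker is presumed dead).
    """
    if not site.get("claimed"):
        return True
    last_claimed = site.get("last_claimed")
    return last_claimed is not None and now - last_claimed > CLAIM_TIMEOUT


now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
stuck = {"claimed": True, "last_claimed": now - timedelta(minutes=10)}
stale = {"claimed": True, "last_claimed": now - timedelta(hours=2)}
freed = {"claimed": False, "last_claimed": now - timedelta(minutes=10)}

print(is_claimable(stuck, now), is_claimable(stale, now), is_claimable(freed, now))
```

In this model, setting `claimed` to false by hand has the same effect as waiting for the claim to expire.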
@mishranitin2003 It's not random. It has to be high enough that you will never have one worker claim a site when another is legitimately working on it. The value should...
Interesting. I don't think we have a configuration mechanism to avoid saving 206's at the moment. It might be worth adding such a feature. Alternatively it might also make sense...
> I'm running a local version of the website on my local machine. So that site is not running from its public domain. Is there a way to get brozzler to...
See also #40 and http://unicode.org/cldr/utility/idna.jsp?a=%E2%98%83.net. IMHO this library needs to match browser behavior (at least optionally) or its usefulness is severely limited. Which also means it should reject URLs that...