Noah Levitt comments

Results: 43 comments by Noah Levitt

How to connect db entries from the table "sites" to a belonging warc-file?

You can set the WARC prefix using warcprox-meta, as shown here: https://github.com/internetarchive/brozzler/blob/master/job-conf.rst#using-warcprox-meta If you don't, captures from all your jobs and sites will be mixed together in the same WARCs.
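As a rough illustration of what "setting the WARC prefix" amounts to, here is a minimal sketch that builds a `Warcprox-Meta` request header carrying a `warc-prefix` value. The helper name `warcprox_meta_header` and the prefix `job1` are made up for this example; in a brozzler job configuration the same JSON object would, as far as I understand the linked document, go under the `warcprox_meta` key.

```python
import json


def warcprox_meta_header(warc_prefix):
    """Return a headers dict with a Warcprox-Meta header (hypothetical helper).

    The header value is a JSON object; "warc-prefix" asks warcprox to write
    captures for this request into WARC files whose names start with it.
    """
    return {"Warcprox-Meta": json.dumps({"warc-prefix": warc_prefix})}


# "job1" is an illustrative prefix, not a value taken from the thread.
headers = warcprox_meta_header("job1")
print(headers)
```

Using a distinct prefix per job or site is what lets you map a site back to its WARC files by filename.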

Update macOS instructions for Chromium installation

Most people have Google Chrome installed, and brozzler should find it automatically. I don't want to make people think brozzler is more trouble to install than it actually is.

How to add behaviors?

Hello @sepastian, what @galgeek said, and the code that loads behaviors is here: https://github.com/internetarchive/brozzler/blob/master/brozzler/__init__.py#L97 Maybe you could add support for an environment variable or command line option to point brozzler...

How to add behaviors?

@sepastian something like what you propose should be fine, though I'm not sure about the details at this moment. I think it's best to focus on identifying the issue with...

Feature request: Pass rendered DOM to youtube-dl instead of asking youtube-dl to download the page from the original URL

This is a cool idea. But it's one of those things that I probably won't have time to implement myself (unless it happens to be needed to solve some issue...

how does worker pick a site after crash?

If you wait an hour, it should start crawling again. See https://github.com/internetarchive/brozzler/blob/e23fa68d6/brozzler/frontier.py#L117. If you can't wait, you could set `claimed=false` in RethinkDB.
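The rule described here can be sketched in plain Python: a site is up for grabs if it is not claimed, or if its claim is more than an hour old. This is only an illustration of the idea, not the code in `frontier.py`; the `last_claimed` field name and the `is_claimable` helper are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# One hour, per "if you wait an hour" above.
CLAIM_TIMEOUT = timedelta(hours=1)


def is_claimable(site, now=None):
    """Return True if a worker may claim this site (hypothetical helper).

    `site` is a dict with a boolean "claimed" and, when claimed, a
    timezone-aware "last_claimed" datetime (assumed field name).
    """
    now = now or datetime.now(timezone.utc)
    if not site.get("claimed"):
        return True
    last_claimed = site.get("last_claimed")
    # A stale claim (worker crashed without releasing it) expires.
    return last_claimed is None or now - last_claimed > CLAIM_TIMEOUT


now = datetime.now(timezone.utc)
crashed = {"claimed": True, "last_claimed": now - timedelta(hours=2)}
print(is_claimable(crashed, now))
```

Setting `claimed` to false by hand simply takes the first branch, so the site becomes claimable without waiting for the timeout.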

how does worker pick a site after crash?

@mishranitin2003 It's not random. It has to be high enough that you will never have one worker claim a site when another is legitimately working on it. The value should...

JavaScript files harvested as partial content (HTTP 206) break playback

Interesting. I don't think we have a configuration mechanism to avoid saving 206's at the moment. It might be worth adding such a feature. Alternatively it might also make sense...

Performance Suggestions?

> I'm running a local version of the website on my local machine. So that site is not running from its public domain. Is there a way to get brozzler to...

Alternative handling of illegal IDNs (such as domains with emojis)

See also #40 and http://unicode.org/cldr/utility/idna.jsp?a=%E2%98%83.net IMHO this library needs to match browser behavior (at least optionally) or its usefulness is severely limited. Which also means it should reject URLs that...
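For context on the linked example, the `a=` parameter is a percent-encoded snowman domain. A minimal sketch using only the Python standard library shows how it decodes and what its ASCII form looks like. Note that Python's built-in `idna` codec follows the older IDNA 2003 rules, which accept symbols like this one; it is used here only for illustration and is not the library under discussion.

```python
from urllib.parse import unquote

# Query value copied from the link above.
encoded = "%E2%98%83.net"

# Percent-decoding yields the Unicode hostname.
hostname = unquote(encoded)

# Python's stdlib codec applies IDNA 2003 and accepts the symbol;
# stricter IDNA 2008 processing would reject it.
ascii_hostname = hostname.encode("idna").decode("ascii")

print(hostname, ascii_hostname)
```

Whether a URL library should accept, convert, or reject such a host is exactly the browser-compatibility question raised in the comment.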