browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

Results 197 browsertrix-crawler issues

It's probably time to add some sort of default ad-blocking, which would greatly enhance performance for many sites and is probably something that should be added to the base image. There's a few...

Launched a run via zimit using 0.7.0.beta.1 and the crawl process never exited. ``` Running browsertrix-crawler crawl: crawl --newContext page --waitUntil load,networkidle0 --depth -1 --timeout 90 --behaviors autoplay,autofetch,siteSpecific --behaviorTimeout 90...

The https://github.com/browserless/chrome image is fairly impressive and provides a great core dockerized browser, with many of the features needed for browsertrix crawler, including screencasting and the interactive debugger. Need to...

As an archiver, I want to make sure that the most important subpages from a domain are crawled as soon as possible in long-running crawls to prevent data loss. There...
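One way to picture this request is a crawl frontier ordered by priority instead of pure discovery order. The following is only a minimal sketch under that assumption; `PriorityFrontier`, its methods, and the numeric priority values are hypothetical names chosen for illustration and are not part of browsertrix-crawler.

```python
import heapq
import itertools


class PriorityFrontier:
    """Hypothetical URL frontier: lower priority numbers are crawled first."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-priority URLs keep insertion order.
        self._counter = itertools.count()

    def add(self, url, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        # Return the URL with the smallest (priority, insertion order) pair.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = PriorityFrontier()
frontier.add("https://example.com/archive/page-1")
frontier.add("https://example.com/news", priority=0)  # important subpage
frontier.add("https://example.com/archive/page-2")

crawl_order = [frontier.pop() for _ in range(len(frontier))]
```

In this sketch the important subpage is visited first even though it was queued second, while the remaining pages keep their discovery order.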

Consider an API (via the web server) that could alter the scoping rules mid-crawl, for example, usually to add additional exclusion rules and filter down an existing queue. This can...
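The queue-filtering half of this idea can be sketched as follows, assuming the pending queue is a plain collection of URL strings and the exclusion rules are regular expressions. The function name `apply_exclusions` and the sample URLs are hypothetical; this does not describe how the crawler stores or filters its queue.

```python
import re
from collections import deque


def apply_exclusions(queue, exclusion_patterns):
    """Return a new queue without URLs that match any exclusion pattern."""
    compiled = [re.compile(pattern) for pattern in exclusion_patterns]
    return deque(
        url for url in queue
        if not any(rx.search(url) for rx in compiled)
    )


# Hypothetical pending queue partway through a crawl.
queue = deque([
    "https://example.com/",
    "https://example.com/calendar/2021-01",
    "https://example.com/about",
])

# A new exclusion rule arrives mid-crawl and filters the existing queue.
filtered = apply_exclusions(queue, [r"/calendar/"])
```

Returning a new queue instead of mutating the old one keeps the sketch simple; a real crawler would also need to apply the same rules to URLs discovered later.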

I am trying to crawl the OAuth2-protected [Microsoft Stream](https://web.microsoftstream.com/studio/videos) site and found https://github.com/internetarchive/heritrix3/issues/446, which suggests using the Interactive Profile Creation option. Please let me know how to use the...

Hi, I've been looking to run some crawls of my organisation's SharePoint/intranet site but I'm having some issues getting through Microsoft 2FA authentication. Using --interactive successfully creates a profile of...

Determine what crawl logs should be generated to help debug a crawl. The page list (stored in `pages.jsonl`) already includes info on what pages were visited and when. Possible additions: -...

question

Support a way to set the locale that the browser is run with. Probably the best option is to set it with the profile, which can be done by setting the...

Trying a quick test with a simple website: `sudo docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url http://info.cern.ch/ --generateWACZ --text --collection test` results in: ``` Storing state in...