
Improved Crawl Log Data


Determine what crawl logs should be generated to help debug crawls. The page list (stored in pages.jsonl) already includes info on what pages were visited and when. Possible additions:

  • Page crawl graph data? (seed and crawl depth of each page)
  • Behavior state log?
  • Page resources? (which page was each resource loaded from?)

Thinking that the crawl graph data per page (seed and depth) and additional logging of behaviors would be most useful. The page resources will, of course, already be available in the CDX.
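For illustration, here is a minimal sketch of what a pages.jsonl record extended with crawl-graph data could look like; the `seed` and `depth` field names are assumptions for discussion, not a committed schema.

```ts
// Hypothetical shape of a pages.jsonl record extended with crawl-graph
// data. `url`, `title`, and `ts` resemble existing per-page fields;
// `seed` and `depth` are illustrative additions, not a committed schema.
interface PageRecord {
  url: string;    // page URL that was crawled
  title?: string; // page title, if captured
  ts: string;     // ISO timestamp of the visit
  seed: string;   // assumed: the seed URL this page was discovered from
  depth: number;  // assumed: link depth relative to that seed (0 = the seed itself)
}

// Example record (one line in pages.jsonl):
const example: PageRecord = {
  url: "https://example.com/about",
  title: "About",
  ts: "2021-07-27T15:07:00Z",
  seed: "https://example.com/",
  depth: 1,
};
```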

In the future, a pageId may also be added to WARC headers to better map resources to pages.

ikreymer commented on Jul 27, 2021

It would be great if a list of the top-level domains accessed during a crawl could be generated. After a first crawl with a default browser profile, this list could be used to create a profile that, for instance, is logged into the websites that are part of the crawl, or has already dismissed the cookie settings and "first time visit" banners if desired, and then repeat the crawl. This data can be extracted from pages.jsonl, but since that file can also contain full-text data per page, extraction can become complicated quickly.
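A minimal sketch of that extraction, assuming each pages.jsonl line is a JSON object with a `url` field (the full-text field, whatever its name, can simply be ignored while streaming):

```ts
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Stream pages.jsonl line by line and collect the distinct hostnames,
// without ever touching the (potentially large) full-text field.
async function collectHosts(path: string): Promise<Set<string>> {
  const hosts = new Set<string>();
  const rl = createInterface({ input: createReadStream(path) });
  for await (const line of rl) {
    if (!line.trim()) continue;
    try {
      const record = JSON.parse(line); // assumed: each page line has a `url` field
      if (typeof record.url === "string") {
        hosts.add(new URL(record.url).hostname);
      }
    } catch {
      // skip lines that are not valid JSON (e.g. a header line)
    }
  }
  return hosts;
}

// Usage: collectHosts("pages.jsonl").then((h) => console.log([...h].sort()));
```

Note this collects full hostnames; reducing them to registrable domains would additionally require a public-suffix list.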

Additionally, a log containing all URLs that caused errors would be very useful, for instance to capture them again later.

despens commented on Apr 4, 2022

Improved logging was merged in #195. Significant changes include:

  • Logs are output as JSON Lines (JSONL) with proper log levels and contexts to support filtering
  • Page crawl graph data included
  • Behaviors logged by default
  • New custom crawl stats implementation, replacing puppeteer-cluster stats
  • Optional debug logging and improved optional jserror logging

Not yet included are the suggestions in https://github.com/webrecorder/browsertrix-crawler/issues/74#issuecomment-1087661811, though URLs causing errors are now logged and can be extracted by filtering on log level and context.
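As an illustrative sketch of that filtering, assuming each log line is a JSON object with `logLevel`, `context`, and `details` fields and that a failing URL appears under `details.url` (these field names are assumptions for illustration, not a documented schema):

```ts
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Extract URLs from error-level log lines. Assumes each line is a JSON
// object with `logLevel`, `context`, and `details` fields, and that a
// failing URL appears under `details.url`; both are assumptions based
// on the JSONL logging described above.
async function errorUrls(logPath: string): Promise<string[]> {
  const urls: string[] = [];
  const rl = createInterface({ input: createReadStream(logPath) });
  for await (const line of rl) {
    if (!line.trim()) continue;
    try {
      const entry = JSON.parse(line);
      if (entry.logLevel === "error" && typeof entry.details?.url === "string") {
        urls.push(entry.details.url);
      }
    } catch {
      // skip malformed lines
    }
  }
  return urls;
}
```

The same loop can be narrowed further by also matching on `entry.context` to isolate, for example, page-load errors from behavior errors.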

tw4l commented on Jan 19, 2023

@despens it seems like the main outstanding issue from your comment is that getting TLDs from pages.jsonl can be difficult because of the presence of extracted full text, which seems best handled in a separate issue. I'm going to close this issue and we can track that in https://github.com/webrecorder/browsertrix-crawler/issues/203. Thanks!

tw4l commented on Jan 19, 2023