wayback-machine-scraper

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.

Results 10 wayback-machine-scraper issues

Hi, thanks for an interesting and useful project which has helped me make a start on reconstructing a site that would be really useful for a research project. I'm new...

Thanks so much for this scraper. It works so much better than the other wayback scraper tools I've found. I'm trying to scrape all snapshots of an old site and...

enhancement

Is there a way that I can get the most recent version (a single version) of a full site crawl of a list of URLs?

enhancement

Command Run:
```
wayback-machine-scraper -f 20231201 -t 20231220 http://breitbart.com/ads.txt
```
Output:
```
2024-01-20 09:40:43 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2024-01-20 09:40:43 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.13, cssselect 1.2.0,...
```

Many moons ago, the Internet Archive added some rate limiting that also seems to affect the Wayback Machine. (See the discussion on a similar project here: https://github.com/buren/wayback_archiver/issues/32) The scraper scrapes too fast,...
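To make the throttling idea in this issue concrete, here is a minimal sketch of Scrapy settings that slow a crawl down. The setting names are standard Scrapy settings, but the helper `throttle_settings` and the target of 15 requests per minute are assumptions for illustration only. The source states neither the Internet Archive's actual limits nor whether the command-line tool exposes these settings.

```python
def throttle_settings(requests_per_minute: float) -> dict:
    """Build Scrapy settings aiming at a target request rate (hypothetical helper)."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    delay = 60.0 / requests_per_minute  # seconds between requests
    return {
        # Minimum delay between requests to the same domain.
        "DOWNLOAD_DELAY": delay,
        # Keep the delay fixed instead of randomizing it around the value above.
        "RANDOMIZE_DOWNLOAD_DELAY": False,
        # One request at a time per domain, so the delay is not bypassed by concurrency.
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
        # Let Scrapy back off further when responses slow down.
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": delay,
        "AUTOTHROTTLE_MAX_DELAY": delay * 10,
        # Retry on "429 Too Many Requests" as well as common server errors.
        "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
    }


# Illustrative value only: 15 requests per minute is an assumed target.
settings = throttle_settings(15)
```

In a Scrapy project that uses the middleware, a dictionary like this could be merged into `settings.py` or passed as a spider's `custom_settings`.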

This is an issue with https://github.com/sangaline/scrapy-wayback-machine/issues/11, but it also breaks this project, so I thought it was worth mentioning here. Please accept this pull request: https://github.com/sangaline/scrapy-wayback-machine/pull/9

Because these files are really HTML files, save them as .html so they can be opened in a browser.
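Until a change like this lands, already-downloaded files could be renamed after the fact. The sketch below assumes the saved files end in `.snapshot`; that extension, the `website` directory name, and the helper `rename_snapshots_to_html` are all hypothetical and should be adjusted to whatever the scraper actually wrote.

```python
from pathlib import Path


def rename_snapshots_to_html(root, old_suffix=".snapshot"):
    """Rename every file under root ending in old_suffix to end in .html.

    old_suffix is an assumed extension, not confirmed by the source.
    Returns the list of new paths; existing .html files are never overwritten.
    """
    renamed = []
    # sorted() materializes the matches before any file is renamed.
    for path in sorted(Path(root).rglob(f"*{old_suffix}")):
        if not path.is_file():
            continue
        target = path.with_suffix(".html")
        if target.exists():
            continue  # skip rather than clobber an existing file
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_snapshots_to_html("website")` would then make each snapshot under that directory openable directly in a browser.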

Hi, I have been trying to get this to work but have had no luck. The example in the docs does not even work. Below is the output.
```
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',...
```