wayback-machine-scraper issues

'wayback-machine-scraper' is not recognized as an internal or external command, operable program or batch file.

2

I installed it using pip, but this error appears.

snapshot functionality for a full site at a given time?

Hi, thanks for an interesting and useful project which has helped me make a start on reconstructing a site that would be really useful for a research project. I'm new...

DOSull

Following image links

2

Thanks so much for this scraper. It works so much better than the other wayback scraper tools I've found. I'm trying to scrape all snapshots of an old site and...

ellyjonez

enhancement

[Question] How to get latest crawl?

1

Is there a way that I can get the most recent version (a single version) of a full site crawl of a list of URLs?

santoshbs

enhancement

Fixed issues with wayback machine scraper

anikafuloria

'ExecutionEngine' object has no attribute 'schedule'

1

Command run:
```
wayback-machine-scraper -f 20231201 -t 20231220 http://breitbart.com/ads.txt
```
Output:
```
2024-01-20 09:40:43 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: scrapybot)
2024-01-20 09:40:43 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.9.13, cssselect 1.2.0,...
```

Yash-Vekaria

Error 429 + Scraper gives up

2

Many moons ago, Internet Archive added some rate limiting that seems to also affect the Wayback Machine. (See the discussion on a similar project here: https://github.com/buren/wayback_archiver/issues/32) The scraper scrapes too fast,...

a-n-d-a-i
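
One common way to cope with HTTP 429 responses is to retry with exponentially growing delays. The sketch below is only an illustration of that idea, not how wayback-machine-scraper or Scrapy handles it: `fetch_with_backoff`, its parameters, and the injected `fetch` callable are all hypothetical names introduced here.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch(url) and retry on HTTP 429, doubling the wait each time.

    `fetch` is a hypothetical callable returning (status_code, body); it is
    injected so the retry logic can be exercised without network access.
    """
    status, body = fetch(url)
    for attempt in range(max_retries):
        if status != 429:
            break  # success or a different error: stop retrying
        # Wait base_delay, then 2x, 4x, ... before trying again.
        sleep(base_delay * 2 ** attempt)
        status, body = fetch(url)
    return status, body
```

If the server still answers 429 after `max_retries` retries, the last 429 response is returned to the caller rather than raised, so the caller decides whether to give up.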

Broken with Scrapy 2.x

This is an issue with https://github.com/sangaline/scrapy-wayback-machine/issues/11 but it also breaks this project, so I thought it was worth mentioning here. Please accept this pull request https://github.com/sangaline/scrapy-wayback-machine/pull/9

a-n-d-a-i

Save snapshots .html instead of .snapshot

Because these files are really HTML files, save them as .html so they can be opened in a browser.

raphaelmerx
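
Until the scraper itself writes `.html` files, already-downloaded snapshots could be renamed after the fact. This is a minimal sketch under assumptions: the `.snapshot` extension comes from the issue title, while `rename_snapshots` and the output directory passed to it are hypothetical.

```python
from pathlib import Path


def rename_snapshots(root: str) -> int:
    """Rename every *.snapshot file under root to *.html.

    Returns the number of files renamed. Files whose .html target already
    exists are skipped so nothing is overwritten.
    """
    renamed = 0
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(root).rglob("*.snapshot")):
        target = path.with_suffix(".html")
        if target.exists():
            continue
        path.rename(target)
        renamed += 1
    return renamed
```

For example, `rename_snapshots("website")` would walk a hypothetical `website` output directory recursively and leave all other files untouched.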

Not scraping any page

2

Hi, I have been trying to get this to work, but no luck. The example in the docs is not even working. Below is the output:
```
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',...
```

josylad
