LeMoussel
Crawl-delay indicates the number of seconds a crawler/spider should wait between requests.

**robots.txt with Crawl-delay**
```
User-agent: Googlebot
Crawl-delay: 20

User-agent: Slurp
Crawl-delay: 20

User-Agent: msnbot
Crawl-Delay: 20
```
...
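For illustration, a minimal Go sketch of a fetch loop that honors such a delay; the `fetchPolitely` helper, the URLs, and the 20-second value are just placeholders matching the robots.txt above:

```
package main

import (
	"fmt"
	"net/http"
	"time"
)

// fetchPolitely issues GET requests sequentially, sleeping for the
// crawl delay between consecutive requests.
func fetchPolitely(urls []string, crawlDelay time.Duration) error {
	for i, u := range urls {
		if i > 0 {
			time.Sleep(crawlDelay) // honor Crawl-delay between requests
		}
		resp, err := http.Get(u)
		if err != nil {
			return err
		}
		resp.Body.Close()
		fmt.Println(u, resp.Status)
	}
	return nil
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b"}
	if err := fetchPolitely(urls, 20*time.Second); err != nil {
		fmt.Println(err)
	}
}
```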
If you’re on 64-bit Windows, you’ll see that you have a PROCESSOR_ARCHITECTURE environment variable:

> C:\>echo %PROCESSOR_ARCHITECTURE%
> AMD64

This is from a 64-bit Command Prompt on 64-bit Windows. If...
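A minimal Go sketch of the same check, assuming the usual semantics of `PROCESSOR_ARCHITECTURE` and `PROCESSOR_ARCHITEW6432` (the latter is only set for 32-bit processes running under WOW64):

```
package main

import (
	"fmt"
	"os"
)

func main() {
	// A 64-bit process on 64-bit Windows sees PROCESSOR_ARCHITECTURE=AMD64.
	// A 32-bit process under WOW64 sees PROCESSOR_ARCHITECTURE=x86 and
	// PROCESSOR_ARCHITEW6432=AMD64 instead.
	arch := os.Getenv("PROCESSOR_ARCHITECTURE")
	wow64 := os.Getenv("PROCESSOR_ARCHITEW6432")

	if arch == "AMD64" || wow64 == "AMD64" {
		fmt.Println("64-bit Windows")
	} else {
		fmt.Println("32-bit (or non-AMD64) Windows")
	}
}
```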
an extension that uses [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome)?
I think [chromedp](https://github.com/chromedp/chromedp) is a solution worth considering. _Package chromedp is a faster, simpler way to drive browsers (Chrome, Edge, Safari, Android, etc) without external dependencies (ie, Selenium, PhantomJS, etc) using...
@alejoloaiza, you can change the `UserAgent` with the runner options. You can look at the `UserAgent` option at https://github.com/chromedp/chromedp/blob/e57a331e5c3c3b51ba749c196f092966b9ae233e/runner/runner.go#L393

For example:
```
cdp.New(ctxt, cdp.WithRunnerOptions(
	runner.UserAgent(""),
))
```
Both are interesting:

1) Override the default options with `DefaultExecAllocatorOptions`. [Example](https://godoc.org/github.com/chromedp/chromedp#example-ExecAllocator) (see the sketch below)
2) Setting for every request. Example: a different proxy for every request.
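A minimal sketch of option 1, assuming the current chromedp `ExecAllocator` API; the user agent and proxy values are placeholders:

```
package main

import (
	"context"
	"log"

	"github.com/chromedp/chromedp"
)

func main() {
	// Start from the default allocator options and append the flags to override.
	opts := append(chromedp.DefaultExecAllocatorOptions[:],
		chromedp.UserAgent("my-crawler/1.0"),
		chromedp.ProxyServer("http://127.0.0.1:8080"),
	)

	allocCtx, cancel := chromedp.NewExecAllocator(context.Background(), opts...)
	defer cancel()

	ctx, cancel := chromedp.NewContext(allocCtx)
	defer cancel()

	var title string
	if err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com"),
		chromedp.Title(&title),
	); err != nil {
		log.Fatal(err)
	}
	log.Println("page title:", title)
}
```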
I propose using the [Robots Parser](https://github.com/samclarke/robots-parser) library, with these common functions in `utils.js`:

- getRobotsTxt(url)
- isAllowedRobotsTxt(url, ua)
- isDisallowedRobotsTxt(url, ua)
- getCrawlDelayRobotsTxt(ua)
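To show what these helpers would cover, here is a rough Go stand-in using `github.com/temoto/robotstxt` (the proposal itself targets the Node robots-parser library, so the helper names below are only illustrative):

```
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/temoto/robotstxt"
)

// getRobotsTxt fetches and parses /robots.txt for a site.
func getRobotsTxt(siteURL string) (*robotstxt.RobotsData, error) {
	resp, err := http.Get(siteURL + "/robots.txt")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return robotstxt.FromBytes(body)
}

// isAllowedRobotsTxt reports whether the given user agent may fetch the path.
func isAllowedRobotsTxt(robots *robotstxt.RobotsData, path, ua string) bool {
	return robots.TestAgent(path, ua)
}

// getCrawlDelayRobotsTxt returns the Crawl-delay for the given user agent.
func getCrawlDelayRobotsTxt(robots *robotstxt.RobotsData, ua string) time.Duration {
	if g := robots.FindGroup(ua); g != nil {
		return g.CrawlDelay
	}
	return 0
}

func main() {
	robots, err := getRobotsTxt("https://example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(isAllowedRobotsTxt(robots, "/some/page", "Googlebot"))
	fmt.Println(getCrawlDelayRobotsTxt(robots, "Googlebot"))
}
```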
> This is already possible by using `Network.set_template` and setting it to a custom template. `write_html` will then load the template from this `path`. Yep! I tested by doing this...
@DigitalGreyHat Can you give some/more information about your API?
Do you know when this PR will be merged? (I'd like to test for French)