robotex
Ruby library to obey robots.txt
@chriskite, is this repo being maintained? Any interest in adding a maintainer or transferring maintenance to someone else?
Resolves https://github.com/chriskite/robotex/issues/5.
**Bug Steps:** Attempt to extract the `delay` for a user agent from a `robots.txt` file where the `crawl-delay` for a specific user agent appears after the rule for `*`. Example:...
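For reference, a minimal sketch of the expected lookup, not Robotex's actual implementation: a group that names the user agent should win over `*` regardless of where it appears in the file. The helper name `crawl_delay_for` and the substring match on the agent token are assumptions made for illustration.

```ruby
# Hypothetical helper, not part of Robotex: returns the Crawl-delay that
# applies to `user_agent`, preferring a group that names the agent over `*`.
def crawl_delay_for(robots_txt, user_agent)
  delays = {}          # lowercased agent token => delay in seconds
  current_agents = []  # agents named by the group currently being read
  in_rules = false     # true once the group has seen a non-User-agent line

  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip
    next if line.empty?

    field, value = line.split(':', 2).map(&:strip)
    next if value.nil?

    case field.downcase
    when 'user-agent'
      # A User-agent line that follows rules starts a new group.
      if in_rules
        current_agents = []
        in_rules = false
      end
      current_agents << value.downcase
    when 'crawl-delay'
      in_rules = true
      current_agents.each { |agent| delays[agent] = value.to_f }
    else
      in_rules = true
    end
  end

  # Assumption: an agent token matches if it is a substring of the user agent.
  specific = delays.find { |agent, _| agent != '*' && user_agent.downcase.include?(agent) }
  specific ? specific[1] : delays['*']
end

robots = <<~ROBOTS
  User-agent: *
  Crawl-delay: 10

  User-agent: MyBot
  Crawl-delay: 2
ROBOTS

crawl_delay_for(robots, 'MyBot/1.0') # => 2.0
crawl_delay_for(robots, 'OtherBot')  # => 10.0
```

Because delays are collected per group before any matching happens, the order of the `*` group and the specific group in the file does not affect the result.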
Resolves https://github.com/chriskite/robotex/issues/7
Robotex does not follow redirected robots.txt pages, which can result in pages erroneously appearing to be `allowed?`. Example: In https://www.yelp.com/robots.txt: `Disallow: /biz_link`
```ruby
> robotex = Robotex.new "My User Agent"...
```
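One possible shape for redirect handling, sketched with Ruby's standard `Net::HTTP` rather than taken from Robotex itself. The helper name `fetch_following_redirects`, the `limit` default, and the injectable `fetcher` are assumptions made for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper, not part of Robotex: fetches `url`, following up to
# `limit` redirects, and returns the final response. `fetcher` is injectable
# so the logic can be exercised without network access.
def fetch_following_redirects(url, limit: 5, fetcher: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI.parse(url)
  limit.downto(0) do
    response = fetcher.call(uri)
    redirect = response.code.to_s.start_with?('3') && response['location']
    return response unless redirect

    # Location may be relative, so resolve it against the current URI.
    uri = URI.join(uri.to_s, response['location'])
  end
  raise "too many redirects fetching #{url}"
end
```

The rules would then be parsed from the body of the final response, not from the body of the first 3xx response.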
It would be very useful to have a method to extract the sitemaps listed in a robots.txt file, per the sitemaps specification: https://www.sitemaps.org/protocol.html#submit_robots Example usage: http://www.nytimes.com/robots.txt
```
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/sitemap.xml.gz
Sitemap:...
```
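A rough sketch of what such a method could look like, assuming it receives the robots.txt body as a string. The name `sitemaps` and the `example.com` URLs are hypothetical and not part of Robotex.

```ruby
# Hypothetical helper, not part of Robotex: returns the URL from every
# `Sitemap:` line. Sitemap lines apply regardless of User-agent groups.
def sitemaps(robots_txt)
  robots_txt.each_line.map { |line|
    # Only strip comments that start the line or follow whitespace, so a `#`
    # embedded in a URL is left alone.
    field, value = line.sub(/(\A|\s)#.*/, '').split(':', 2).map(&:strip)
    value if field && field.downcase == 'sitemap' && value && !value.empty?
  }.compact
end

robots = <<~ROBOTS
  User-agent: *
  Disallow: /private
  Sitemap: https://example.com/sitemap.xml
  sitemap: https://example.com/news.xml.gz # news section
ROBOTS

sitemaps(robots)
# => ["https://example.com/sitemap.xml", "https://example.com/news.xml.gz"]
```

Splitting on the first `:` only keeps the `https://` in the URL intact, and the field name is compared case-insensitively.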
Rules with trailing comments are not being applied correctly. Example: https://ask.fmcsa.dot.gov/robots.txt
```
User-agent: * # ADDED BY HMS
Disallow: / # ADDED BY HMS
```
The above rule should disallow...
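A small sketch of the parsing behavior this report expects, assuming everything from `#` to the end of the line is a comment. `parse_rule` is a hypothetical name, not a Robotex method.

```ruby
# Hypothetical helper, not part of Robotex: splits one robots.txt line into
# a lowercased field name and its value, ignoring any trailing `#` comment.
# Returns nil for blank lines, comment-only lines, and lines without a colon.
def parse_rule(line)
  content = line.sub(/#.*/, '').strip
  return nil if content.empty?

  field, value = content.split(':', 2)
  return nil if value.nil?

  [field.strip.downcase, value.strip]
end

parse_rule('User-agent: * # ADDED BY HMS') # => ["user-agent", "*"]
parse_rule('Disallow: / # ADDED BY HMS')   # => ["disallow", "/"]
parse_rule('# just a comment')             # => nil
```

With the comment removed before the value is read, the user agent is `*` and the disallowed path is `/`, not a string that still contains the comment text.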