WebCrawler
WebCrawler extracts all accessible URLs from a website. It's built with .NET Core and .NET Standard 1.4, so you can host it anywhere (Windows, Linux, macOS).
The crawler does not use regular expressions to find links. Instead, web pages are parsed using AngleSharp,
a parser built upon the official W3C specification. This allows pages to be parsed the way a browser parses them, and handles tricky tags such as base.
For HTML files, URLs are extracted from:
- <a href="...">
- <area href="...">
- <audio src="...">
- <iframe src="...">
- <img src="...">
- <img srcset="...">
- <link href="...">
- <object data="...">
- <script src="...">
- <source src="...">
- <source srcset="...">
- <track src="...">
- <video src="...">
- <video poster="...">
- <... style="..."> (see CSS section)
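For illustration, here is a minimal sketch of this approach with AngleSharp. The URL is a placeholder, the two selectors cover only a subset of the attributes above, and the namespaces follow recent AngleSharp releases (older versions used `AngleSharp.Dom.Html`); it is not the crawler's actual code.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Html.Dom;

class LinkExtractorSample
{
    static async Task Main()
    {
        // Load and parse the page as a browser would (full DOM, <base> handling, ...).
        var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
        var document = await context.OpenAsync("https://example.org/");

        // IHtmlAnchorElement.Href resolves the raw attribute value against the
        // document's base URI, so <base href="..."> is honored automatically.
        foreach (var anchor in document.QuerySelectorAll("a[href]").OfType<IHtmlAnchorElement>())
            Console.WriteLine(anchor.Href);

        // Other elements expose resolved URLs the same way, e.g. <img src="...">.
        foreach (var img in document.QuerySelectorAll("img[src]").OfType<IHtmlImageElement>())
            Console.WriteLine(img.Source);
    }
}
```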
For CSS files, URLs are extracted from:
- url(...) values inside rules
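A rough sketch of the CSS side, assuming the separate AngleSharp.Css package. The inline sample stylesheet and the string scan that pulls the argument out of url(...) are simplifications for illustration; a real crawler would also unquote the value and resolve it against the stylesheet's own URL.

```csharp
using System;
using AngleSharp.Css.Dom;
using AngleSharp.Css.Parser;

class CssUrlExtractorSample
{
    static void Main()
    {
        var css = "body { background: url('img/bg.png'); } .logo { background-image: url(logo.svg); }";

        var parser = new CssParser();
        var sheet = parser.ParseStyleSheet(css);

        foreach (var rule in sheet.Rules)
        {
            if (rule is ICssStyleRule styleRule)
            {
                foreach (var property in styleRule.Style)
                {
                    // Naive scan for url(...) tokens in the declaration value.
                    var value = property.Value;
                    var start = value.IndexOf("url(", StringComparison.OrdinalIgnoreCase);
                    while (start >= 0)
                    {
                        var end = value.IndexOf(')', start);
                        if (end < 0)
                            break;
                        Console.WriteLine(value.Substring(start + 4, end - start - 4).Trim('\'', '"', ' '));
                        start = value.IndexOf("url(", end, StringComparison.OrdinalIgnoreCase);
                    }
                }
            }
        }
    }
}
```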

How to deploy on Azure (free)
You can deploy the website on Azure for free:
- Create a free Web App
- Enable WebSockets in Application Settings (see Introduction to WebSockets on Windows Azure Web Sites and Using Web Sockets with ASP.NET Core); the app must opt in as well, as shown in the sketch after this list
- Deploy the website using WebDeploy or FTP
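Enabling the feature in Azure only lifts the platform restriction; the ASP.NET Core application itself must also register the WebSocket middleware. A minimal sketch (the pipeline shown is illustrative, not this project's actual Startup class):

```csharp
using Microsoft.AspNetCore.Builder;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Must run before any middleware that handles the WebSocket upgrade request.
        app.UseWebSockets();

        // ... the rest of the pipeline (MVC, SignalR, or a custom socket handler) goes here.
    }
}
```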
Blog posts
Some parts of the code are explained in blog posts: