HAProxy sample config for user-agent based filtering
As discussed in the TG group chat, it would be amazing to have an example config for user-agent and endpoint-based filtering like Anubis ships by default ("give a challenge to known AI crawlers and to everything that claims to be a normal browser, but let clients correctly identifying as bots/tools pass, so as not to impact things like curl or RSS readers").
Not a hard requirement for this tool by any means but would be nice to have.
So basically you want an example of embedding https://github.com/ai-robots-txt/ai.robots.txt/ into HAProxy as an ACL?
Not quite. Mostly an "only give challenges to user agents saying they are a real browser" rule in combination with a blacklist (which I don't expect you to keep up to date). This is probably trivial if you have a valid regex for "probably an interactive browser" and know HAProxy better than I do, but I've heard from a few people that they would have preferred to use Berghain and went with Anubis because it does this out of the box.
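Something like this is what I imagine (untested sketch on my part; I'm assuming that practically every interactive browser sends a User-Agent starting with "Mozilla/", while curl, wget and most RSS readers don't):

# Browsers identify as "Mozilla/5.0 ..."; honest tools usually don't.
acl claims_browser hdr_reg(User-Agent) -i "^mozilla/"
# Placeholder action: replace the deny with whatever serves your challenge.
http-request deny deny_status 403 if claims_browser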
Ok, but that's nothing Berghain itself would be able to do, as it doesn't do any logic and is just the challenge provider. This would be done in HAProxy itself. I will create a repo with a "best practice HAProxy that doesn't get DDoS'd" setup inside that configures Berghain and HAProxy, including all the additional things like proper filters etc.
Thank you very much, I'd be very interested in that :D
In case it's interesting, this is how I solved it thanks to the variable idea from @fionera (trimmed-down example; I generated these from the Anubis bot YAML data):
# ACL definitions first, since HAProxy requires a named ACL to be declared
# before the rule that references it. A crawler is "good" only if both its
# source network and its User-Agent match.
acl bot_duckduckbot_network src -f /etc/haproxy/allowlists/networks/duckduckbot -n
acl bot_duckduckbot_useragent hdr_reg(User-Agent) "DuckDuckBot/1\.1; \(\+http\://duckduckgo\.com/duckduckbot\.html\)"
acl bot_googlebot_network src -f /etc/haproxy/allowlists/networks/googlebot -n
acl bot_googlebot_useragent hdr_reg(User-Agent) "\+http\://www\.google\.com/bot\.html"
acl good_crawler var(req.is_goodcrawler) -m bool

# Flag verified crawlers in a request variable so the result can be reused.
http-request set-var(req.is_goodcrawler) bool(true) if bot_duckduckbot_network bot_duckduckbot_useragent
http-request set-var(req.is_goodcrawler) bool(true) if bot_googlebot_network bot_googlebot_useragent

# Serve the challenge page to everyone who is not a verified good crawler
# (the berghain_* ACLs are defined elsewhere in the full config).
http-request return status 403 content-type "text/html" file "/srv/www/berghain/index.html" if !berghain_down !berghain_valid !path_berghain berghain_active !good_crawler
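In case it helps: the -f files referenced above are plain pattern files with one IP or CIDR per line, and -n forbids DNS resolution of the entries. A sketch of one such file (placeholder addresses, use the ranges the search engines publish themselves):

# /etc/haproxy/allowlists/networks/duckduckbot
20.191.45.212
40.88.21.235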
You can then do something similar the other way around if you prefer (i.e. only challenge user agents claiming to be a browser and let everything else pass).
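A sketch of that variant, reusing the berghain_* placeholders from the trimmed example above and the same assumption from earlier in the thread that browsers identify with a leading "Mozilla/" token:

acl claims_browser hdr_reg(User-Agent) -i "^mozilla/"
# Challenge only clients claiming to be a browser; everything else passes.
http-request return status 403 content-type "text/html" file "/srv/www/berghain/index.html" if !berghain_down !berghain_valid !path_berghain berghain_active claims_browser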