HAProxy sample config for user-agent based filtering
As discussed in the TG group chat, it would be amazing to have an example config for user-agent and endpoint-based filtering like Anubis ships by default ("give a challenge to known AI crawlers and to everything that claims to be a normal browser, but let clients correctly identifying as bots/tools pass, so as not to impact things like curl or RSS readers").
Not a hard requirement for this tool by any means but would be nice to have.
So basically you want an example of embedding https://github.com/ai-robots-txt/ai.robots.txt/ into HAProxy as an ACL?
Not quite. Mostly an "only give challenges to user agents saying they are a real browser" rule in combination with a blacklist (which I don't expect you to keep up to date). This is probably trivial if you have a valid regex for "probably an interactive browser" and know HAProxy better than I do, but I've heard from a few people that they would have preferred to use Berghain and went with Anubis because it does this out of the box.
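Something like this is what I imagine (untested sketch on my part; I'm assuming that practically every interactive browser sends a User-Agent starting with "Mozilla/", while curl, wget and most RSS readers don't):

# Browsers identify as "Mozilla/5.0 ..."; honest tools usually don't.
acl claims_browser hdr_reg(User-Agent) -i "^mozilla/"
# Placeholder action: replace the deny with whatever serves your challenge.
http-request deny deny_status 403 if claims_browser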
Ok, but that's nothing Berghain itself would be able to do, as it doesn't do any logic and is just the challenge provider. This would be done in HAProxy itself. I will create a repo with a "best practice HAProxy that doesn't get DDoS'd" setup inside that configures Berghain and HAProxy, including all the additional things like proper filters etc.
Thank you very much, I'd be very interested in that :D
In case it's interesting, this is how I solved it thanks to the variable idea from @fionera (trimmed-down example; I generated these from the Anubis bot YAML data):
# ACL definitions first, since HAProxy requires a named ACL to be declared
# before the rule that references it. A crawler is "good" only if both its
# source network and its User-Agent match.
acl bot_duckduckbot_network src -f /etc/haproxy/allowlists/networks/duckduckbot -n
acl bot_duckduckbot_useragent hdr_reg(User-Agent) "DuckDuckBot/1\.1; \(\+http\://duckduckgo\.com/duckduckbot\.html\)"
acl bot_googlebot_network src -f /etc/haproxy/allowlists/networks/googlebot -n
acl bot_googlebot_useragent hdr_reg(User-Agent) "\+http\://www\.google\.com/bot\.html"
acl good_crawler var(req.is_goodcrawler) -m bool

# Flag verified crawlers in a request variable so the result can be reused.
http-request set-var(req.is_goodcrawler) bool(true) if bot_duckduckbot_network bot_duckduckbot_useragent
http-request set-var(req.is_goodcrawler) bool(true) if bot_googlebot_network bot_googlebot_useragent

# Serve the challenge page to everyone who is not a verified good crawler
# (the berghain_* ACLs are defined elsewhere in the full config).
http-request return status 403 content-type "text/html" file "/srv/www/berghain/index.html" if !berghain_down !berghain_valid !path_berghain berghain_active !good_crawler
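In case it helps: the -f files referenced above are plain pattern files with one IP or CIDR per line, and -n forbids DNS resolution of the entries. A sketch of one such file (placeholder addresses, use the ranges the search engines publish themselves):

# /etc/haproxy/allowlists/networks/duckduckbot
20.191.45.212
40.88.21.235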
You can then do something similar the other way around if you prefer (i.e. only challenge user agents claiming to be a browser and let everything else pass).
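A sketch of that variant, reusing the berghain_* placeholders from the trimmed example above and the same assumption from earlier in the thread that browsers identify with a leading "Mozilla/" token:

acl claims_browser hdr_reg(User-Agent) -i "^mozilla/"
# Challenge only clients claiming to be a browser; everything else passes.
http-request return status 403 content-type "text/html" file "/srv/www/berghain/index.html" if !berghain_down !berghain_valid !path_berghain berghain_active claims_browser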