Bypass bot detectors

Open LVerneyPEReN opened this issue 5 years ago • 10 comments

Hi,

Rakuten and Leboncoin have very strong bot detectors, which prevent us from fetching their CGUs (terms of use) automatically, at least from a regular OVH machine. See https://fr.shopping.rakuten.com/newhelp/conditions-generales/ or https://www.leboncoin.fr/dc/cgu. It is possible that #138 and having JS enabled will help here, but I think this won't be enough.

Best,

EDIT: Same for RueDuCommerce (see https://www.rueducommerce.fr/info/mentions-legales/cgv) and FNAC (https://www.fnac.com/Help/cgv-fnac#bl=footer). They all use the same system, powered by DataDome.

LVerneyPEReN avatar Oct 09 '20 13:10 LVerneyPEReN

Hi,

I hope using a headless browser will fix this. So I suggest waiting for #138 to be implemented and then checking whether the issue remains. Unless you have an idea that would be quicker to implement?

Ndpnt avatar Oct 15 '20 08:10 Ndpnt

Using a headless browser is not enough to fix this. You have to disguise it (with https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth for instance), and even then you are still identified by your IP address if you connect from server infrastructure rather than a residential connection (DataDome, used on Leboncoin for instance, does this).

LVerneyPEReN avatar Oct 15 '20 15:10 LVerneyPEReN

As discussed with @LucasVerneyDGE and @TomHouriezDGE, this option will be needed for some sources, even after #138 is fixed. However, it also raises legal questions. @LucasVerneyDGE will investigate which entities might have the power to legally bypass access control systems, and we will design the most appropriate software architecture (opt-in, opt-out, plugin) based on the legal assessment 🙂

MattiSG avatar Oct 16 '20 10:10 MattiSG

Hi all,

Jumping back on this matter, as we encounter it more and more often.

One of the common issues we run into is a 403 response caused by a Web Application Firewall (WAF).

We have already encountered three of them:

  • Cloudflare https://github.com/ambanum/OpenTermsArchive/issues/316
  • Imperva https://github.com/ambanum/OpenTermsArchive/issues/319
  • DataDome: this ticket
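
For triage, a 403 can sometimes be attributed to one of these vendors from the response headers. Below is a minimal sketch, not taken from Open Terms Archive. The `guess_waf` helper is hypothetical, and the header markers it checks are assumptions about what these products commonly send: `cf-ray` or `server: cloudflare` for Cloudflare, `x-iinfo` or `x-cdn` for Imperva, `x-datadome` or a `datadome` cookie for DataDome. Vendors may change them at any time.

```python
from typing import Mapping, Optional


def guess_waf(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return a best-guess WAF vendor for a 403 response, or None.

    The markers below are heuristic assumptions, not vendor guarantees.
    """
    if status != 403:
        return None
    # HTTP header names are case-insensitive, so normalise before matching.
    h = {name.lower(): value.lower() for name, value in headers.items()}
    if "cf-ray" in h or h.get("server", "") == "cloudflare":
        return "Cloudflare"
    x_cdn = h.get("x-cdn", "")
    if "x-iinfo" in h or "incapsula" in x_cdn or "imperva" in x_cdn:
        return "Imperva"
    if "x-datadome" in h or "datadome=" in h.get("set-cookie", ""):
        return "DataDome"
    return None


if __name__ == "__main__":
    # Hypothetical header sets, for illustration only.
    print(guess_waf(403, {"Server": "cloudflare", "CF-RAY": "abc"}))
    print(guess_waf(403, {"Set-Cookie": "datadome=xyz; Path=/"}))
    print(guess_waf(200, {"Server": "cloudflare"}))
```

A `None` result only means no known marker was found. The 403 may still come from a WAF, or from the origin server itself.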

@LVerneyPEReN do you have any news? I contacted Imperva and Cloudflare to ask to be whitelisted as a bot and am waiting for their answers.

martinratinaud avatar Aug 26 '21 06:08 martinratinaud

Legal analysis by PEReN was still pending on 08/03/2022.

Imperva and Cloudflare answers are still pending.

In order to help with prioritisation, instead of listing issues in this repository, they are now labeled in each affected instance with dedicated tags (403, timeout…).

MattiSG avatar Apr 25 '22 07:04 MattiSG

@LVerneyPEReN did the PEReN finish its legal analysis? 🙂

On our side, I believe we never got a reply from either Imperva or Cloudflare (please correct me if I'm wrong @martinratinaud).

MattiSG avatar Apr 24 '23 08:04 MattiSG

Indeed, we did not 😔

martinratinaud avatar Apr 24 '23 09:04 martinratinaud

Cloudflare maintains a list of verified bots. They state “Cloudflare manually approves well-behaved services that benefit the broader Internet and honor robots.txt.” That page has a link to “add a bot”, which requires a Cloudflare account.
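
Since the criteria quoted above include honoring robots.txt, a crawler can check a URL against a site's rules before fetching it. Here is a minimal sketch using Python's standard `urllib.robotparser`. The user agent name `ExampleTermsBot`, the sample rules and the `is_allowed` helper are all hypothetical, and nothing here describes how Open Terms Archive actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent and robots.txt content, for illustration only.
USER_AGENT = "ExampleTermsBot"
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines,
    # so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/terms",
                "https://example.com/account/settings"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, USER_AGENT, url))
```

In a real crawler the rules would come from the target site's own `/robots.txt`, and a disallowed URL would be skipped rather than fetched.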

MattiSG avatar Jun 26 '24 13:06 MattiSG