Reduce the surface for bot detection

Open englehardt opened this issue 6 years ago • 7 comments

There are likely a number of ways to identify that we're running Firefox with Selenium/geckodriver. Back in the Selenium 2 days, such identifiers were injected by the Selenium extension, and we made some efforts to prevent that (https://github.com/mozilla/OpenWPM/pull/108). We later removed those mitigations with the upgrade to Selenium 3 because, at least at the time, Selenium 3 didn't self-identify via navigator.webdriver (https://github.com/mozilla/OpenWPM/pull/152). I'm guessing that's no longer the case. The move from Xvfb to headless mode in #426 may further increase discoverability, since Firefox may skip loading some graphics-related components (rather than loading fully in a virtual display environment).

See also:

  • https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver
  • https://antoinevastel.com/bot%20detection/2017/08/05/detect-chrome-headless.html
  • https://github.com/antoinevastel/fpscanner

englehardt, Aug 09 '19 15:08

It looks like WebGL doesn't work properly in headless mode:

selenium_firefox     - DEBUG    - BROWSER -1382707880: driver: JavaScript warning: https://login.taobao.com/member/login.jhtml?tpl_redirect_url=https%3A%2F%2Fwww.tmall.com&style=miniall&enup=true&newMini2=true&full_redirect=true&sub=true&from=tmall&allp=assets_css%3D3.0.10/login_pc.css&pms=1566085516350, line 450: Error: WebGL warning: getContext: Disallowing antialiased backbuffers due to blacklisting.
selenium_firefox     - DEBUG    - BROWSER -1382707880: driver: JavaScript warning: https://login.taobao.com/member/login.jhtml?tpl_redirect_url=https%3A%2F%2Fwww.tmall.com&style=miniall&enup=true&newMini2=true&full_redirect=true&sub=true&from=tmall&allp=assets_css%3D3.0.10/login_pc.css&pms=1566085516350, line 450: Error: WebGL warning: <SetDimensions>: Can't use WebGL in headless mode (https://bugzil.la/1375585).
selenium_firefox     - DEBUG    - BROWSER -1382707880: driver: JavaScript warning: https://login.taobao.com/member/login.jhtml?tpl_redirect_url=https%3A%2F%2Fwww.tmall.com&style=miniall&enup=true&newMini2=true&full_redirect=true&sub=true&from=tmall&allp=assets_css%3D3.0.10/login_pc.css&pms=1566085516350, line 450: Error: WebGL warning: <SetDimensions>: Failed to create WebGL context: WebGL creation failed:
selenium_firefox     - DEBUG    - BROWSER -1382707880: driver: * Can't use WebGL in headless mode (https://bugzil.la/1375585).
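
A page could plausibly observe the same failure from its own side. The following is a minimal sketch, assuming that the failed context creation surfaces to page scripts as a null return from getContext (or as an exception); the helper name canCreateWebGLContext is hypothetical and is not part of OpenWPM or of any detection library referenced in this issue.

```javascript
// Hypothetical helper: reports whether a WebGL context can be created.
// `doc` is any object with a createElement method (normally `document`),
// passed in explicitly so the function can also be exercised with a mock.
function canCreateWebGLContext(doc) {
  try {
    const canvas = doc.createElement("canvas");
    // getContext returns null when the requested context type is unavailable.
    const gl =
      canvas.getContext("webgl") || canvas.getContext("experimental-webgl");
    return gl !== null && gl !== undefined;
  } catch (e) {
    // Treat any exception during context creation as "no WebGL".
    return false;
  }
}
```

On a real page this would be called as canCreateWebGLContext(document); a false result on hardware that normally supports WebGL is one signal a detector might combine with others.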

englehardt, Aug 17 '19 23:08

Removing the webdriver attribute, and anything else that reveals to websites that automation is active, is a great first step. See:

  • https://github.com/antoinevastel/fpscanner/blob/master/src/fpScanner.js
  • https://github.com/antoinevastel/fp-collect/blob/7963703e97784cd228af35dbce4785c5da0427e2/src/fpCollect.js#L228-L254
  • https://antoinevastel.com/bot%20detection/2017/08/05/detect-chrome-headless.html

We could make a test page that checks for these properties to see which ones are exposed while OpenWPM is driving Firefox with geckodriver/Selenium. Then we'll want to figure out how to remove them.
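
As a rough illustration of what such a test page might run, here is a minimal sketch of a probe runner. The names AUTOMATION_PROBES and runProbes are hypothetical, and the probe table contains only the navigator.webdriver check mentioned in this issue; further probes would be added as they are identified.

```javascript
// Hypothetical probe table: each entry maps a name to a function that
// receives a window-like object and returns true when the signal is exposed.
const AUTOMATION_PROBES = {
  navigatorWebdriver: (win) => win.navigator.webdriver === true,
};

// Runs every probe and collects the results. A probe that throws is
// recorded as an error string instead of aborting the whole run.
function runProbes(win, probes = AUTOMATION_PROBES) {
  const report = {};
  for (const [name, probe] of Object.entries(probes)) {
    try {
      report[name] = Boolean(probe(win));
    } catch (e) {
      report[name] = "error: " + e.message;
    }
  }
  return report;
}
```

On the test page itself, the call would be runProbes(window), with the resulting object written into the DOM or logged so that a crawl can record which signals are exposed.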

Overwriting in JS via a content script is still probably the easiest option, but is a bit hacky. Ideally, we would patch Firefox with a build flag that allows us to disable the webdriver self-identification when running crawls; it might be as simple as adding an ifdef around this line [2]. However, it would be helpful to know whether that's the only property exposed when marionette is enabled (i.e., when geckodriver/selenium is used).

englehardt, Oct 11 '19 00:10

I am currently fixing the webdriver attribute so that it is set to false, as in a regular Firefox instance on Ubuntu. My approach is to overwrite the attribute in a JS content script.

Using Object.defineProperty() opens a new way of identification. Therefore, I switched to another way of overwriting. As this is a bit complex, I based the code on other code published under the GNU General Public License v3.0. How is this compatible with the license of OpenWPM? Do I need to consider something special before starting a pull request?
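
The comment does not say which identification vector is meant. One plausible candidate, shown here only as an assumption, is that a naive Object.defineProperty() call on the navigator object leaves an own property behind, whereas in an unmodified browser webdriver is an accessor on the Navigator prototype. The sketch below uses a stand-in class (FakeNavigator, hypothetical) so that it runs outside a browser; naiveOverride and hasOwnWebdriver are likewise illustrative names.

```javascript
// Stand-in for the browser's Navigator interface (assumption for this
// sketch): `webdriver` is an accessor on the prototype, not on the instance.
class FakeNavigator {
  get webdriver() {
    return true;
  }
}

// Naive override: shadows the prototype getter with an own data property.
function naiveOverride(nav) {
  Object.defineProperty(nav, "webdriver", {
    value: false,
    configurable: true,
  });
}

// Detection: an untouched instance has no own `webdriver` property,
// so the presence of one reveals that the value was tampered with.
function hasOwnWebdriver(nav) {
  return Object.getOwnPropertyDescriptor(nav, "webdriver") !== undefined;
}
```

After naiveOverride runs, reading nav.webdriver yields false as intended, but hasOwnWebdriver(nav) flips from false to true, which is the kind of side effect a fingerprinting script could look for.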

Another thing: for my thesis project I will continue to fix other revealing properties as well. Should I submit these as separate pull requests or as one "big" pull request?

-Daniel

Flnch, Oct 20 '19 16:10

> I am currently fixing the webdriver attribute so that it is set to false, as in a regular Firefox instance on Ubuntu. My approach is to overwrite the attribute in a JS content script.
>
> Using Object.defineProperty() opens a new way of identification. Therefore, I switched to another way of overwriting. As this is a bit complex, I based the code on other code published under the GNU General Public License v3.0. How is this compatible with the license of OpenWPM? Do I need to consider something special before starting a pull request?

OpenWPM is GPLv3, so that's fine. You can just reference the other codebase following this example.

> Another thing: for my thesis project I will continue to fix other revealing properties as well. Should I submit these as separate pull requests or as one "big" pull request?

Individual, self-contained PRs are best. That way we can decide whether to accept your fix for each component individually and can give you feedback as you go. We may choose not to accept some components (due to complexity, etc.), but may accept others.

englehardt, Oct 22 '19 05:10

Found in the wild, some tricks a script uses to detect various browsers (and webdriver):

function Y() {
        try {
            if (null != window._phantom || null != window.callPhantom) return 99;
            if (document.documentElement.hasAttribute && document.documentElement.hasAttribute("webdriver") || null != window.domAutomation || null != window.domAutomationController || null != window._WEBDRIVER_ELEM_CACHE) return 98;
            if (void 0 != window.opera && void 0 != window.history.navigationMode || void 0 != window.opr && void 0 != window.opr.addons && "function" == typeof window.opr.addons.installExtension) return 4;
            if (void 0 != window.chrome &&
                "function" == typeof window.chrome.csi && "function" == typeof window.chrome.loadTimes && void 0 != document.webkitHidden && (1 == document.webkitHidden || 0 == document.webkitHidden)) return 3;
            if (void 0 != window.mozInnerScreenY && "number" == typeof window.mozInnerScreenY && void 0 != window.mozPaintCount && 0 <= window.mozPaintCount && void 0 != window.InstallTrigger && void 0 != window.InstallTrigger.install) return 2;
            if (void 0 != document.uniqueID && "string" == typeof document.uniqueID && (void 0 != document.documentMode && 0 <= document.documentMode ||
                    void 0 != document.all && "object" == typeof document.all || void 0 != window.ActiveXObject && "function" == typeof window.ActiveXObject) || window.document && window.document.updateSettings && "function" == typeof window.document.updateSettings) return 1;
            var b = !1;
            try {
                var c = document.createElement("p");
                c.innerText = ".";
                c.style = "text-shadow: rgb(99, 116, 171) 20px -12px 2px";
                b = void 0 != c.style.textShadow
            } catch (d) {}
            return (0 < Object.prototype.toString.call(window.HTMLElement).indexOf("Constructor") || window.webkitAudioPannerNode &&
                window.webkitConvertPointFromNodeToPage) && b && void 0 != window.innerWidth && void 0 != window.innerHeight ? 5 : 0
        } catch (d) {
            return 0
        }
    }

top_url https://www.pragmatismopolitico.com.br/ script_url https://rtbcdn.doubleverify.com/bsredirect5_internal41.js

englehardt, Nov 26 '19 01:11

Specifically relevant to webdriver are: if (document.documentElement.hasAttribute && document.documentElement.hasAttribute("webdriver") || null != window.domAutomation || null != window.domAutomationController || null != window._WEBDRIVER_ELEM_CACHE) return 98;
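
For readability, the same condition can be unpacked into named signals. This is a sketch rather than the script's own code: the function names webdriverSignals and looksLikeWebdriver and the result keys are hypothetical, and the window and document objects are passed in as parameters so the logic can be run against mocks.

```javascript
// Unpacks the minified condition into one named boolean per signal.
// `win` and `doc` stand for `window` and `document`.
function webdriverSignals(win, doc) {
  const root = doc.documentElement;
  return {
    // <html webdriver> attribute set on the document element
    htmlWebdriverAttribute:
      typeof root.hasAttribute === "function" && root.hasAttribute("webdriver"),
    // Globals checked by the script; `!= null` matches its loose
    // `null != x` comparison (true for anything but null/undefined).
    domAutomation: win.domAutomation != null,
    domAutomationController: win.domAutomationController != null,
    webdriverElemCache: win._WEBDRIVER_ELEM_CACHE != null,
  };
}

// Mirrors the original `return 98` branch: any single signal is enough.
function looksLikeWebdriver(win, doc) {
  return Object.values(webdriverSignals(win, doc)).some(Boolean);
}
```

Running webdriverSignals(window, document) in a crawl would show which of the four signals, if any, an OpenWPM-driven Firefox actually exposes.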

englehardt, Nov 26 '19 01:11

From @birdsarah: https://dxr.mozilla.org/mozilla-central/search?q=IsHeadless()&redirect=false might be useful for tracking down differences between headless and headed modes.

englehardt, Apr 28 '20 22:04