
Bypassing scrape protection on etherscan

Open kalanyuz opened this issue 2 years ago • 4 comments

Etherscan recently deployed a Cloudflare script that returns a 403 when the website is accessed from scripts.
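For a scraper to react to this, it first has to tell a Cloudflare block apart from an ordinary 403. A minimal sketch, assuming the blocked response carries a `Server: cloudflare` header (an assumption about the response, not something confirmed in this thread); the function name `looks_like_cloudflare_block` is hypothetical:

```rust
/// Returns true when a response looks like a Cloudflare block.
/// Assumption: Cloudflare-served responses carry `Server: cloudflare`.
pub fn looks_like_cloudflare_block(status: u16, headers: &[(&str, &str)]) -> bool {
    // Only 403 responses are of interest here.
    if status != 403 {
        return false;
    }
    // Header names are case-insensitive, so compare without regard to case.
    headers.iter().any(|(name, value)| {
        name.eq_ignore_ascii_case("server") && value.trim().eq_ignore_ascii_case("cloudflare")
    })
}
```

A caller could use this to decide when to fall back from a plain HTTP request to a browser-driven one.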

kalanyuz avatar Apr 06 '23 02:04 kalanyuz

This is such a dick move from Etherscan, but it doesn't really matter as Etherscan scraping has been disabled for a while with the current Etherface deployment. That being said, I'd be happy to accept a PR for this issue if you're interested in working on this.

volsa avatar Apr 06 '23 15:04 volsa

Do you have any recommendations on where to begin @volsa? Off the top of my head, this situation could be handled with Selenium. Not sure if there's a workaround for Rust.

kalanyuz avatar Apr 07 '23 01:04 kalanyuz

Yeah, Selenium was the first solution that popped into my mind. The other was embedding Python code using PyO3 to use cloudscraper because no such Rust libraries exist, but I'm not sure if the library is even working atm. Long-term, Selenium is probably the better solution though.

volsa avatar Apr 07 '23 12:04 volsa

I did some quick research to see if this can be accomplished in Rust using ChromeDriver, and it kind of works. Key findings were:

  1. ChromeDriver has to be patched before it can be used, because Cloudflare otherwise blocks the request. To do that, download it from https://chromedriver.storage.googleapis.com/index.html?path=112.0.5615.49/ and then apply the patch from https://github.com/ultrafunkamsterdam/undetected-chromedriver/blob/bf7dcf8b5713020de7454844fb80036b8c456503/undetected_chromedriver/patcher.py#L217-L239
  2. The flags --disable-blink-features and --disable-blink-features=AutomationControlled must be set; I haven't tested whether either one alone is sufficient, but it should be?
  3. (macOS ARM only) Patching the ARM ChromeDriver results in panics, so the x86_64 version has to be used via Rosetta
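The patch in step 1 could also be done from Rust instead of Python. As far as I can tell, the linked patcher.py lines search the driver binary for the injected `{window.cdc...;}` block and overwrite it with a placeholder padded to the same length, so the file size does not change. A rough sketch of that idea; `patch_cdc_block`, `find` and the placeholder text are hypothetical names, and this has not been tested against a real ChromeDriver binary:

```rust
/// Overwrites the first `{window.cdc...;}` block in `binary` with a
/// space-padded placeholder of the same length.
/// Returns true if a block was found and patched.
pub fn patch_cdc_block(binary: &mut [u8]) -> bool {
    const START: &[u8] = b"{window.cdc";
    const END: &[u8] = b";}";
    // Hypothetical placeholder; any harmless JS of equal or shorter length works.
    const PLACEHOLDER: &[u8] = b"{console.log(\"patched\")}";

    // Find the start marker, then the nearest end marker after it.
    let Some(start) = find(binary, START, 0) else {
        return false;
    };
    let Some(end_pos) = find(binary, END, start + START.len()) else {
        return false;
    };
    let end = end_pos + END.len();

    let block = &mut binary[start..end];
    if block.len() < PLACEHOLDER.len() {
        // The placeholder would not fit; leave the binary untouched.
        return false;
    }
    // Pad with spaces so the overall length stays identical.
    block.fill(b' ');
    block[..PLACEHOLDER.len()].copy_from_slice(PLACEHOLDER);
    true
}

/// Returns the index of the first occurrence of `needle` at or after `from`.
fn find(haystack: &[u8], needle: &[u8], from: usize) -> Option<usize> {
    haystack
        .get(from..)?
        .windows(needle.len())
        .position(|window| window == needle)
        .map(|i| i + from)
}
```

To use it, read the driver with `std::fs::read`, call `patch_cdc_block` on the bytes, and write them back with `std::fs::write` if it returned true.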

Running the following code, which uses the fantoccini library, should then bypass the Cloudflare protection.

use fantoccini::{ClientBuilder, Locator};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Chrome options passed to the (patched) ChromeDriver.
    let mut caps = serde_json::map::Map::new();
    caps.insert(
        "goog:chromeOptions".to_string(),
        serde_json::json!({
            "args": [
                // "--headless=new",
                "--disable-blink-features",
                "--disable-blink-features=AutomationControlled",
            ]
        }),
    );

    // ChromeDriver must already be listening on port 4444.
    let client = ClientBuilder::native().capabilities(caps).connect("http://localhost:4444").await?;

    client.goto("https://etherscan.io/contractsVerified").await?;
    // Wait until the main content section is present, i.e. the Cloudflare check has passed.
    let res = client.wait().for_element(Locator::Css("#content > section.container-xxl.pt-5.pb-12")).await?;

    // Propagate errors with `?` instead of panicking via `unwrap()`.
    let html = res.html(true).await?;
    println!("{html}");

    // Close the WebDriver session explicitly.
    client.close().await?;
    Ok(())
}

https://user-images.githubusercontent.com/29666622/233849890-57bd2463-0079-46d9-b945-c4101e346ca2.mov

Ideally this can be merged with https://github.com/volsa/etherface/blob/master/etherface-lib/src/api/etherscan.rs

volsa avatar Apr 23 '23 15:04 volsa