
[Bug]: Unable to scrape Cloudflare-protected sites

Open complete-dope opened this issue 11 months ago • 4 comments

crawl4ai version

0.4.248

Expected Behavior

It should be able to scrape Cloudflare-protected sites by bypassing the challenge, but it is unable to do so.

Current Behavior

It is unable to bypass Cloudflare protection on sites such as https://food52.com/.

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets


OS

Linux

Python version

3.11

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

complete-dope avatar Feb 14 '25 18:02 complete-dope

@complete-dope Can you share your code snippet, or the CrawlerRunConfig and BrowserConfig with which you are trying to scrape this site?

aravindkarnam avatar Feb 15 '25 09:02 aravindkarnam

I'm running crawl4ai in the Docker container, so I simply send the URL to the container and let it do the work.

BarryBahrami avatar Feb 16 '25 16:02 BarryBahrami

I am using the default configuration for scraping.

complete-dope avatar Feb 19 '25 18:02 complete-dope

@complete-dope @BarryBahrami

Hey! I just tested crawling the website, and it worked fine, both with and without Docker. Could you share the config you used so we can compare?

The code I run:

import asyncio

import requests
async def call():
    # Configuration objects converted to the required JSON structure
    browser_config_payload = {
        "type": "BrowserConfig",
        "params": {"headless": True}
    }
    crawler_config_payload = {
        "type": "CrawlerRunConfig",
        "params": {"stream": False, "cache_mode": "bypass"} # Use string value of enum
    }
    
    crawl_payload = {
        "urls": ["https://food52.com/"],
        "browser_config": browser_config_payload,
        "crawler_config": crawler_config_payload
    }
    response = requests.post(
        "http://localhost:11235/crawl", # Updated port
        # headers={"Authorization": f"Bearer {token}"},  # If JWT is enabled
        json=crawl_payload
    )
    print(f"Status Code: {response.status_code}")
    if response.ok:
        print(response.json())
    else:
        print(f"Error: {response.text}")

if __name__ == "__main__":
    asyncio.run(call())
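For comparison, here is a small helper that builds the same request payload, extended with a couple of browser options that are sometimes tried against Cloudflare challenges (running non-headless, setting a custom user agent). Note that `user_agent` as a BrowserConfig parameter is an assumption here; check the docs of your crawl4ai version before relying on it. This is a sketch of the payload shape only, not a guaranteed bypass:

```python
def build_crawl_payload(urls, headless=True, user_agent=None):
    """Build the JSON payload for the crawl4ai Docker /crawl endpoint.

    `user_agent` is an assumed BrowserConfig parameter; verify it exists
    in your crawl4ai version before depending on it.
    """
    browser_params = {"headless": headless}
    if user_agent:
        browser_params["user_agent"] = user_agent
    return {
        "urls": urls,
        "browser_config": {
            "type": "BrowserConfig",
            "params": browser_params,
        },
        "crawler_config": {
            "type": "CrawlerRunConfig",
            # "bypass" here refers to the cache mode, not Cloudflare
            "params": {"stream": False, "cache_mode": "bypass"},
        },
    }

# Example: non-headless with a desktop user agent
payload = build_crawl_payload(
    ["https://food52.com/"],
    headless=False,
    user_agent="Mozilla/5.0",
)
```

You would then POST this payload to `http://localhost:11235/crawl` as in the snippet above.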

Also, try running your test again with the latest version of the library and the Docker image; we released a new version just a few days ago.

Here are the tutorial videos in case they help:

Library: https://youtu.be/xo3qK6Hg9AA?feature=shared
Docker: https://youtu.be/RwT1MlRfbrA?feature=shared

ntohidi avatar May 05 '25 10:05 ntohidi

I'll close this issue, but feel free to continue the conversation and tag me if the issue persists with our latest version: 0.7.7.

ntohidi avatar Nov 14 '25 10:11 ntohidi