
[Bug]: Unable to scrape basic data and disable popups

Open · complete-dope opened this issue 11 months ago

crawl4ai version

0.4.28

Expected Behavior

Config files:

browser_config = BrowserConfig(
    headless=True,  # changed to headless mode
    user_agent_mode="random",
    text_mode=True,
    extra_args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-remote-fonts",
        "--disable-images",
        "--disable-software-rasterizer",
        "--disable-dev-shm-usage",
    ],
)

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    magic=True,
    verbose=True,
    log_console=True,
    simulate_user=True,
    wait_until="networkidle",
    only_text=True,
    exclude_external_images=True,
    exclude_external_links=True,
    scan_full_page=True,
    remove_overlay_elements=True,
)
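
For reference, this is roughly how the two configs above would be wired into a crawl (a minimal sketch mirroring the AsyncWebCrawler usage shown later in this thread; the target URL is a placeholder):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def main():
    # browser_config and crawler_config as defined above
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com", config=crawler_config)
        if result.success:
            print(result.markdown[:500])  # preview of the extracted markdown

asyncio.run(main())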

Getting this warning: [CONSOLE]. ℹ Console error: Cannot redefine property: webdriver

Using the Chromium browser. Ideally it should be able to bypass or change the webdriver property; this also doesn't disable the popups that Google Maps sends, so is there any method to solve both of these things?
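
For context on where "Cannot redefine property: webdriver" typically comes from: stealth layers override navigator.webdriver with an init script, and Object.defineProperty makes the newly defined own property non-configurable by default, so if two layers inject the same patch, the second defineProperty call throws exactly this TypeError. A generic Playwright sketch of the technique (an illustration of the mechanism, not crawl4ai's internal code):

import asyncio
from playwright.async_api import async_playwright

# "configurable: true" lets a later script redefine the property;
# omitting it is what makes a second defineProperty call throw
# "TypeError: Cannot redefine property: webdriver".
WEBDRIVER_PATCH = """
Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined,
    configurable: true,
});
"""

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        await context.add_init_script(WEBDRIVER_PATCH)  # runs before any page script
        page = await context.new_page()
        await page.goto("https://example.com")
        print(await page.evaluate("navigator.webdriver"))  # -> None
        await browser.close()

asyncio.run(main())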

Current Behavior

Gives the warning/error and is highly unreliable. Sometimes it produces the error plus scraped data; sometimes it doesn't output anything at all.

Is there any method or change in the config files that would lead to consistent results?

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets


OS

Linux

Python version

3.11

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

complete-dope avatar Feb 20 '25 20:02 complete-dope

@complete-dope Can you also share the URL that's causing this problem?

aravindkarnam avatar Mar 01 '25 13:03 aravindkarnam

Hey, can you update to the latest version of crawl4ai? The popup issue might be fixed by this PR: #1529. Let me know if it works; I'll close this soon!

SohamKukreti avatar Nov 10 '25 18:11 SohamKukreti

I'll close this issue, but feel free to continue the conversation and tag me or @SohamKukreti if the issue persists with our latest version: 0.7.7.

ntohidi avatar Nov 14 '25 11:11 ntohidi

version: crawl4ai-0.7.7

I am also getting this issue. When I run a crawl on a URL, I get this error in the browser console:

VM5:8 Uncaught TypeError: Cannot redefine property: webdriver
    at Object.defineProperty (<anonymous>)
    at <anonymous>:8:8
    at <anonymous>:28:7

It seems to be linked to simulate_user and magic, because when I set either to True the error shows up. The images below show examples, and it shows up on any URL, but I'll give the URL where I first saw it below.

With simulate_user and/or magic set to True:

[screenshot]

With simulate_user and/or magic set to False:

[screenshot]

I also get this error right before the page automatically navigates to the homepage when simulate_user and magic are set to False:

[ERROR]... × Error updating image dimensions: Page.evaluate: Execution context was destroyed, most likely because of a navigation
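
That second error is a race between page.evaluate and an in-flight navigation: the evaluated script's execution context is destroyed when the page unloads. At the Playwright level a defensive wrapper looks something like this (a sketch only; safe_evaluate is a hypothetical helper, not part of crawl4ai):

from playwright.async_api import Error as PlaywrightError

async def safe_evaluate(page, script, retries=2):
    # Retry evaluate() when a navigation destroys the execution context.
    for _ in range(retries):
        try:
            return await page.evaluate(script)
        except PlaywrightError as exc:
            if "Execution context was destroyed" not in str(exc):
                raise
            # Wait for the new document to load before trying again.
            await page.wait_for_load_state("domcontentloaded")
    return await page.evaluate(script)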

My code:

import asyncio 
import random 
from latest_user_agents import get_random_user_agent 
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig, UndetectedAdapter 

urls = [
"https://www.lawfaremedia.org/article/trump-s-immigration-policies-overlook-ai-talent"
]

crawler_config = CrawlerRunConfig(
    excluded_tags=["nav", "footer", "header", "form", "img", "a", "style", "iframe", "script"],
    only_text=False,
    exclude_external_links=True,
    exclude_social_media_links=True,
    keep_data_attributes=False,
    cache_mode=CacheMode.BYPASS,
    user_agent=get_random_user_agent(),
    wait_until="domcontentloaded",
    wait_for_timeout=10000000,
    page_timeout=1000000000, #random.uniform(35000, 40000), 
    delay_before_return_html=round(random.uniform(500,800),2),
    max_scroll_steps=int(random.uniform(2,3)),
    scroll_delay=round(random.uniform(.2, .8), 2),
    simulate_user=False,
    magic=False
)

random_screen_width = random.choice([1080, 1440, 1280])
screen_heights = {
    1080: 995,
    1440: 1113,
    1280: 1120
}

browser_config = BrowserConfig(
    headless=False, 
    # text_mode=True,
    # light_mode=True,
    viewport_width=random_screen_width,
    viewport_height=screen_heights[random_screen_width],  # height matched to the chosen width
    user_agent=get_random_user_agent(),
    ignore_https_errors=True,
    
    extra_args=[
        # '--headless=new',
        '--force-device-scale-factor=0.8',
        '--ignore-certificate-errors',
        '--no-sandbox',
        '--disable-dev-shm-usage',
        '--disable-setuid-sandbox',
        '--disable-blink-features=AutomationControlled',
        '--disable-web-resources',
        '--disable-gpu', # Critical for Lambda
        '--single-process'  # Critical for Lambda 
    ],
    enable_stealth=True,
)

def is_success(result):
    # A crawl counts as successful only if it returned a result that reports success.
    return result is not None and result.success

async def test_run_v2():
    async def crawl_with_timeout(crawler, url, crawler_config, timeout):
        try:
            result = await asyncio.wait_for(
                crawler.arun(url, config=crawler_config),
                timeout=timeout
            )
            return result
        except asyncio.TimeoutError:
            print(f"Crawl timeout for {url}")
            return None 
    
    async def crawl_with_retry(url, browser_config, crawler_config, max_retries, timeout, semaphore=None):
        async with AsyncWebCrawler(config=browser_config) as crawler:
            async with semaphore:  # limit concurrent crawls (3 at a time here)
                for attempt in range(max_retries + 1):
                    result = await crawl_with_timeout(crawler, url, crawler_config, timeout=timeout)
                    if is_success(result):
                        return result
                    if attempt < max_retries:
                        wait_time = 2 ** attempt + random.uniform(0, 1)
                        await asyncio.sleep(wait_time)
                return result
    
    # Create semaphore to limit concurrency to n number
    semaphore = asyncio.Semaphore(3)

    # Create tasks
    tasks = [crawl_with_retry(url, browser_config, crawler_config, 3, timeout=400002, semaphore=semaphore) for url in urls]

    # Run all tasks (but only n number of concurrent due to semaphore)
    results = await asyncio.gather(*tasks)

    return results 
    
asyncio.run(test_run_v2())
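
If enable_stealth=True was also set on the runs where the error appeared, a minimal A/B test (an assumption worth verifying, not a confirmed fix) is to enable only one anti-detection layer at a time:

# Variant A: stealth patch only
browser_config = BrowserConfig(headless=False, enable_stealth=True)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, simulate_user=False, magic=False)

# Variant B: magic/simulate_user only
browser_config = BrowserConfig(headless=False, enable_stealth=False)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, simulate_user=True, magic=True)

If only the combination reproduces "Cannot redefine property: webdriver", that points at the two layers injecting the same navigator.webdriver patch twice.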

Socvest avatar Nov 15 '25 23:11 Socvest