[Bug]: Unable to scrape basic data and disable popups
crawl4ai version
0.4.28
Expected Behavior
Config files:

browser_config = BrowserConfig(
    headless=True,  # Changed to headless mode
    user_agent_mode="random",
    text_mode=True,
    extra_args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-remote-fonts",
        "--disable-images",
        "--disable-software-rasterizer",
        "--disable-dev-shm-usage",
    ],
)

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    magic=True,
    verbose=True,
    log_console=True,
    simulate_user=True,
    wait_until="networkidle",
    only_text=True,
    exclude_external_images=True,
    exclude_external_links=True,
    scan_full_page=True,
    remove_overlay_elements=True,
)
Getting this warning: [CONSOLE]. ℹ Console error: Cannot redefine property: webdriver
I'm using the Chromium browser. Ideally it should be able to bypass / override the webdriver property, and it also doesn't disable the popups that Google Maps shows. Is there any method to solve both of these things?
Current Behavior
It gives the warning / error and is highly unreliable: sometimes it returns the error along with scraped data, and sometimes it produces no output at all.
Is there any method / change in the config files that would lead to consistent results?
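For reference, a minimal sketch of a runner for the configs above — not the exact script from this report, the configs are trimmed to the relevant flags, and the URL is a placeholder since the failing Google Maps URL is not included here:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

browser_config = BrowserConfig(headless=True, user_agent_mode="random", text_mode=True)
crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    magic=True,
    simulate_user=True,
    log_console=True,
    only_text=True,
    remove_overlay_elements=True,
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Placeholder URL; substitute the actual Google Maps URL being crawled
        result = await crawler.arun("https://example.com", config=crawler_config)
        print("success:", result.success)

asyncio.run(main())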
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
OS
Linux
Python version
3.11
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response
@complete-dope Can you also share the URL that's causing this problem?
Hey, can you update to the latest version of crawl4ai? The popup issue might be fixed by this PR: #1529. Let me know if it works, and I'll close this soon!
I'll close this issue, but feel free to continue the conversation and tag me or @SohamKukreti if the issue persists with our latest version: 0.7.7.
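A quick way to confirm which version is actually installed after running pip install -U crawl4ai (plain importlib.metadata, nothing crawl4ai-specific):

# Run after upgrading to confirm the installed version
from importlib.metadata import version
print(version("crawl4ai"))  # should print 0.7.7 or later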
version: crawl4ai-0.7.7
I am also getting this issue. When I load a web URL, I get this error in the browser console:
VM5:8 Uncaught TypeError: Cannot redefine property: webdriver
at Object.defineProperty (<anonymous>)
at <anonymous>:8:8
at <anonymous>:28:7
It seems to be linked to simulate_user and magic, because the error shows up when I set them to True. The images below show examples, and it appears on any URL, but I will give the URL where I first saw it below.
With simulate_user and/or magic as True
With simulate_user and/or magic as False
I also get this error right before the page automatically navigates to the homepage when simulate_user and magic are set to False:
[ERROR]... × Error updating image dimensions: Page.evaluate: Execution context was destroyed, most likely because of a navigation
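To isolate it, here is a minimal A/B sketch (assumed, trimmed from my full script below) that toggles only simulate_user and magic; in my runs the console error only shows up on the True run:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

URL = "https://www.lawfaremedia.org/article/trump-s-immigration-policies-overlook-ai-talent"

async def run(flag: bool):
    browser_config = BrowserConfig(headless=False, enable_stealth=True)
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        log_console=True,  # surfaces the "Cannot redefine property: webdriver" console error
        simulate_user=flag,
        magic=flag,
        wait_until="domcontentloaded",
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(URL, config=crawler_config)
        print(f"simulate_user/magic={flag} -> success={result.success}")

async def main():
    await run(True)   # console error appears
    await run(False)  # no console error in this case

asyncio.run(main())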
My code:
import asyncio
import random
from latest_user_agents import get_random_user_agent
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig, UndetectedAdapter

urls = [
    "https://www.lawfaremedia.org/article/trump-s-immigration-policies-overlook-ai-talent"
]

# Pick one user agent up front so the CrawlerRunConfig reference below resolves
user_agent = get_random_user_agent()
crawler_config = CrawlerRunConfig(
    excluded_tags=["nav", "footer", "header", "form", "img", "a", "style", "iframe", "script"],
    only_text=False,
    exclude_external_links=True,
    exclude_social_media_links=True,
    keep_data_attributes=False,
    cache_mode=CacheMode.BYPASS,
    user_agent=user_agent,
    wait_until="domcontentloaded",
    wait_for_timeout=10000000,
    page_timeout=1000000000,  # random.uniform(35000, 40000),
    delay_before_return_html=round(random.uniform(500, 800), 2),
    max_scroll_steps=int(random.uniform(2, 3)),
    scroll_delay=round(random.uniform(.2, .8), 2),
    simulate_user=False,
    magic=False
)
random_screen_width = random.choice([1080, 1440, 1280])
screen_heights = {
    1080: 995,
    1440: 1113,
    1280: 1120
}
# Height paired with the chosen width
random_screen_height = screen_heights[random_screen_width]
browser_config = BrowserConfig(
    headless=False,
    # text_mode=True,
    # light_mode=True,
    viewport_width=random_screen_width,
    viewport_height=random_screen_height,
    user_agent=get_random_user_agent(),
    ignore_https_errors=True,
    extra_args=[
        # '--headless=new',
        '--force-device-scale-factor=0.8',
        '--ignore-certificate-errors',
        '--no-sandbox',
        '--disable-dev-shm-usage',
        '--disable-setuid-sandbox',
        '--disable-blink-features=AutomationControlled',
        '--disable-web-resources',
        '--disable-gpu',  # Critical for Lambda
        '--single-process'  # Critical for Lambda
    ],
    enable_stealth=True,
)
def is_success(result):
    # Count a crawl as successful only if it returned a result and crawl4ai marked it successful
    return result is not None and result.success

async def test_run_v2():
    async def crawl_with_timeout(crawler, url, crawler_config, timeout):
        try:
            result = await asyncio.wait_for(
                crawler.arun(url, config=crawler_config),
                timeout=timeout
            )
            return result
        except asyncio.TimeoutError:
            print(f"Crawl timeout for {url}")
            return None

    async def crawl_with_retry(url, browser_config, crawler_config, max_retries, timeout, semaphore=None):
        async with AsyncWebCrawler(config=browser_config) as crawler:
            async with semaphore:  # at most 3 crawls at a time (semaphore below)
                for attempt in range(max_retries + 1):
                    result = await crawl_with_timeout(crawler, url, crawler_config, timeout=timeout)
                    if is_success(result):
                        return result
                    if attempt < max_retries:
                        wait_time = 2 ** attempt + random.uniform(0, 1)
                        await asyncio.sleep(wait_time)
                return result

    # Create semaphore to limit concurrency
    semaphore = asyncio.Semaphore(3)
    # Create tasks
    tasks = [crawl_with_retry(url, browser_config, crawler_config, 3, timeout=400002, semaphore=semaphore) for url in urls]
    # Run all tasks (but only 3 concurrent due to the semaphore)
    results = await asyncio.gather(*tasks)
    return results

asyncio.run(test_run_v2())
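One note on the snippet above (an observation, not from the maintainers): the semaphore is acquired inside the AsyncWebCrawler context manager, so every task launches its own browser immediately and only the crawls themselves are limited to three at a time. If the goal is to keep at most three browsers alive at once, a variant like this (same helpers as above) swaps the two context managers:

async def crawl_with_retry(url, browser_config, crawler_config, max_retries, timeout, semaphore):
    async with semaphore:  # limit how many browsers exist at once, not just how many crawls run
        async with AsyncWebCrawler(config=browser_config) as crawler:
            for attempt in range(max_retries + 1):
                result = await crawl_with_timeout(crawler, url, crawler_config, timeout=timeout)
                if is_success(result):
                    return result
                if attempt < max_retries:
                    await asyncio.sleep(2 ** attempt + random.uniform(0, 1))
            return result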