[Bug]: Browser path detection failing in Windmill.dev with crawl4ai
crawl4ai version
0.4.247
Expected Behavior
I'm trying to use crawl4ai with Windmill (https://www.windmill.dev/) for browser automation. However, I'm having trouble setting an executable path for the browser.
Issue:
The Windmill documentation (https://www.windmill.dev/docs/advanced/browser_automation#examples) provides an example for launching a browser instance:
const browser = await chromium.launch({
  executablePath: "/usr/bin/chromium",
  args: ['--no-sandbox', '--single-process', '--no-zygote', '--disable-setuid-sandbox', '--disable-dev-shm-usage', '--disable-gpu'],
});
When running crawl4ai without configuring the specific path, I receive the following error:
Error: BrowserType.launch: Executable doesn't exist at /tmp/.cache/ms-playwright/chromium-1148/chrome-linux/chrome
╔════════════════════════════════════════════════════════════╗
║ Looks like Playwright was just installed or updated. ║
║ Please run the following command to download new browsers: ║
║ ║
║ playwright install ║
║ ║
║ <3 Playwright Team ║
╚════════════════════════════════════════════════════════════╝
Or the error:
INFO Error Failed to start browser: [Errno 2] No such file or directory: 'google-chrome'
I suspect that the line browser_path = self._get_browser_path() in async_crawler_strategy.py is unable to automatically detect the browser's location in the Windmill environment.
Question:
How can I properly configure something like executablePath for the browser (e.g., Chromium or Google Chrome) when using crawl4ai within Windmill?
Is there a way to manually specify the path, perhaps through an environment variable or a configuration setting within crawl4ai?
Current Behavior
Both errors quoted under Expected Behavior occur: either Playwright's "Executable doesn't exist at /tmp/.cache/ms-playwright/chromium-1148/chrome-linux/chrome" error, or "Failed to start browser: [Errno 2] No such file or directory: 'google-chrome'".
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
# requirements:
# crawl4ai
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# import os
# os.system("playwright install")
# os.system("playwright install-deps")
# os.system("crawl4ai-setup")


async def scrape(url: str):
    # Build the browser configuration first so it can be passed to the crawler.
    browser_config = BrowserConfig(
        headless=True,
        extra_args=[
            "--no-sandbox",
            "--single-process",
            "--no-zygote",
            "--disable-setuid-sandbox",
            "--disable-dev-shm-usage",
            "--disable-gpu",
        ],
        verbose=True,
    )
    crawl_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(),
        exclude_external_links=True,
        remove_overlay_elements=True,
        process_iframes=False,
    )
    crawler = AsyncWebCrawler(config=browser_config)
    try:
        await crawler.start()
        result = await crawler.arun(url=url, config=crawl_config)
        return result
    finally:
        # Close the crawler even if start() or arun() raised.
        await crawler.close()


def main(url: str):
    result = asyncio.run(scrape(url))
    return result
OS
windmill.dev (cloud) - Linux?
Python version
3.11
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
Just checking if anyone had a chance to look into that issue. Any guidance would be much appreciated! 🙏
@renatocaliari Thx for trying the library. Are you able to create a code snippet example where you simply crawl a page like https://crawl4ai.com using this browser? Then share it here for us to check. Thx
I've updated the issue with the code snippet and details of another related error.
Having the same issue.
Is there a way to manually specify the path, perhaps through an environment variable or a configuration setting within crawl4ai?
Having the same issue in AWS Lambda.
I've done some research on Windmill. Playwright needs to download browser binaries (Chromium, Firefox, WebKit), so in Windmill's containerized environment you'd need to ensure one of the following:
- The browsers are pre-installed in the worker environment
- Or the Playwright installation process can download them at runtime
Alternatively, use a custom Docker image for Windmill workers that includes Playwright and the browsers pre-installed. If you're self-hosting Windmill, you have more control over the worker environment.
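If a browser is already present on the worker (the Windmill docs example quoted in the issue points at /usr/bin/chromium), one option is to hand that path to Playwright directly. Below is a minimal sketch that uses Playwright's Python API rather than crawl4ai itself; I haven't confirmed whether crawl4ai 0.4.247 exposes an equivalent setting. The candidate list and the helper names (`find_browser_executable`, `fetch_title`) are assumptions for illustration, and the launch flags are the ones from the Windmill example.

```python
import asyncio
import shutil
from typing import Iterable, Optional

# Hypothetical candidates; adjust to whatever the worker image actually ships.
CANDIDATES = ("/usr/bin/chromium", "chromium", "chromium-browser", "google-chrome")

# Flags copied from the Windmill documentation example quoted in the issue.
LAUNCH_ARGS = [
    "--no-sandbox",
    "--single-process",
    "--no-zygote",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage",
    "--disable-gpu",
]


def find_browser_executable(candidates: Iterable[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that resolves to an executable, else None."""
    for candidate in candidates:
        # shutil.which handles both bare command names (searched on PATH)
        # and explicit paths (checked directly).
        resolved = shutil.which(candidate)
        if resolved:
            return resolved
    return None


async def fetch_title(url: str) -> str:
    """Launch the detected browser through Playwright and return the page title."""
    # Imported lazily so the path detection above works without Playwright installed.
    from playwright.async_api import async_playwright

    executable = find_browser_executable()
    if executable is None:
        raise FileNotFoundError("No Chromium/Chrome executable found on this worker")

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            executable_path=executable,
            headless=True,
            args=LAUNCH_ARGS,
        )
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.title()
        finally:
            await browser.close()


def main(url: str) -> str:
    # Entry point in the same shape as the script in the issue.
    return asyncio.run(fetch_title(url))
```

If `find_browser_executable()` returns None on the worker, that would suggest no browser is installed there at all, which points back to the pre-installed image or runtime download options above.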