
Remote Playwright support

Open esistgut opened this issue 5 months ago • 1 comments

Is your feature request related to a problem? Please describe. The library uses Playwright, but Playwright has issues running on Linux distributions other than the officially supported ones, which essentially means only Ubuntu. This is an excerpt of `playwright install` on my local Arch Linux system:

BEWARE: your OS is not officially supported by Playwright; downloading fallback build for ubuntu20.04-x64.
Downloading FFMPEG playwright build v1011 from https://cdn.playwright.dev/dbazure/download/playwright/builds/ffmpeg/1011/ffmpeg-linux.zip
2.3 MiB [====================] 100% 0.0s
FFMPEG playwright build v1011 downloaded to /home/esistgut/.cache/ms-playwright/ffmpeg-1011
Playwright Host validation warning: 
╔══════════════════════════════════════════════════════╗
║ Host system is missing dependencies to run browsers. ║
║ Missing libraries:                                   ║
║     libicudata.so.66                                 ║
║     libicui18n.so.66                                 ║
║     libicuuc.so.66                                   ║
║     libxml2.so.2                                     ║
║     libwebp.so.6                                     ║
║     libffi.so.7                                      ║
╚══════════════════════════════════════════════════════╝

Describe the solution you'd like Playwright already has a solution for this problem: https://playwright.dev/docs/docker#remote-connection but ScrapeGraph-AI should provide a way to use it.
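For context, the linked Playwright docs describe running a Playwright server inside the official Docker image and connecting to it over WebSocket. A minimal sketch of that setup (the image tag `v1.46.0-jammy` and matching `playwright@1.46.0` package version are placeholders; they must match your local Playwright client version, and port 3000 matches the `WS_ENDPOINT` default used below):

```shell
# Start a Playwright server in the official Docker image.
# The server listens on ws://127.0.0.1:3000/ from the host's perspective.
docker run -p 3000:3000 --rm --init -it \
  mcr.microsoft.com/playwright:v1.46.0-jammy \
  npx -y playwright@1.46.0 run-server --port 3000 --host 0.0.0.0
```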

Describe alternatives you've considered This is a monkey-patched version of an example from the documentation:

import os
import asyncio
from dotenv import load_dotenv

# --- ScrapeGraphAI imports ---
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# --- Monkey-patch: use a remote Playwright server for ChromiumLoader ---
from scrapegraphai.docloaders.chromium import ChromiumLoader
from playwright.async_api import async_playwright

load_dotenv()

# Read your LLM key + WS endpoint
openai_key = os.getenv("OPENAI_API_KEY")
WS_ENDPOINT = os.getenv("PW_WS_ENDPOINT", "ws://127.0.0.1:3000/")

# Replace ChromiumLoader.ascrape_playwright with a remote-connecting version
async def _remote_ascrape_playwright(self, url: str) -> str:
    # self.browser_config / self.TIMEOUT come from ChromiumLoader (ScrapeGraphAI)
    timeout_ms = getattr(self, "TIMEOUT", 30000)
    browser_config = getattr(self, "browser_config", {}) or {}

    async with async_playwright() as p:
        # Connect to the remote Playwright server (Docker)
        browser = await p.chromium.connect(WS_ENDPOINT)
        ctx = None
        try:
            ctx = await browser.new_context(**browser_config)
            page = await ctx.new_page()
            # networkidle is a good default for JS-heavy pages
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            html = await page.content()
        finally:
            # Clean up the context (if it was created); closing the
            # browser ends the remote session
            if ctx is not None:
                await ctx.close()
            await browser.close()
        return html

# Monkey-patch it on the class so all internal calls use the remote connection
ChromiumLoader.ascrape_playwright = _remote_ascrape_playwright

# ---------------------------------------------------------------
# Your graph as usual
# ---------------------------------------------------------------
graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    },
    # Optional: pass extra options to the internal loader (e.g., proxy)
    # This is how ScrapeGraphAI forwards extra params to the loader
    "loader_kwargs": {
        # "proxy": {"http": "http://user:pass@host:port"}  # example if needed
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Additional context Please note: I'm new to both ScrapeGraphAI and Playwright, so I may have missed something very obvious.

esistgut avatar Aug 25 '25 11:08 esistgut

Hi @esistgut! I'm Dosu and I’m helping the Scrapegraph-ai team.

ScrapeGraphAI does not currently support remote Playwright connections (via WebSocket endpoint) out of the box. The only way to use a remote Playwright server (such as one running in Docker) is to monkey-patch the ChromiumLoader.ascrape_playwright method, as you’ve done in your example. This approach is widely used in the community and is considered the recommended workaround for now source.

There’s an open feature request to add native support for remote Playwright connections, and maintainers have acknowledged the need but haven’t implemented it yet source. Your monkey-patch is correct: it replaces the local browser launch with a remote connection using browser = await p.chromium.connect(WS_ENDPOINT). This lets you bypass OS/browser dependency issues and run Playwright in a supported environment (like Docker).

A few things to keep in mind:

  • Make sure your Playwright client version matches the remote server version for compatibility.
  • Session management and browser context cleanup are important—your example handles this well.
  • If you need to pass extra options (like proxies), use browser_config in your patch.
  • Monitor the feature request for updates if you want native support in the future.
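On the first point, Playwright expects the client and server to agree on the protocol version, which in practice means matching releases. A small hypothetical helper (`versions_compatible` is not part of Playwright or ScrapeGraphAI, just an illustration) for a sanity check before connecting:

```python
def versions_compatible(client: str, server: str) -> bool:
    """Rough check that two Playwright versions match on major.minor.

    Playwright clients and servers generally need matching releases;
    comparing the first two version components catches most mismatches.
    """
    return client.split(".")[:2] == server.split(".")[:2]


# Example: compare the installed client against the Docker image tag
# you launched (e.g. mcr.microsoft.com/playwright:v1.46.0-jammy).
print(versions_compatible("1.46.0", "1.46.2"))  # patch difference: OK
print(versions_compatible("1.46.0", "1.47.0"))  # minor mismatch: not OK
```

The installed client version can be read with `importlib.metadata.version("playwright")`.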

If you run into issues with Playwright in containers or cloud environments, using a remote server is the suggested workaround source. Let me know if you need help with specific configuration or troubleshooting!

To reply, just mention @dosu.



dosubot[bot] avatar Aug 25 '25 11:08 dosubot[bot]