[Bug]: CrawlerRunConfig is not consistent across systems/environments

Open TejaCherukuri opened this issue 1 year ago • 0 comments

crawl4ai version

0.4.3b3

Expected Behavior

I am trying to exclude certain HTML tags while scraping a web page. This is how I defined my CrawlerRunConfig():

run_config = CrawlerRunConfig(
    excluded_tags=["nav"],
    disable_cache=True
)

I use it as follows:

async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://crawl4ai.com/mkdocs/api/arun/",
        config=run_config
    )

I expect to not see any nav tags inside my markdown.

Current Behavior

I can still see everything on the web page; nothing is filtered out. Using the config below

run_config = CrawlerRunConfig(
    excluded_tags=["nav"],
    disable_cache=True
)
async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://crawl4ai.com/mkdocs/api/arun/",
        config=run_config
    )

gives the same result as running without any config:

async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://www.example.com"
    )

What might be the reason?

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

I am running this inside VS Code on macOS with a zsh terminal. I tried creating two fresh virtual environments with only the crawl4ai dependencies installed, and it still doesn't work as expected. However, when I shared the same file with a friend and asked him to run it, it worked on his machine.

I also restarted VS Code multiple times.

The versions of crawl4ai and playwright are the same on both systems.
playwright - 1.50.0
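
To compare the two machines more carefully, a small script like the one below could be run in each environment. This is only a sketch: the helper name `installed_versions` is made up for illustration, and it reads the installed distribution metadata through the standard library rather than through anything crawl4ai provides.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment
            found[name] = None
    return found


if __name__ == "__main__":
    # Interpreter and OS details, useful when diffing two machines
    print("python:", sys.version.split()[0])
    print("platform:", platform.platform())
    for name, ver in installed_versions(["crawl4ai", "playwright"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it with the same interpreter that VS Code uses for the script should show whether the active virtual environment is really the one with crawl4ai 0.4.3b3 installed.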

Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl():
    try:
        run_config = CrawlerRunConfig(
            excluded_tags=["nav"],  # Remove entire tag blocks
            disable_cache=True
        )
        # Create an instance of AsyncWebCrawler
        async with AsyncWebCrawler() as crawler:
            # Run the crawler on a URL
            result = await crawler.arun(
                url="https://crawl4ai.com/mkdocs/api/arun/",
                config=run_config
            )

            # Extracted content (this is the cleaned HTML, not the markdown)
            markdown_content = result.cleaned_html

            # Define output file path
            output_file = "crawl_test.md"

            # Write the extracted content to a file
            with open(output_file, "w", encoding="utf-8") as file:
                file.write(markdown_content)

            print(f"Markdown content successfully saved to {output_file}")

    except Exception as e:
        print(f"Error occurred: {e}")

def main():
    asyncio.run(crawl())

if __name__ == "__main__":
    main()

OS

macOS

Python version

3.10.9

Browser

Chrome

Browser version

132.0.6834.112

Error logs & Screenshots (if applicable)

My output (result.cleaned_html) still contains nav tags; the file is attached: crawl_test.md
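
For an objective check instead of eyeballing the file, the number of nav tags in the output can be counted. The snippet below is a minimal sketch using only the standard library; `TagCounter` and `count_tag` are hypothetical helper names, not part of crawl4ai.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given opening tag appears in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag="nav"):
    """Return the number of opening <tag> elements found in html."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count
```

Calling `count_tag(result.cleaned_html)` on both machines should return 0 where the exclusion works and a positive number where it does not.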

Feb 12 '25 18:02 TejaCherukuri