crawl4ai
[Bug]: CrawlerRunConfig is not consistent across systems/environments
crawl4ai version
0.4.3b3
Expected Behavior
I am trying to exclude certain HTML tags while scraping a web page. This is how I defined my CrawlerRunConfig:
run_config = CrawlerRunConfig(
    excluded_tags=["nav"],
    disable_cache=True
)
I use it as follows:
async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://crawl4ai.com/mkdocs/api/arun/",
        config=run_config
    )
I expect not to see any nav tags in my markdown.
Current Behavior
I still see everything on the web page; the nav tags are not filtered out. Using the following
run_config = CrawlerRunConfig(
    excluded_tags=["nav"],
    disable_cache=True
)
async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://crawl4ai.com/mkdocs/api/arun/",
        config=run_config
    )
gives the same result as using
async with AsyncWebCrawler() as crawler:
    # Run the crawler on a URL
    result = await crawler.arun(
        url="https://www.example.com"
    )
What might be the reason?
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
I am running this inside VS Code on macOS with a zsh terminal. I tried creating two fresh virtual environments with only the crawl4ai dependencies installed, and it still doesn't work as expected. However, when I shared the same file with a friend, it worked as expected on his system.
I also restarted VS Code multiple times.
The versions of crawl4ai and playwright are the same on both systems.
playwright - 1.50.0
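To double-check that the two environments really match, something like the following could be run on both machines and the output compared. This is only a minimal sketch using the standard library; `installed_versions` is a hypothetical helper name and not part of crawl4ai.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The interpreter path shows which virtual environment is actually active.
    print("python:", sys.version.split()[0])
    print("executable:", sys.executable)
    for name, version in installed_versions(["crawl4ai", "playwright"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Printing `sys.executable` is useful here because VS Code may run the file with a different interpreter than the one activated in the zsh terminal.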
Code snippets
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def crawl():
    try:
        run_config = CrawlerRunConfig(
            excluded_tags=["nav"],  # Remove entire tag blocks
            disable_cache=True
        )
        # Create an instance of AsyncWebCrawler
        async with AsyncWebCrawler() as crawler:
            # Run the crawler on a URL
            result = await crawler.arun(
                url="https://crawl4ai.com/mkdocs/api/arun/",
                config=run_config
            )
            # Extracted markdown content
            markdown_content = result.cleaned_html
            # Define output file path
            output_file = "crawl_test.md"
            # Write the markdown content to a file
            with open(output_file, "w", encoding="utf-8") as file:
                file.write(markdown_content)
            print(f"Markdown content successfully saved to {output_file}")
    except Exception as e:
        print(f"Error occurred: {e}")


def main():
    asyncio.run(crawl())


if __name__ == "__main__":
    main()
OS
macOS
Python version
3.10.9
Browser
Chrome
Browser version
132.0.6834.112
Error logs & Screenshots (if applicable)
I am attaching my output (result.cleaned_html), which still contains nav tags: crawl_test.md
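To make the check objective rather than eyeballing the file, the number of nav tags in the output can be counted. The following is a rough sketch using only the standard library's `html.parser`; `TagCounter` and `count_tag` are hypothetical names introduced for illustration, not crawl4ai APIs.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening occurrences of one tag name in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag):
    """Return how many times <tag ...> opens in the given HTML."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count


# Assumed usage inside crawl(), after the arun() call:
#     print("nav tags remaining:", count_tag(result.cleaned_html, "nav"))
```

If the exclusion works, the count for `result.cleaned_html` should be 0; on my system it is not.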