
[Bug]: LLMContentFilter is ignored

Open Natabu opened this issue 1 year ago • 6 comments

crawl4ai version

0.4.248

Expected Behavior

The filter is used and the page content is passed to the LLM.

Current Behavior

The filter is not used.

Is this reproducible?

Yes

Inputs Causing the Bug

No response
Steps to Reproduce

1. Attempt to use the example code: https://crawl4ai.com/mkdocs/core/markdown-generation/#43-llmcontentfilter
2. Change the provider to: "ollama/deepseek-r1:1.5b"
3. Enable cache bypass
4. Update the URL to something more substantial
5. Run the code

Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter

async def main():
    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        provider="ollama/deepseek-r1:1.5b",  # or your preferred provider
        api_token="your-api-token",  # or use environment variable
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=4096,  # Adjust based on your needs
        verbose=True
    )

    config = CrawlerRunConfig(
        content_filter=filter,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crossplane.io/latest/concepts/environment-configs", config=config)
        print(result.fit_markdown)  # Filtered markdown content

        # print(filter.filter_content(result.html, True))  # running the filter manually on the content works

if __name__ == "__main__":
    asyncio.run(main())
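
For reference, the commented-out line above corresponds to invoking the filter manually on the crawled HTML, which does produce filtered output. A minimal sketch of that call, assuming filter_content(html, ignore_cache) returns a list of filtered chunks (the meaning of the second positional argument and the join are assumptions, not confirmed from the source):

    # Sketch: invoke the LLM filter manually on the crawled HTML.
    # Assumes filter_content returns a list of filtered chunks (List[str]).
    chunks = filter.filter_content(result.html, True)
    print("\n".join(chunks))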

OS

Windows

Python version

3.12.8

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://docs.crossplane.io/latest/concepts/environ... | Status: True | Time: 1.08s
[SCRAPE].. ◆ Processed https://docs.crossplane.io/latest/concepts/environ... | Time: 101ms
[COMPLETE] ● https://docs.crossplane.io/latest/concepts/environ... | Status: True | Total: 1.18s

Natabu avatar Feb 02 '25 06:02 Natabu

@Natabu Thanks for reporting this. I'll check it today.

aravindkarnam avatar Feb 05 '25 07:02 aravindkarnam

+1

marcelocorreia avatar Feb 07 '25 03:02 marcelocorreia

@Natabu There's an issue in the documentation. The LLMContentFilter has to be passed in via the markdown generator, as follows:

    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        provider="ollama/deepseek-r1:1.5b",  # or your preferred provider
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=4096,  # Adjust based on your needs
        verbose=True
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=filter,
        options={"ignore_links": True}
    )

    config = CrawlerRunConfig(
        markdown_generator=md_generator,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crossplane.io/latest/concepts/environment-configs", config=config)
        print(result.markdown)  # Filtered markdown content

We'll fix the documentation and update the CrawlerRunConfig interface so that it no longer accepts a content filter directly; instead, the filter must be passed in via the markdown generator.
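
For anyone following along, where the filtered output lands depends on the release. A minimal sketch, assuming the MarkdownGenerationResult shape from recent versions (on 0.4.x the same object is exposed as result.markdown_v2, and result.markdown is a plain string):

        md = result.markdown  # MarkdownGenerationResult in recent releases
        print(md.raw_markdown)  # unfiltered markdown
        print(md.fit_markdown)  # LLM-filtered markdown; empty if the filter never ran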

Cc: @unclecode (ref to our whatsapp conversation)

aravindkarnam avatar Feb 12 '25 13:02 aravindkarnam

Thank you @aravindkarnam. That works on my end. Appreciate your time!

Natabu avatar Feb 19 '25 23:02 Natabu

Patched the documentation now; the fix will go out in the upcoming release. Closing the issue for now, since this isn't a bug but an error in the documentation.

aravindkarnam avatar May 07 '25 09:05 aravindkarnam

I can't seem to get it to work. I only get the markdown result, but it isn't LLM-processed...

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig, BrowserConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter
import os

async def main():
    llm_config = LLMConfig(
        provider="openai/gpt-4o",  # Try gpt-4o for sanity
        api_token="env:OPENAI_API_KEY"
    )
    filter = LLMContentFilter(
        llm_config=llm_config,
        instruction="bullet points ONLY of the content.",
        verbose=True
    )
    md_generator = DefaultMarkdownGenerator(
        content_filter=filter,
        options={"ignore_links": True}
    )
    run_config = CrawlerRunConfig(
        markdown_generator=md_generator,
        cache_mode=CacheMode.BYPASS,
        verbose=True
    )
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://developer.atlassian.com/platform/forge/getting-started/",
            config=run_config
        )
        print("Extracted markdown:", result.markdown)
        if hasattr(result, "error_message") and result.error_message:
            print("LLM Extraction Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

Where is my mistake?

michox avatar May 24 '25 19:05 michox

FYI, the docs are still incorrect and misled me.

tropxy avatar Jul 08 '25 01:07 tropxy

@tropxy Sorry about the delay. The documentation is fixed now with v0.7! Thanks again to everyone who reported this and kept following up!

aravindkarnam avatar Jul 13 '25 13:07 aravindkarnam