# [Bug]: LLMContentFilter is ignored

### crawl4ai version

0.4.248

### Expected Behavior

The filter is used and the content is passed to the LLM.

### Current Behavior

The filter is not used.

### Is this reproducible?

Yes

### Inputs Causing the Bug

### Steps to Reproduce

1. Start from the example code: https://crawl4ai.com/mkdocs/core/markdown-generation/#43-llmcontentfilter
2. Change the provider to "ollama/deepseek-r1:1.5b"
3. Enable cache bypass
4. Update the URL to something more substantial
5. Run the code
### Code snippets

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter


async def main():
    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        provider="ollama/deepseek-r1:1.5b",  # or your preferred provider
        api_token="your-api-token",  # or use environment variable
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=4096,  # Adjust based on your needs
        verbose=True,
    )

    config = CrawlerRunConfig(
        content_filter=filter,
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://docs.crossplane.io/latest/concepts/environment-configs",
            config=config,
        )
        print(result.fit_markdown)  # Filtered markdown content
        # print(filter.filter_content(result.html, True))  # running the filter manually on the content works


if __name__ == "__main__":
    asyncio.run(main())
```
### OS

Windows

### Python version

3.12.8

### Browser

No response

### Browser version

No response

### Error logs & Screenshots (if applicable)

```
[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://docs.crossplane.io/latest/concepts/environ... | Status: True | Time: 1.08s
[SCRAPE].. ◆ Processed https://docs.crossplane.io/latest/concepts/environ... | Time: 101ms
[COMPLETE] ● https://docs.crossplane.io/latest/concepts/environ... | Status: True | Total: 1.18s
```
@Natabu Thanks for reporting this. I'll check it today.
+1
@Natabu There's an issue in the documentation. The `LLMContentFilter` has to be passed in via the markdown generator, as follows:
```python
# Initialize LLM filter with specific instruction
filter = LLMContentFilter(
    provider="ollama/deepseek-r1:1.5b",  # or your preferred provider
    instruction="""
    Focus on extracting the core educational content.
    Include:
    - Key concepts and explanations
    - Important code examples
    - Essential technical details
    Exclude:
    - Navigation elements
    - Sidebars
    - Footer content
    Format the output as clean markdown with proper code blocks and headers.
    """,
    chunk_token_threshold=4096,  # Adjust based on your needs
    verbose=True,
)

md_generator = DefaultMarkdownGenerator(
    content_filter=filter,
    options={"ignore_links": True},
)

config = CrawlerRunConfig(
    markdown_generator=md_generator,
    cache_mode=CacheMode.BYPASS,
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://docs.crossplane.io/latest/concepts/environment-configs",
        config=config,
    )
    print(result.markdown)  # Filtered markdown content
```
We'll fix the documentation and the CrawlerRunConfig interface so that it no longer accepts a content filter directly; it has to be passed in via the markdown generator.
Cc: @unclecode (ref to our whatsapp conversation)
Thank you @aravindkarnam. That works on my end. Appreciate your time!
Patched the documentation now; the change will land in the upcoming release. Closing the issue for now, since this isn't a bug but a gap in the documentation.
I can't seem to get it to work. I only get the markdown result, but it's not LLM-processed...
```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig, BrowserConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter


async def main():
    llm_config = LLMConfig(
        provider="openai/gpt-4o",  # Try gpt-4o for sanity
        api_token="env:OPENAI_API_KEY",
    )

    filter = LLMContentFilter(
        llm_config=llm_config,
        instruction="bullet points ONLY of the content.",
        verbose=True,
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=filter,
        options={"ignore_links": True},
    )

    run_config = CrawlerRunConfig(
        markdown_generator=md_generator,
        cache_mode=CacheMode.BYPASS,
        verbose=True,
    )

    browser_config = BrowserConfig(headless=True, verbose=True)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://developer.atlassian.com/platform/forge/getting-started/",
            config=run_config,
        )
        print("Extracted markdown:", result.markdown)
        if hasattr(result, "error_message") and result.error_message:
            print("LLM Extraction Error:", result.error_message)


if __name__ == "__main__":
    asyncio.run(main())
```
Where is my mistake?
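(One possible explanation, not confirmed in this thread: in newer crawl4ai releases `result.markdown` is a `MarkdownGenerationResult`-style container, so the LLM-filtered output lives in its `fit_markdown` attribute, while printing the container itself shows the raw markdown. The `StubMarkdownResult` class below is a hypothetical stand-in for that container, only to illustrate the access pattern without a crawl or an LLM call.)

```python
# Hypothetical stand-in for crawl4ai's markdown result container,
# for illustration only: str() yields the raw markdown, while the
# content-filter output is found in the fit_markdown attribute.
from dataclasses import dataclass


@dataclass
class StubMarkdownResult:
    raw_markdown: str  # unfiltered markdown for the whole page
    fit_markdown: str  # markdown produced by the content filter

    def __str__(self) -> str:
        # Printing the container shows the raw (unfiltered) markdown
        return self.raw_markdown


result_markdown = StubMarkdownResult(
    raw_markdown="# Page\nNav links... full page text...",
    fit_markdown="- key point one\n- key point two",
)

print(result_markdown)               # looks unfiltered
print(result_markdown.fit_markdown)  # the filtered bullet points
```

If this assumption holds, the snippet above would mean `print("Extracted markdown:", result.markdown)` shows the raw page, and `result.markdown.fit_markdown` is where to look for the filtered version.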
FYI, the docs are still incorrect, and they misled me.
@tropxy Sorry about the delay. The documentation is fixed now with v0.7! Thanks again for everyone who reported this and kept following up!