0.3.741 - fit_markdown flag not recognized and not set
Hi @unclecode - Just did some testing on 0.3.741, and I noticed that even when fit_markdown is set, result.fit_markdown always returns the placeholder message "Set flag 'fit_markdown' to True to get cleaned HTML content."
Here's how I'm calling it:
from crawl4ai import AsyncWebCrawler

async def main(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            verbose=True,
            headless=True,
            fit_markdown=True,
            bypass_cache=True,
            word_count_threshold=1
        )
        if result.success:
            print(result.fit_markdown)
            return result
        else:
            print(f"Crawl failed: {result.error_message}")
I've tried adjusting a few settings, such as headless=False, but that doesn't seem to affect it.
Hi @chanmathew, I made some changes and will release them in 0.3.743 tonight. After that, you can follow this code:
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    async with AsyncWebCrawler(
        headless=True,  # Set to False to see what is happening
        verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Apple",
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
            ),
        )
        print(len(result.markdown))
        print(len(result.fit_markdown))
        print(len(result.markdown_v2.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
If you only need clean markdown, there’s no need to pass a content filter, just DefaultMarkdownGenerator() would be enough. But if you want fit markdown, you’ll need to provide a content filter. I’m planning to build multiple document filter algorithms tailored for different types of websites and content, along with new approaches for generating markdown.
Here’s how it works: if you don’t pass a user query and the website already has text, meta descriptions, and keywords, the algorithm will use that data. It then crawls the page, applies a clustering algorithm to analyze the connections between different sections, and generates a key based on the title, descriptions, and keywords. Only the relevant parts of the page are retained as the main content.
If you do pass a user query, it will use your query instead, allowing you to selectively extract just the portion of the page you want in your markdown. This adds more flexibility. Keep in mind, this is still an experimental feature.
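To make the filtering idea above concrete, here is a minimal pure-Python sketch of BM25-style relevance filtering. It is not crawl4ai's actual implementation: the helper names (tokenize, bm25_scores, filter_chunks), the page_meta dictionary, and the sample chunks are all hypothetical, and the real filter may weight and segment content differently. It only illustrates the two modes described: falling back to the page's title, description, and keywords when no query is given, or using the user query when one is passed.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per chunk for the given query."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: the number of chunks each term appears in.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_chunks(chunks, user_query=None, page_meta=None, threshold=1.0):
    """Keep the chunks whose BM25 score meets the threshold.

    Hypothetical helper: when user_query is None, the query is built from
    the page metadata (title, description, keywords) instead.
    """
    if user_query is None:
        meta = page_meta or {}
        user_query = " ".join(
            meta.get(key, "") for key in ("title", "description", "keywords")
        )
    scores = bm25_scores(user_query, chunks)
    return [c for c, s in zip(chunks, scores) if s >= threshold]


# Made-up sample data standing in for sections of a crawled page.
chunks = [
    "The apple is a round, edible fruit produced by an apple tree.",
    "Log in | Create account | Donate | Contribute",
    "Apple trees are cultivated worldwide and apple cultivars number in the thousands.",
]
meta = {
    "title": "Apple",
    "description": "Fruit of the apple tree",
    "keywords": "apple, fruit, tree",
}

# No user query: the metadata drives the filter, so the navigation chunk is dropped.
kept_by_meta = filter_chunks(chunks, page_meta=meta, threshold=0.1)

# With a user query: only the chunk that matches the query terms is kept.
kept_by_query = filter_chunks(chunks, user_query="cultivars worldwide", threshold=0.1)
```

In this sketch the threshold plays a role similar to bm25_threshold in the snippet above: raising it keeps fewer, more relevant sections, while lowering it keeps more of the page.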
Thanks @unclecode - got it, will test again once the release is out!
@chanmathew Dear Matthew, I have already released the new version, so you can try it. It's 0.3.744, and 0.3.745 will follow very soon, haha. Please check it and let me know.
Had the same issue; v0.3.745 is working smoothly so far, @unclecode.
@jtha Glad to hear that. Please let me know if anything goes wrong.
@jtha What would be the equivalent call in a Docker deployment? This one doesn't seem to work:
import requests

# headers is assumed to be defined earlier (not shown in the original post)
response = requests.post(
    "https://examplehost.com/crawl",
    headers=headers,
    json={
        "urls": "https://example.com",
        "priority": 10,
        "crawler_params": {
            "fit_markdown": True,
        }
    }
)
@ajinkyaT Unfortunately, the advanced Markdown generator does not come with the Docker version yet. However, I will update it soon and release it next week, so you will be able to communicate with it as well.