
0.3.741 - fit_markdown flag not recognized and not set

Open chanmathew opened this issue 1 year ago • 6 comments

Hi @unclecode - Just did some testing on 0.3.741, and I noticed that even when fit_markdown is set, result.fit_markdown always returns the placeholder string "Set flag 'fit_markdown' to True to get cleaned HTML content."

Here's how I'm calling it:

import asyncio

from crawl4ai import AsyncWebCrawler

async def main(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            verbose=True,
            headless=True,
            fit_markdown=True,
            bypass_cache=True,
            word_count_threshold=1
        )
        if result.success:
            print(result.fit_markdown)
            return result
        else:
            print(f"Crawl failed: {result.error_message}")

I've tried adjusting a few settings, such as headless=False, but that doesn't seem to affect it.

chanmathew avatar Nov 24 '24 02:11 chanmathew

Hi @chanmathew, I made some changes and will release them in 0.3.743 tonight; then you can follow this code:

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    async with AsyncWebCrawler(
            headless=True,  # Set to False to see what is happening
            verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Apple",
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
            ),
        )
        print(len(result.markdown))
        print(len(result.fit_markdown))
        print(len(result.markdown_v2.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())

If you only need clean markdown, there's no need to pass a content filter; DefaultMarkdownGenerator() alone is enough. But if you want fit markdown, you'll need to provide a content filter. I'm planning to build multiple document filter algorithms tailored for different types of websites and content, along with new approaches for generating markdown.

Here’s how it works: if you don’t pass a user query and the website already has text, meta descriptions, and keywords, the algorithm will use that data. It then crawls the page, applies a clustering algorithm to analyze the connections between different sections, and generates a key based on the title, descriptions, and keywords. Only the relevant parts of the page are retained as the main content.

If you do pass a user query, it will use your query instead, allowing you to selectively extract just the portion of the page you want in your markdown. This adds more flexibility. Keep in mind, this is still an experimental feature.
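To give a rough intuition for how a BM25-based filter decides what to keep, here is a toy, dependency-free sketch. This is not crawl4ai's actual implementation (which also does clustering and uses page metadata when no query is given); it only illustrates the core idea of scoring text chunks against query terms with the classic BM25 formula and retaining those above a threshold:

```python
import math

def bm25_scores(chunks, query, k1=1.5, b=0.75):
    """Score each text chunk against the query terms with classic BM25."""
    docs = [c.lower().split() for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)  # average chunk length
    n = len(docs)
    scores = []
    for d in docs:
        s = 0.0
        for t in query.lower().split():
            df = sum(1 for doc in docs if t in doc)        # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            f = d.count(t)                                  # term frequency
            s += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def filter_chunks(chunks, query, threshold=0.1):
    """Keep only chunks whose BM25 score against the query exceeds the threshold."""
    return [c for c, s in zip(chunks, bm25_scores(chunks, query)) if s > threshold]

chunks = [
    "Apple Inc. designs consumer electronics and software.",
    "The apple tree is a deciduous tree in the rose family.",
    "Contact us newsletter signup footer links.",
]
kept = filter_chunks(chunks, "apple tree fruit")
# Boilerplate like the footer chunk scores zero and is dropped.
```

Raising the threshold (analogous to bm25_threshold in BM25ContentFilter) makes the filter stricter, keeping only the chunks most relevant to the query.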

unclecode avatar Nov 27 '24 08:11 unclecode

Thanks @unclecode - got it, will test again once the release is out!

chanmathew avatar Nov 28 '24 04:11 chanmathew

@chanmathew Dear Matthew, I already released the version, so you can try it. It's 0.3.744, and it will be 0.3.745 very soon, haha. Please check it and let me know.

unclecode avatar Nov 28 '24 11:11 unclecode

Had the same issue; v0.3.745 is working smoothly so far, @unclecode.

jtha avatar Nov 28 '24 15:11 jtha

@jtha Glad to hear that, please let me know if anything goes wrong.

unclecode avatar Nov 28 '24 15:11 unclecode

@jtha What would be the equivalent request in a Docker deployment setting? This one doesn't seem to work:

import requests

response = requests.post(
    "https://examplehost.com/crawl",
    headers=headers,
    json={
        "urls": "https://example.com",
        "priority": 10,
        "crawler_params": {
            "fit_markdown": True,
        },
    },
)

ajinkyaT avatar Jan 15 '25 07:01 ajinkyaT

@ajinkyaT Unfortunately, the advanced Markdown generator is not available in the Docker version yet. However, I will update it soon and release it next week, so you will be able to use it through the API as well.

unclecode avatar Jan 16 '25 12:01 unclecode