crawl4ai icon indicating copy to clipboard operation
crawl4ai copied to clipboard

[Bug]: Wrong scoring with depth scorer

Open Joorrit opened this issue 8 months ago • 4 comments

crawl4ai version

crawl4ai-0.6.2

Expected Behavior

When using the Best First Crawling Strategy together with the path depth scorer, while setting the optimal_depth to 0, I expect the crawler to first crawl the URLs with the shortest path. This would make sense because they are closer to the optimal depth.

Current Behavior

It is doing the exact opposite: it is first crawling the URLs with the longest depth.

bff_strategy.py states the following: "Lower scores are treated as higher priority."

The PathDepthScorer calculates the score like this: 1.0 / (1.0 + distance). A distance of 0 from the optimal depth would therefore result in a score of 1, while a distance of 9 from the optimal depth would result in a score of 0.1 — which, by definition, would be considered a higher priority.

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets

import asyncio
from crawl4ai import *
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import PathDepthScorer

async def main():
    path_depth_scorer = PathDepthScorer(optimal_depth=0)

    config = CrawlerRunConfig(
            deep_crawl_strategy=BestFirstCrawlingStrategy(
                url_scorer=path_depth_scorer,
                max_depth=20
            ),
            stream=True,
        )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com/", config=config):
            if result.success:
                print(f"✅ Scraped {result.url}")


if __name__ == "__main__":
    asyncio.run(main())

OS

Windows

Python version

3.12.0

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Joorrit avatar May 06 '25 16:05 Joorrit

After a brief review of the code, it looks like this issue could also affect the other scorers. For example: scorers.py

_FRESHNESS_SCORES = [
   1.0,    # Current year
   0.9,    # Last year
   0.8,    # 2 years ago
   0.7,    # 3 years ago
   0.6,    # 4 years ago
   0.5,    # 5 years ago
]

Joorrit avatar May 07 '25 09:05 Joorrit

@Joorrit Thanks for catching this bug. Seem like you've nailed the problem right on its head. Awesome! and thanks for this clean PR. We'll review and merge it in the upcoming release.

aravindkarnam avatar May 12 '25 12:05 aravindkarnam

@Joorrit Thanks for catching this bug. Seem like you've nailed the problem right on its head. Awesome! and thanks for this clean PR. We'll review and merge it in the upcoming release.

I really like the project and I'm glad I could contribute something useful to it 😊. I've just updated the pull request to also include the other scorers, along with the corresponding unit tests and documentation. Let me know if there's anything else you'd like me to adjust!

Joorrit avatar May 12 '25 15:05 Joorrit

@Joorrit We are putting together a group in our discord server for community contributors. Mainly for networking and planning our roadmap as well as plan and execute cool features. But mostly for fun and banter. If you are already in our discord server, could you share with me your username, so I could add you to the group.

aravindkarnam avatar May 13 '25 14:05 aravindkarnam