[Bug]: Wrong scoring with depth scorer
crawl4ai version
crawl4ai-0.6.2
Expected Behavior
When using the Best First Crawling Strategy together with the path depth scorer, while setting the optimal_depth to 0, I expect the crawler to first crawl the URLs with the shortest path. This would make sense because they are closer to the optimal depth.
Current Behavior
It does the exact opposite: it crawls the URLs with the greatest path depth first.
bff_strategy.py states the following: "Lower scores are treated as higher priority."
The PathDepthScorer calculates the score like this: 1.0 / (1.0 + distance).
A distance of 0 from the optimal depth would therefore result in a score of 1, while a distance of 9 from the optimal depth would result in a score of 0.1, which, by the rule quoted above, would be treated as the higher priority.
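The arithmetic above can be checked with a small standalone sketch. It does not import crawl4ai: `path_depth_score` is a hypothetical helper that only reproduces the quoted formula (taking the distance as the absolute difference from the optimal depth is an assumption), and the min-heap stands in for a queue in which lower scores are treated as higher priority.

```python
import heapq


def path_depth_score(url_depth: int, optimal_depth: int = 0) -> float:
    """Hypothetical helper mirroring the quoted formula 1.0 / (1.0 + distance)."""
    # Assumption: distance is the absolute difference from the optimal depth.
    distance = abs(url_depth - optimal_depth)
    return 1.0 / (1.0 + distance)


def crawl_order(depths, optimal_depth: int = 0):
    """Return depths in the order a min-heap keyed by score would pop them."""
    heap = [(path_depth_score(d, optimal_depth), d) for d in depths]
    heapq.heapify(heap)
    # The lowest score is popped first, i.e. "lower score = higher priority".
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    print(path_depth_score(0))        # 1.0
    print(path_depth_score(9))        # 0.1
    print(crawl_order([0, 1, 3, 9]))  # [9, 3, 1, 0]
```

With `optimal_depth=0`, the deepest URL is popped first, which matches the behavior described in this report.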
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import PathDepthScorer


async def main():
    # Score URLs by how close their path depth is to 0.
    path_depth_scorer = PathDepthScorer(optimal_depth=0)
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            url_scorer=path_depth_scorer,
            max_depth=20,
        ),
        stream=True,
    )
    async with AsyncWebCrawler() as crawler:
        # In streaming mode, results are yielded in the order they are crawled.
        async for result in await crawler.arun("https://docs.crawl4ai.com/", config=config):
            if result.success:
                print(f"✅ Scraped {result.url}")


if __name__ == "__main__":
    asyncio.run(main())
OS
Windows
Python version
3.12.0
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response
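After a brief review of the code, it looks like this issue could also affect the other scorers. For example, scorers.py gives the most recent content the highest freshness score, which the same rule would treat as the lowest priority: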
_FRESHNESS_SCORES = [
    1.0,  # Current year
    0.9,  # Last year
    0.8,  # 2 years ago
    0.7,  # 3 years ago
    0.6,  # 4 years ago
    0.5,  # 5 years ago
]
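The effect on the freshness table can be illustrated the same way. The sketch below is standalone and hypothetical: `freshness_score` is an illustrative lookup over the values listed above (clamping older years to the last entry is an assumption), and negating the score before pushing onto a min-heap is just one possible way to make higher scores come out first, not necessarily what the library or the pull request does.

```python
import heapq

# Values copied from the table above; index = age of the content in years.
_FRESHNESS_SCORES = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]


def freshness_score(years_ago: int) -> float:
    """Hypothetical lookup; expects a non-negative age in years."""
    # Assumption: anything older than the table is clamped to the last entry.
    index = min(years_ago, len(_FRESHNESS_SCORES) - 1)
    return _FRESHNESS_SCORES[index]


def pop_order(ages, negate: bool):
    """Return ages in min-heap pop order, optionally negating the score."""
    sign = -1.0 if negate else 1.0
    heap = [(sign * freshness_score(a), a) for a in ages]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    print(pop_order([0, 2, 5], negate=False))  # [5, 2, 0]
    print(pop_order([0, 2, 5], negate=True))   # [0, 2, 5]
```

Without negation the oldest content is popped first. With negation the current-year content comes first, which is the ordering one would expect from a freshness scorer.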
@Joorrit Thanks for catching this bug. Seems like you've hit the nail right on the head. Awesome! And thanks for this clean PR. We'll review and merge it in the upcoming release.
I really like the project and I'm glad I could contribute something useful to it 😊. I've just updated the pull request to also include the other scorers, along with the corresponding unit tests and documentation. Let me know if there's anything else you'd like me to adjust!
@Joorrit We are putting together a group in our Discord server for community contributors, mainly for networking, planning our roadmap, and planning and building cool features, but mostly for fun and banter. If you are already in our Discord server, could you share your username with me so I can add you to the group?