[Bug]: Relative links are resolved incorrectly in deep crawl
crawl4ai version
Crawl4AI v0.5.0.post1
Expected Behavior
In a deep crawl (stream=True), a relative link like the one below should be joined against the page URL:
base_url: https://docs.crawl4ai.com/core/quickstart/
href: ../docker-deployment
expected: https://docs.crawl4ai.com/core/docker-deployment/
Current Behavior
Instead, the link is resolved to: https://docs.crawl4ai.com/docker-deployment/
Is this reproducible?
Yes
Inputs Causing the Bug
https://docs.crawl4ai.com/
Steps to Reproduce
Code snippets
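import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def main():
    async with AsyncWebCrawler() as crawler:
        # In stream mode, arun() is awaited and then iterated asynchronously
        async for result in await crawler.arun(
            config=CrawlerRunConfig(
                deep_crawl_strategy=BFSDeepCrawlStrategy(
                    max_depth=1,
                    include_external=False
                ),
                stream=True,
            ),
            url="https://docs.crawl4ai.com/",
        ):
            pass


asyncio.run(main())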
OS
Windows
Python version
3.12
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
Caused by this line: https://github.com/unclecode/crawl4ai/blob/e1b3bfe6fb844297abf89e90824ab9a7725071f7/crawl4ai/utils.py#L2005
I suggest checking the normalization criteria.
@inVains I'm unable to reproduce this issue. I tried with 0.5.0.post4, and all the URLs were normalised as expected.
Did you mean that relative links somewhere in the result (markdown, html, etc.) are not normalised correctly? If that is not the case, could you share some logs?
@aravindkarnam
Oh, right. You need to start from https://docs.crawl4ai.com/core/quickstart with max_depth=1,
or from https://docs.crawl4ai.com/ with max_depth=2.
Try this:
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def main():
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(
            config=CrawlerRunConfig(
                deep_crawl_strategy=BFSDeepCrawlStrategy(
                    max_depth=1,
                    include_external=False
                ),
                stream=True,
            ),
            url="https://docs.crawl4ai.com/core/quickstart",
        ):
            if result.status_code != 200:
                print(f"Fail!:{result.status_code}")
            else:
                # Print where each page came from to spot wrongly joined links
                print(f"URL: {result.url}")
                print(f"Depth: {result.metadata.get('depth', 0)}")
                print(f"parent_url: {result.metadata.get('parent_url')}")
                print(f"redirect: {result.redirected_url}")


asyncio.run(main())
It will fetch URLs like https://docs.crawl4ai.com/content-selection, which should be https://docs.crawl4ai.com/core/content-selection.
Root Cause Analysis: URL Path Resolution Issue
Issue
Relative URL paths containing "../" were resolved incorrectly against certain base URLs. Specifically, when navigating one level up from a base URL without a trailing slash, such as "https://docs.crawl4ai.com/advanced/advanced-features", the relative path "../file-downloading" resolved to "https://docs.crawl4ai.com/file-downloading" instead of the expected "https://docs.crawl4ai.com/advanced/file-downloading".
Root Cause
The issue stemmed from inconsistent trailing slashes in base URLs. When a base URL lacks a trailing slash, URL resolvers (like urllib.parse.urljoin in Python) treat the final segment as a file rather than a directory:
- With trailing slash: "https://docs.crawl4ai.com/advanced/advanced-features/"
  - "../file-downloading" resolves correctly to "https://docs.crawl4ai.com/advanced/file-downloading"
  - Path interpreted as: [domain]/[advanced]/[advanced-features]/
- Without trailing slash: "https://docs.crawl4ai.com/advanced/advanced-features"
  - "../file-downloading" resolves incorrectly to "https://docs.crawl4ai.com/file-downloading"
  - Path interpreted as: [domain]/[advanced]/[advanced-features (as a file)]
  - The file segment "advanced-features" is dropped first, so the base directory is "advanced", and ".." then climbs out of "advanced" as well
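The two cases above can be checked directly with the standard library. This is a minimal sketch that only calls urllib.parse.urljoin on the URLs quoted in this analysis; the variable names are illustrative.

```python
from urllib.parse import urljoin

href = "../file-downloading"

# Base URL with a trailing slash: the last segment is treated as a directory
with_slash = urljoin(
    "https://docs.crawl4ai.com/advanced/advanced-features/", href
)

# Base URL without a trailing slash: the last segment is treated as a file
without_slash = urljoin(
    "https://docs.crawl4ai.com/advanced/advanced-features", href
)

print(with_slash)     # https://docs.crawl4ai.com/advanced/file-downloading
print(without_slash)  # https://docs.crawl4ai.com/file-downloading
```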
Solution
Modified the normalize_url function to ensure base URLs always have a trailing slash when they represent directories:
def normalize_url(href, base_url):
    """Normalize URLs to ensure consistent format"""
    from urllib.parse import urljoin, urlparse

    # Parse base URL to get components
    parsed_base = urlparse(base_url)
    if not parsed_base.scheme or not parsed_base.netloc:
        raise ValueError(f"Invalid base URL format: {base_url}")

    # Append a trailing slash whenever it is missing, so the last path
    # segment is always treated as a directory
    if not base_url.endswith('/'):
        base_url = base_url + '/'

    # Use urljoin to handle all cases
    normalized = urljoin(base_url, href.strip())
    return normalized
This solution ensures that relative paths are resolved correctly regardless of how the base URL was originally formatted.
Lessons Learned
- URL path resolution behavior depends significantly on the presence of trailing slashes
- When working with relative URLs, always ensure that directory paths end with a trailing slash
- Test URL resolution with various path combinations to catch edge cases
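As a rough illustration of the last point, a table of cases can be run through urljoin and compared with the expected result. The example.com URLs, the CASES table, and the check_cases helper are hypothetical and not part of crawl4ai; the expected values follow standard urljoin behavior.

```python
from urllib.parse import urljoin

# Hypothetical cases: (base_url, href, expected result)
CASES = [
    ("https://example.com/a/b/", "c", "https://example.com/a/b/c"),
    ("https://example.com/a/b/", "./c", "https://example.com/a/b/c"),
    ("https://example.com/a/b/", "../c", "https://example.com/a/c"),
    ("https://example.com/a/b/", "/c", "https://example.com/c"),
    ("https://example.com/a/b", "c", "https://example.com/a/c"),
    ("https://example.com/a/b", "../c", "https://example.com/c"),
    ("https://example.com/a/b/", "https://other.example/x", "https://other.example/x"),
]


def check_cases(cases):
    """Return (base, href, actual, expected) for every case that does not match."""
    failures = []
    for base, href, expected in cases:
        actual = urljoin(base, href)
        if actual != expected:
            failures.append((base, href, actual, expected))
    return failures


failures = check_cases(CASES)
print(failures)  # an empty list means every case resolved as expected
```

The same table could be pointed at a project's own normalization function instead of urljoin to catch regressions.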
Hello @aravindkarnam,
I think there is still an issue in the current code. When joining a URL with a relative path that does not start with /, the result may be incorrect.
For example, the page https://www.hko.gov.hk/en/index.html contains an anchor with href="bookmark.html", and the code joins them into https://www.hko.gov.hk/en/index.html/bookmark.html, which is not correct.
Adding base_url = base_url.rsplit('/',1)[0] before the if not base_url.endswith('/'): block solves the problem in this case, but I don't know whether it changes the logic in other places.
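To make the trade-off concrete, here is a small sketch comparing the two variants on the URLs mentioned in this thread. The function names normalize_always_slash and normalize_rsplit are hypothetical; the first mirrors the slash-appending step of the fix posted above, the second adds the proposed rsplit line. Neither is the library's actual code.

```python
from urllib.parse import urljoin


def normalize_always_slash(href, base_url):
    # Mirrors the posted fix: append a trailing slash whenever it is missing
    if not base_url.endswith('/'):
        base_url = base_url + '/'
    return urljoin(base_url, href.strip())


def normalize_rsplit(href, base_url):
    # Proposed change: drop everything after the last '/' first
    base_url = base_url.rsplit('/', 1)[0]
    if not base_url.endswith('/'):
        base_url = base_url + '/'
    return urljoin(base_url, href.strip())


file_base = "https://www.hko.gov.hk/en/index.html"
dir_base = "https://docs.crawl4ai.com/advanced/advanced-features"

file_always = normalize_always_slash("bookmark.html", file_base)
file_rsplit = normalize_rsplit("bookmark.html", file_base)
dir_always = normalize_always_slash("../file-downloading", dir_base)
dir_rsplit = normalize_rsplit("../file-downloading", dir_base)

print(file_always)  # https://www.hko.gov.hk/en/index.html/bookmark.html
print(file_rsplit)  # https://www.hko.gov.hk/en/bookmark.html
print(dir_always)   # https://docs.crawl4ai.com/advanced/file-downloading
print(dir_rsplit)   # https://docs.crawl4ai.com/file-downloading
```

In this sketch the rsplit variant fixes the index.html case but brings back the original "../file-downloading" result for a directory-style URL without a trailing slash, so the two cases seem to need a way to tell file-like segments from directory-like ones.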