[Bug]: wrong relative links crawl in deep crawl

Open inVains opened this issue 11 months ago • 3 comments

crawl4ai version

Crawl4AI v0.5.0.post1

Expected Behavior

In a deep crawl (stream=True), a relative link like the one below should be joined as https://docs.crawl4ai.com/core/docker-deployment/:

base_url: https://docs.crawl4ai.com/core/quickstart/
href: ../docker-deployment

Current Behavior

Currently it becomes: https://docs.crawl4ai.com/docker-deployment/

Is this reproducible?

Yes

Inputs Causing the Bug

https://docs.crawl4ai.com/

Steps to Reproduce


Code snippets

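import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def main():
    async with AsyncWebCrawler() as crawler:
        # With stream=True, arun() returns an async generator of results.
        async for result in await crawler.arun(
                config=CrawlerRunConfig(
                    deep_crawl_strategy=BFSDeepCrawlStrategy(
                        max_depth=1,
                        include_external=False
                    ),
                    stream=True,
                ),
                url="https://docs.crawl4ai.com/",
        ):
            pass


asyncio.run(main())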

OS

Windows

Python version

3.12

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

Caused by this line: https://github.com/unclecode/crawl4ai/blob/e1b3bfe6fb844297abf89e90824ab9a7725071f7/crawl4ai/utils.py#L2005

I suggest checking the normalization criteria.

inVains avatar Mar 16 '25 17:03 inVains

@inVains I'm unable to reproduce this issue. I tried with 0.5.0post4. All the URLs were correctly normalised as expected.

Image

Did you mean that somewhere in the result (markdown, HTML, etc.) the relative links are not correctly normalised? If that is not the case, could you share some logs?

aravindkarnam avatar Mar 17 '25 13:03 aravindkarnam

@aravindkarnam Oh, yeah. You need to start from https://docs.crawl4ai.com/core/quickstart with max_depth=1, or from https://docs.crawl4ai.com/ with max_depth=2. Try this:

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def main():
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(
                config=CrawlerRunConfig(
                    deep_crawl_strategy=BFSDeepCrawlStrategy(
                        max_depth=1,
                        include_external=False
                    ),
                    stream=True,
                ),
                url="https://docs.crawl4ai.com/core/quickstart",
        ):
            if result.status_code != 200:
                print(f"Fail!:{result.status_code}")
            else:
                # Print where each crawled URL came from, to spot bad joins.
                print(f"URL: {result.url}")
                print(f"Depth: {result.metadata.get('depth', 0)}")
                print(f"parent_url: {result.metadata.get('parent_url')}")
                print(f"redirect: {result.redirected_url}")


asyncio.run(main())

It will fetch URLs like https://docs.crawl4ai.com/content-selection, which should be https://docs.crawl4ai.com/core/content-selection.

Image

inVains avatar Mar 17 '25 13:03 inVains

Root Cause Analysis: URL Path Resolution Issue

Issue

Relative URL paths with "../" were incorrectly resolving when used with certain base URLs. Specifically, when navigating one level up from a base URL like "https://docs.crawl4ai.com/advanced/advanced-features/", a relative path of "../file-downloading" was incorrectly resolving to "https://docs.crawl4ai.com/file-downloading" instead of the expected "https://docs.crawl4ai.com/advanced/file-downloading".

Root Cause

The issue stemmed from inconsistent trailing slashes in base URLs. When a base URL lacks a trailing slash, URL resolvers (like urllib.parse.urljoin in Python) treat the final segment as a file rather than a directory:

  1. With trailing slash: "https://docs.crawl4ai.com/advanced/advanced-features/"

    • "../file-downloading" resolves correctly to "https://docs.crawl4ai.com/advanced/file-downloading"
    • Path interpreted as: [domain]/[advanced]/[advanced-features]/
  2. Without trailing slash: "https://docs.crawl4ai.com/advanced/advanced-features"

    • "../file-downloading" resolves incorrectly to "https://docs.crawl4ai.com/file-downloading"
    • Path interpreted as: [domain]/[advanced]/[advanced-features (as a file)]
    • The resolver first drops "advanced-features" (treated as a file), then "../" removes "advanced", so the result lands at the site root
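
The two cases above can be checked directly with the standard library. This is only a minimal sketch of how urllib.parse.urljoin treats the two base URLs; it does not call any crawl4ai code.

```python
from urllib.parse import urljoin

href = "../file-downloading"

# Base ends with "/": "advanced-features" is treated as a directory.
with_slash = urljoin(
    "https://docs.crawl4ai.com/advanced/advanced-features/", href
)

# Base has no trailing "/": "advanced-features" is treated as a file,
# so the base directory is "/advanced/" and "../" climbs to the root.
without_slash = urljoin(
    "https://docs.crawl4ai.com/advanced/advanced-features", href
)

print(with_slash)     # https://docs.crawl4ai.com/advanced/file-downloading
print(without_slash)  # https://docs.crawl4ai.com/file-downloading
```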

Solution

Modified the normalize_url function to ensure base URLs always have a trailing slash when they represent directories:

def normalize_url(href, base_url):
    """Normalize URLs to ensure consistent format"""
    from urllib.parse import urljoin, urlparse

    # Parse base URL to get components
    parsed_base = urlparse(base_url)
    if not parsed_base.scheme or not parsed_base.netloc:
        raise ValueError(f"Invalid base URL format: {base_url}")

    # Ensure base_url ends with a trailing slash if it's a directory path
    if not base_url.endswith('/'):
        base_url = base_url + '/'

    # Use urljoin to handle all cases
    normalized = urljoin(base_url, href.strip())
    return normalized

This solution ensures that relative paths are resolved correctly regardless of how the base URL was originally formatted.

Lessons Learned

  1. URL path resolution behavior depends significantly on the presence of trailing slashes
  2. When working with relative URLs, always ensure that directory paths end with a trailing slash
  3. Test URL resolution with various path combinations to catch edge cases
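
As a rough illustration of the third point, a table of (base, href, expected) cases can be run against urllib.parse.urljoin. The example.com and other.example URLs and the CASES name are placeholders invented for this sketch, not taken from the crawl4ai test suite.

```python
from urllib.parse import urljoin

# (base_url, href, expected) triples; all URLs are placeholders.
CASES = [
    ("https://example.com/a/b/", "c", "https://example.com/a/b/c"),
    ("https://example.com/a/b", "c", "https://example.com/a/c"),
    ("https://example.com/a/b/", "./c", "https://example.com/a/b/c"),
    ("https://example.com/a/b/", "../c", "https://example.com/a/c"),
    ("https://example.com/a/b/", "../../c", "https://example.com/c"),
    ("https://example.com/a/b/", "/c", "https://example.com/c"),
    ("https://example.com/a/b/", "https://other.example/x",
     "https://other.example/x"),
]

for base, href, expected in CASES:
    actual = urljoin(base, href)
    assert actual == expected, f"{base} + {href}: got {actual}"

print(f"{len(CASES)} cases passed")
```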

aravindkarnam avatar Mar 21 '25 11:03 aravindkarnam

hello @aravindkarnam,

I think there is still an issue in the current code. When joining URLs that use a relative path not starting with /, the result may be incorrect.

For example, the page https://www.hko.gov.hk/en/index.html contains the element <a href="bookmark.html">. The code joins them into https://www.hko.gov.hk/en/index.html/bookmark.html, which is not correct.

Adding base_url = base_url.rsplit('/',1)[0] before the if not base_url.endswith('/'): block solves the problem in this case, but I don't know whether it changes the behavior elsewhere.
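
To explore that open question, here is a small sketch comparing the two variants on the URLs from this thread. The function names join_forced_slash and join_rsplit are hypothetical; they only mirror the snippet posted above and the proposed extra line, and are not crawl4ai's actual code.

```python
from urllib.parse import urljoin


def join_forced_slash(base_url: str, href: str) -> str:
    """Mirrors the posted fix: always append '/' before joining."""
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, href.strip())


def join_rsplit(base_url: str, href: str) -> str:
    """Hypothetical variant with the proposed rsplit line added first."""
    base_url = base_url.rsplit("/", 1)[0]
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, href.strip())


page = "https://www.hko.gov.hk/en/index.html"
print(join_forced_slash(page, "bookmark.html"))  # .../en/index.html/bookmark.html
print(join_rsplit(page, "bookmark.html"))        # .../en/bookmark.html
print(urljoin(page, "bookmark.html"))            # .../en/bookmark.html

# Side effect 1: the extension-less docs URL resolves to the site root again.
docs = "https://docs.crawl4ai.com/advanced/advanced-features"
print(join_rsplit(docs, "../file-downloading"))

# Side effect 2: a bare host loses its netloc after the rsplit.
print("https://docs.crawl4ai.com".rsplit("/", 1)[0])  # https:/
```

On these inputs the rsplit variant matches plain urljoin for the index.html page, but it also brings back the root-level result for the extension-less docs URL and mangles a base URL that has no path. Whether a final segment is a file or a directory cannot be decided from the string alone, so any fix along these lines would need some heuristic or a look at the final (redirected) URL.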

CH-Tam avatar May 09 '25 02:05 CH-Tam