
[Bug]: Infinite Scroll Page isn't loading new content

Open KedaSong opened this issue 11 months ago • 4 comments

crawl4ai version

0.4.3b3

Expected Behavior

The expected behavior: scroll to the bottom of the page, wait a few seconds for new content to load, and repeat until no new products are loaded.

Current Behavior

Hi, I'm using crawl4ai to crawl Nike's products. For example, this page requires users to keep scrolling until all the products are loaded.

I was able to simulate the behavior using Playwright. See code below.

But crawl4ai seems to stop loading new content at some random point that I'm not sure how to debug. In my local runs it opens a new browser and keeps scrolling down. Sometimes, if I'm lucky, the page loads new products; other times no new content loads and the crawl just stops at the bottom and exits.

Also, when I open a browser with Playwright and go to this URL, I can manually scroll to the bottom and new content loads. But if I open a browser with crawl4ai and go to the same URL, no new content loads even when I scroll down manually.

import asyncio

from playwright.async_api import async_playwright
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig


async def infinite_scroll():
    url = "https://www.nike.com/w/kids-sale-shoes-3yaepzv4dhzy7ok"
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for_images=True,
        scan_full_page=True,
        scroll_delay=10,
        verbose=True,
    )
    browser_config = BrowserConfig(headless=False)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=crawl_config)
        print(result)


async def infinite_scroll_playwright():
    max_scrolls = 50
    url = "https://www.nike.com/w/kids-sale-shoes-3yaepzv4dhzy7ok"
    init_stdout_logging()  # author's local logging helper, not part of crawl4ai or Playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url)

        last_height = await page.evaluate("document.body.scrollHeight")
        scroll_count = 0

        while scroll_count < max_scrolls:
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(2)  # Allow time for new content to load
            new_height = await page.evaluate("document.body.scrollHeight")

            if new_height == last_height:
                break  # No new content loaded

            last_height = new_height
            scroll_count += 1

        content = await page.content()
        await page.close()
        await browser.close()
        return content

Above is the code I've been experimenting with. Let me know if you need more information, or if you can suggest a configuration that gets this working. Thanks in advance!

Is this reproducible?

Yes

Inputs Causing the Bug

- URL: https://www.nike.com/w/kids-sale-shoes-3yaepzv4dhzy7ok
- Settings: see "Current Behavior" above

Steps to Reproduce


Code snippets

Included in "Current Behavior"

OS

macOS

Python version

3.12.8

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

KedaSong avatar Feb 04 '25 18:02 KedaSong

@KedaSong I'll check this today!

aravindkarnam avatar Feb 05 '25 07:02 aravindkarnam

@aravindkarnam any followup?

KedaSong avatar Feb 06 '25 19:02 KedaSong

@KedaSong Couldn't get into it. Had some health problems. I'll check this out today.

aravindkarnam avatar Feb 07 '25 03:02 aravindkarnam

import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy


async def extract():
    schema = {
        "name": "Shoes",
        "baseSelector": ".product-card.product-grid__card.css-1tmiin5",
        "fields": [
            {"name": "NAME", "selector": "div.product-card__title", "type": "text"},
            {"name": "DETAIL", "selector": "div.product-card__subtitle", "type": "text"},
            {"name": "PRICE", "selector": "div.product-price__wrapper.css-9xqpgk", "type": "text"},
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            extraction_strategy=extraction_strategy,
            scan_full_page=True,
            scroll_delay=3.5,
        )
        result = await crawler.arun(
            url="https://www.nike.com/w/shoes-3yaepzv4dhzy7ok",
            config=config
        )

        script_dir = os.path.dirname(os.path.abspath(__file__))
        output_file = os.path.join(script_dir, "auto_load_shoes.json")

        articles = json.loads(result.extracted_content)
        with open(output_file, "w", encoding="utf-8") as json_file:
            json.dump(articles, json_file, ensure_ascii=False, indent=4)


asyncio.run(extract())

I got the same issue.

Hnam29 avatar Feb 11 '25 17:02 Hnam29

@KedaSong I've tried your infinite_scroll() code, and every time it scrolled all the way to the bottom before giving me the results. I think the scroll_delay you set is a little too high (10 secs); you can set it to 2 secs, as in the config below (it's basically just the time it waits between each scroll). But even then it scrolled the whole page for me.
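
For reference, the only change being suggested here is the scroll_delay value; everything else stays as in the original config posted above:

crawl_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    wait_for_images=True,
    scan_full_page=True,
    scroll_delay=2,  # seconds between scroll steps; 10s mostly just slows the run down
    verbose=True,
)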

Could you record the screen for me when it's scrolling, so I can see if there are any clues that are causing this to happen in your specific case.

aravindkarnam avatar Feb 14 '25 13:02 aravindkarnam

@aravindkarnam Hey, apologies for the late reply. I recorded 2 videos: in the first one I scroll down manually and new products load (you can see a spinner every time I reach the bottom, followed by new products). In the second I use crawl4ai's scan_full_page flag, but when it reaches the bottom there's no spinner and no new content at all.

I'm using this link for testing (not the same as in the code above, since this one has more pages).

https://github.com/user-attachments/assets/2e86788e-5f31-492d-a463-8a49e2fd51e6 https://github.com/user-attachments/assets/3afeef52-f5d7-4f9b-8916-ddea0d31f539

KedaSong avatar Mar 17 '25 18:03 KedaSong

@KedaSong Alright! I figured out the issue. This is actually a very interesting problem. Let me unpack scan_full_page and how it works:

  1. Get the viewport height.
  2. Scroll to the bottom of the page.
  3. Get the total height of the page.
  4. Scroll back to the top of the page.
  5. Start scrolling to the bottom of the page again.
  6. Continue scrolling until the bottom of the page is reached.

In the case of this Nike product listing page, what's happening is that the infinite scroll displays a loader icon while resources are being fetched (pending network calls). When that fetch is slow for some scrolls, scan_full_page scrolls all the way to the end of the footer, concludes that it has scrolled the full page (refer to step 6), and proceeds to close the browser and scrape whatever content it has fetched so far.
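
To make the race concrete, here is a rough Playwright-only sketch (this is not crawl4ai's internal code, and show_the_race is just an illustrative name) of the same height-based check: when the products XHR is slower than the fixed delay, the height comes back unchanged even though the .loader-bar spinner is still in the DOM, which is exactly the point where a plain height check gives up too early.

import asyncio
from playwright.async_api import async_playwright


async def show_the_race():
    url = "https://www.nike.com/w/kids-sale-shoes-3yaepzv4dhzy7ok"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url)
        last_height = await page.evaluate("document.body.scrollHeight")
        for _ in range(50):  # hard cap so a stuck loader can't spin forever
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(2)  # fixed delay, as in the height-based approach
            new_height = await page.evaluate("document.body.scrollHeight")
            loader_still_there = await page.evaluate("!!document.querySelector('.loader-bar')")
            if new_height == last_height:
                if loader_still_there:
                    # Height says "done" but products are still loading:
                    # a pure height-based stop would exit right here.
                    continue
                break  # no loader and no growth: the listing is exhausted
            last_height = new_height
        await browser.close()


asyncio.run(show_the_race())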

The way I got the program to wait until the infinite scroll is truly exhausted is with the help of wait_for. I asked it to wait until the loader (a div with class name loader-bar) disappears from the DOM.

(Screenshot: the loader-bar element in the page's DOM)

Try the code below and note how the wait condition is implemented. When the infinite scroll is truly exhausted, the loader div disappears from the page and I got the full product listing. (With this approach you can set the scroll delay even a bit lower, like 1 sec; it will keep scrolling until the loader goes away.)

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig


async def infinite_scroll():
    url = "https://www.nike.com/w/big-kids-clothing-6ymx6zagibjzv4dh"
    wait_condition = """() => {
        return !document.querySelector('.loader-bar');
        }"""
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for_images=True,
        scan_full_page=True,
        scroll_delay=1,
        verbose=True,
        wait_for=f"js:{wait_condition}",
    )
    browser_config = BrowserConfig(headless=False)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=crawl_config)
        print(result.markdown.raw_markdown)

asyncio.run(infinite_scroll())

I tested this a couple of times to ensure that scraping waits until the scroll is exhausted, so I'm closing the issue.

Recently @unclecode and I discussed this issue, and the topic of a stochastic approach (instead of a hard-coded wait_for condition) to determine whether the page is fully loaded, based on pending network calls etc., came up. Implementing it is not as straightforward, however, so we'll take it on later.
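
In the meantime, anyone who wants to experiment with that idea can approximate it outside crawl4ai. The sketch below is plain Playwright (not crawl4ai code; wait_for_network_quiet is a made-up helper name and the thresholds are arbitrary): it treats the page as settled once no request has started, finished, or failed for a short quiet window, and stops scrolling when a quiet network coincides with no further growth in page height.

import asyncio
from playwright.async_api import async_playwright, Page


async def wait_for_network_quiet(page: Page, quiet_ms: int = 1500, timeout_s: float = 30.0) -> None:
    # Consider the page "settled" once no request has started, finished, or
    # failed for quiet_ms milliseconds (or give up after timeout_s).
    loop = asyncio.get_running_loop()
    last_activity = loop.time()

    def bump(_request):
        nonlocal last_activity
        last_activity = loop.time()

    for event in ("request", "requestfinished", "requestfailed"):
        page.on(event, bump)
    try:
        deadline = loop.time() + timeout_s
        while loop.time() < deadline:
            if (loop.time() - last_activity) * 1000 >= quiet_ms:
                return
            await asyncio.sleep(0.2)
    finally:
        for event in ("request", "requestfinished", "requestfailed"):
            page.remove_listener(event, bump)


async def scroll_until_exhausted(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url)
        last_height = await page.evaluate("document.body.scrollHeight")
        for _ in range(100):  # hard cap so a stuck loader can't spin forever
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await wait_for_network_quiet(page)
            new_height = await page.evaluate("document.body.scrollHeight")
            if new_height == last_height:
                break  # quiet network and no growth: infinite scroll is exhausted
            last_height = new_height
        html = await page.content()
        await browser.close()
        return html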

aravindkarnam avatar Mar 18 '25 05:03 aravindkarnam

@aravindkarnam thank you very much!

KedaSong avatar Mar 18 '25 16:03 KedaSong

This was a very interesting issue. We will keep it in mind for future reference. @aravindkarnam well done! Great explanation!

prokopis3 avatar Mar 18 '25 22:03 prokopis3