[Bug]: Infinite Scroll Page isn't loading new content
crawl4ai version
0.4.3b3
Expected Behavior
The expected behavior is to scroll down to the bottom, wait a few seconds for the new content to load, and repeat until no new products are loaded.
Current Behavior
Hi, I'm using crawl4ai to crawl Nike's products. For example, this page requires users to keep scrolling until all the products are loaded.
I was able to simulate the behavior using Playwright. See code below.
But crawl4ai always seems to stop loading new content at some random point, and I'm not sure how to debug it. The behavior I see from my local runs is that it opens a new browser and keeps scrolling down. Sometimes, if I'm lucky, the page loads new products; other times no new content is loaded and it just stops at the bottom and exits.
Also, when I use Playwright to open a browser and go to this URL, I can manually scroll down to the bottom and new content loads. But if I use crawl4ai to open a browser and go to the same URL, no new content loads when I scroll down myself.
import asyncio
from playwright.async_api import async_playwright
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def infinite_scroll():
    url = "https://www.nike.com/w/kids-sale-shoes-3yaepzv4dhzy7ok"
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for_images=True,
        scan_full_page=True,  # scroll through the whole page before extracting
        scroll_delay=10,      # seconds to wait between scrolls
        verbose=True,
    )
    browser_config = BrowserConfig(headless=False)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=crawl_config)
        print(result)

async def infinite_scroll_playwright():
    max_scrolls = 50
    url = "https://www.nike.com/w/kids-sale-shoes-3yaepzv4dhzy7ok"
    init_stdout_logging()  # local logging helper
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url)
        last_height = await page.evaluate("document.body.scrollHeight")
        scroll_count = 0
        while scroll_count < max_scrolls:
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(2)  # allow time for new content to load
            new_height = await page.evaluate("document.body.scrollHeight")
            if new_height == last_height:
                break  # no new content loaded
            last_height = new_height
            scroll_count += 1
        content = await page.content()
        await page.close()
        await browser.close()
        return content
Above is the code I've been experimenting with. Let me know if you need more information, or if you have any suggestions on the configuration to get it working. Thanks in advance!
Is this reproducible?
Yes
Inputs Causing the Bug
- URL: https://www.nike.com/w/kids-sale-shoes-3yaepzv4dhzy7ok
- Settings: see above in "Current Behavior"
Steps to Reproduce
Code snippets
Included in "Current Behavior"
OS
macOS
Python version
3.12.8
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response
@KedaSong I'll check this today!
@aravindkarnam any followup?
@KedaSong Couldn't get into it. Had some health problems. I'll check this out today.
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract():
    schema = {
        "name": "Shoes",
        "baseSelector": ".product-card.product-grid__card.css-1tmiin5",
        "fields": [
            {"name": "NAME", "selector": "div.product-card__title", "type": "text"},
            {"name": "DETAIL", "selector": "div.product-card__subtitle", "type": "text"},
            {"name": "PRICE", "selector": "div.product-price__wrapper.css-9xqpgk", "type": "text"},
        ],
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            extraction_strategy=extraction_strategy,
            scan_full_page=True,
            scroll_delay=3.5,
        )
        result = await crawler.arun(
            url="https://www.nike.com/w/shoes-3yaepzv4dhzy7ok",
            config=config,
        )
        # Write the extracted products next to this script
        script_dir = os.path.dirname(os.path.abspath(__file__))
        output_file = os.path.join(script_dir, "auto_load_shoes.json")
        articles = json.loads(result.extracted_content)
        with open(output_file, "w", encoding="utf-8") as json_file:
            json.dump(articles, json_file, ensure_ascii=False, indent=4)

asyncio.run(extract())
I got the same issue.
@KedaSong I've tried your infinite_scroll() code, and every time it scrolls all the way to the bottom before giving me the results. I think the scroll_delay you set is a little too high (10 secs); you can set it to 2 secs (it's simply the time it waits between each scroll). But even then it scrolled the whole page for me.
Could you record your screen while it's scrolling, so I can see if there are any clues to what's causing this in your specific case?
@aravindkarnam Hey, apologies for the late reply. I recorded two videos: in the first one I scroll down manually and new products load (you can see a spinner every time I reach the bottom, followed by new products). In the second I use crawl4ai's scan_full_page flag, but when it reaches the bottom there's no spinner and no new content at all.
I'm using this link for testing (not the same as in the code above, as this one has more pages).
https://github.com/user-attachments/assets/2e86788e-5f31-492d-a463-8a49e2fd51e6
https://github.com/user-attachments/assets/3afeef52-f5d7-4f9b-8916-ddea0d31f539
@KedaSong Alright! I figured out the issue. This is actually a very interesting problem. Let me unpack scan_full_page and how it works:
1. Get the viewport height.
2. Scroll to the bottom of the page.
3. Get the total height of the page.
4. Scroll back to the top of the page.
5. Start scrolling towards the bottom of the page again.
6. Continue scrolling until the bottom of the page is reached.
In the case of this Nike product listing page, what's happening is that the infinite scroll displays a loader icon while resources are being fetched (pending network calls). When this is slow for some scrolls, scan_full_page scrolls all the way to the end of the footer, concludes that it has scrolled the full page (refer to step 6), and proceeds to close the browser and scrape whatever content it has fetched so far.
The way I was able to get the program to wait until the infinite scroll is truly exhausted is with the help of wait_for. I asked it to wait until the loader (a div with the class name loader-bar) disappears from the DOM.
Try the code below and see how I implemented the wait condition. When the infinite scroll is truly exhausted, the loader div disappears from the page and I get the full product listing. (With this approach you can set the scroll delay even a bit lower, like 1 sec; it will keep scrolling until the loader goes away.)
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig

async def infinite_scroll():
    url = "https://www.nike.com/w/big-kids-clothing-6ymx6zagibjzv4dh"
    wait_condition = """() => {
        return !document.querySelector('.loader-bar');
    }"""
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for_images=True,
        scan_full_page=True,
        scroll_delay=1,
        verbose=True,
        wait_for=f"js:{wait_condition}",
    )
    browser_config = BrowserConfig(headless=False)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=crawl_config)
        print(result.markdown.raw_markdown)

asyncio.run(infinite_scroll())
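As an aside, the same idea can be retrofitted onto the plain Playwright loop from earlier in the thread. This is only a rough sketch, not crawl4ai code: the function name is made up, and the .loader-bar selector is the one observed above, which may change if Nike updates its markup.

import asyncio
from playwright.async_api import async_playwright

async def infinite_scroll_playwright_with_loader_wait():
    url = "https://www.nike.com/w/big-kids-clothing-6ymx6zagibjzv4dh"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(url)
        last_height = await page.evaluate("document.body.scrollHeight")
        while True:
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            # Give the spinner a moment to appear, then wait for it to leave the
            # DOM (resolves immediately if it never shows up; raises if it is
            # still there after 15 seconds).
            await asyncio.sleep(0.5)
            await page.wait_for_selector(".loader-bar", state="detached", timeout=15000)
            new_height = await page.evaluate("document.body.scrollHeight")
            if new_height == last_height:
                break  # height stopped growing: the infinite scroll is exhausted
            last_height = new_height
        content = await page.content()
        await browser.close()
        return content

The key difference from the original loop is that the height comparison only happens after the spinner is gone, so a slow network round trip can no longer be mistaken for "no more content".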
I tested the crawl4ai snippet above a couple of times to ensure that the scraping waits until the scroll is exhausted, so I'm closing the issue.
Recently @unclecode and I discussed this issue, and the topic of a stochastic approach (instead of a hard-coded wait_for condition) to determine whether the page is fully loaded, based on pending network calls etc., came up, but implementing it is not as straightforward. We'll take this on later.
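For anyone curious what that could look like, here is a rough sketch of a network-quiet heuristic built on Playwright's request events. None of this is crawl4ai API; the function name and thresholds are made up for illustration.

import asyncio
from playwright.async_api import Page

async def wait_for_network_quiet(page: Page, quiet_ms: int = 1500, timeout_s: float = 30.0) -> bool:
    """Resolve once no request has started, finished, or failed for quiet_ms milliseconds."""
    loop = asyncio.get_running_loop()
    last_activity = loop.time()

    def bump(_request):
        # Any request activity resets the quiet timer
        nonlocal last_activity
        last_activity = loop.time()

    page.on("request", bump)
    page.on("requestfinished", bump)
    page.on("requestfailed", bump)
    try:
        deadline = loop.time() + timeout_s
        while loop.time() < deadline:
            if (loop.time() - last_activity) * 1000 >= quiet_ms:
                return True  # the network has been quiet long enough
            await asyncio.sleep(0.2)
        return False  # timed out; the caller decides what to do
    finally:
        page.remove_listener("request", bump)
        page.remove_listener("requestfinished", bump)
        page.remove_listener("requestfailed", bump)

A scan_full_page-style loop could call something like this between scrolls instead of relying on a fixed scroll_delay or a page-specific selector.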
@aravindkarnam thank you very much!
This was a very interesting issue. We will keep it in mind for future reference. @aravindkarnam well done! Great explanation!