
How to use crawler.crawl for full-page scrolling?

helenatthais opened this issue 1 year ago • 5 comments

Although the full-page scrolling feature was released in version 0.4.1, I'm struggling to make it work because I'm still not sure where to call the crawler.crawl function. The docs (https://crawl4ai.com/mkdocs/blog/releases/0.4.1/) give the following example:

await crawler.crawl(
    url="https://example.com",
    scan_full_page=True,  # Enables scrolling
    scroll_delay=0.2      # Waits 200ms between scrolls (optional)
)
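For reference, here is a minimal self-contained sketch of where such a call could sit in a script. It is only an illustration under one assumption: instead of calling crawler.crawl directly, it passes the same scan_full_page / scroll_delay flags through CrawlerRunConfig to arun, the config-based pattern used later in this thread.

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    # scan_full_page / scroll_delay are the flags from the 0.4.1 release notes;
    # here they are passed through CrawlerRunConfig rather than directly to crawl().
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        scan_full_page=True,   # scroll through the whole page before extraction
        scroll_delay=0.2,      # 200 ms pause between scroll steps
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())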

helenatthais · Jan 03 '25 05:01

@helenatthais I have fixed this problem in this PR, and here is a discussion about some screenshot parameters that were not mentioned: link

TheCutestCat · Jan 03 '25 07:01

I tried to execute the code from the referenced PR, but the full-page scrolling feature still doesn't work:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(headless=False, verbose=True)

    # Set run configurations, including cache mode and the two scrolling flags
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        # Set these two flags
        scan_full_page=True,
        wait_for_images=True,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://www.nytimes.com/ca/',
            config=crawl_config,
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
helenatthais · Jan 03 '25 08:01

@helenatthais Hi, could you please provide more details about your setup and how you're running the code? I've tested it in my local environment and everything seems to work fine.

One possible cause of the issue might be that the original crawl4ai package is still installed. Could you check if that's the case?
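For example, a quick way to confirm which build is actually loaded (a small sketch using only the standard library; adjust the expected version to whatever release you installed):

# Report the installed crawl4ai version and where it is imported from
from importlib.metadata import version
import crawl4ai

print(version("crawl4ai"))   # version registered with pip
print(crawl4ai.__file__)     # path the package is actually loaded from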

TheCutestCat · Jan 03 '25 08:01

Sure, I installed crawl4ai with pip install crawl4ai and recently upgraded it with the --upgrade flag. I'm trying to run the following code to scrape Google Maps reviews:

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
import json

async def main():
    browser_config = BrowserConfig(headless=False, verbose=True)
    
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=False,
        scan_full_page=True,
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        scroll_delay=2000,
        css_selector="div.GHT2ce.NsCY4, span.wiI7pd",
        exclude_external_links=True,
        exclude_social_media_links=True,
        exclude_external_images=True,
        simulate_user=True
    )
        
    async with AsyncWebCrawler(verbose=True, config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.google.com.br/maps/place/Dra+Regina+C%C3%A9lia+de+Aquino+Barbosa/@-22.8744795,-43.3429393,17z/data=!4m8!3m7!1s0x9962d7809bdfe3:0x9871497b1081f14e!8m2!3d-22.8744795!4d-43.3403644!9m1!1b1!16s%2Fg%2F1wf2320v?entry=ttu&g_ep=EgoyMDI0MTIxMS4wIKXMDSoASAFQAw%3D%3D",
            config=crawl_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

helenatthais · Jan 03 '25 08:01

@helenatthais I see. This is because my PR hasn't been merged into the main branch yet. You can either:

1. Wait for the new version of crawl4ai (which should be available soon), or
2. Use the modified original code (though this will be a bit more complex) by implementing the changes shown here: changes in PR #403

TheCutestCat · Jan 03 '25 08:01

@helenatthais The PR mentioned by @TheCutestCat is now released. Please check if this resolves your issue. You can reopen if the issue persists.

aravindkarnam · Jan 19 '25 16:01

@Pritsudo Could you please raise a new issue (bug report) with all the requested details so we can figure out where the issue is.

aravindkarnam · Jan 22 '25 10:01