How to use crawler.crawl for full-page scrolling?
Although the simulated full-page scrolling feature was released with version 0.4.1, I'm struggling to make it work because I'm still not sure where to call the crawler.crawl function. The docs (https://crawl4ai.com/mkdocs/blog/releases/0.4.1/) cite the following example:
```python
await crawler.crawl(
    url="https://example.com",
    scan_full_page=True,  # Enables scrolling
    scroll_delay=0.2,     # Waits 200ms between scrolls (optional)
)
```
@helenatthais I have fixed this problem with this PR, and here is a discussion about some screenshot parameters that were not mentioned: link
I tried to execute the code from the referenced PR, but the full-page scrolling feature still doesn't work:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig


async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(headless=False, verbose=True)

    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        # Set these two flags
        scan_full_page=True,
        wait_for_images=True,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://www.nytimes.com/ca/',
            config=crawl_config,
        )
```
@helenatthais Hi, could you please provide more details about your setup and how you're running the code? I've tested it in my local environment and everything seems to work fine.
One possible cause of the issue might be that the original crawl4ai package is still installed. Could you check if that's the case?
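To rule out a stale install, one quick way to check which version of a package is actually installed is `importlib.metadata` from the standard library. This is a generic sketch, not crawl4ai-specific; it simply reports what the current Python environment sees:

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints e.g. "0.4.1" if crawl4ai is installed in this environment, else None
print(installed_version("crawl4ai"))
```

If this prints a version older than 0.4.1 (or None), the environment running the script is not the one that was upgraded.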
Sure, I installed crawl4ai with pip install crawl4ai and I've recently upgraded with --upgrade. I'm trying to run the following code to scrape Google Maps reviews:
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator


async def main():
    browser_config = BrowserConfig(headless=False, verbose=True)
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=False,
        scan_full_page=True,
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        scroll_delay=2000,
        css_selector="div.GHT2ce.NsCY4, span.wiI7pd",
        exclude_external_links=True,
        exclude_social_media_links=True,
        exclude_external_images=True,
        simulate_user=True,
    )

    async with AsyncWebCrawler(verbose=True, config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.google.com.br/maps/place/Dra+Regina+C%C3%A9lia+de+Aquino+Barbosa/@-22.8744795,-43.3429393,17z/data=!4m8!3m7!1s0x9962d7809bdfe3:0x9871497b1081f14e!8m2!3d-22.8744795!4d-43.3403644!9m1!1b1!16s%2Fg%2F1wf2320v?entry=ttu&g_ep=EgoyMDI0MTIxMS4wIKXMDSoASAFQAw%3D%3D",
            config=crawl_config,
        )
        print(result.markdown)


if __name__ == "__main__":
    asyncio.run(main())
```
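One detail worth double-checking in the code above: the 0.4.1 release notes pass `scroll_delay=0.2` to wait 200 ms, which suggests the parameter is in seconds; if so, `scroll_delay=2000` would pause for over half an hour per scroll step. A minimal config sketch with a sub-second delay, assuming second-based units as in the release-note example:

```python
from crawl4ai import CacheMode, CrawlerRunConfig

crawl_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    scan_full_page=True,  # scroll through the full page before extraction
    scroll_delay=0.2,     # 200 ms between scroll steps, matching the docs example
)
```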
@helenatthais I understand that. This is because my PR hasn't been merged into the main branch yet. You can either:

- Wait for the new version of crawl4ai (which should be available soon), or
- Use the modified original code (though this will be a bit more complex) by implementing the changes shown here: changes in PR #403
@helenatthais The PR mentioned by @TheCutestCat is now released. Please check if this resolves your issue. You can reopen if the issue persists.
@Pritsudo Could you please raise a new issue (bug report) with all the requested details, so we can figure out where the issue is?