SmartScraperGraph only extracts a fraction of the requested items
**Describe the bug**
It's not quite an error, but I am trying to scrape this AliExpress search page, which lists 60 products on the first page; however, only data for 10 products is returned. This is probably due to how the page is loaded. Is there any parameter I could use to increase the wait time before the source code of the requested page is extracted?
**To Reproduce**
```python
from scrapegraphai.graphs import SmartScraperGraph, ScriptCreatorGraph, OmniScraperGraph, SmartScraperMultiGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "api_key": "MY_KEY",
        "model": "openai/gpt-4o-mini",
    },
    "library": "selenium",
    "verbose": False,
    "headless": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Return the data about the products listed, including product id and product name",
    source="https://pt.aliexpress.com/w/wholesale-TECIDO-PAET%C3%8A-ROSA.html",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
```
Try this, please: https://scrapegraph-doc.onrender.com/docs/Examples/extras/slow_mo
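For reference, that docs page configures the loader to slow down browser operations via `loader_kwargs` in the graph config. A minimal sketch is below; the `slow_mo` value is illustrative, and the key names should be double-checked against the linked docs (note that `slow_mo` applies to the Playwright-based loader, not necessarily to the Selenium path):

```python
# Sketch: slowing down page loading via loader_kwargs (values are illustrative).
graph_config = {
    "llm": {
        "api_key": "MY_KEY",
        "model": "openai/gpt-4o-mini",
    },
    "loader_kwargs": {
        "slow_mo": 10000,  # delay (in ms) applied to browser operations
    },
    "verbose": False,
    "headless": True,
}
```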
@VinciGit00 thanks, but sadly it did not work; it kept returning just 10 results.
Have you tried a config like this?

```python
graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
}
```

`headless` should be `False`.
Tried with `headless` set to `False` too. Same behaviour.
@VinciGit00 I would like to work on this issue. Any additional information about it would help me get started, and if the issue is unassigned, please assign it to me.
Please tell me what you want to do
@VinciGit00 Yes, sure. I will increase the page-loading parameters so that all products are extracted once the page has loaded, and if there is any other bug besides that, I will try to handle it myself.
@VinciGit00 PR - https://github.com/ScrapeGraphAI/Scrapegraph-ai/pull/849
I also have a second method to solve the issue above. Since @sillasgonzaga is using Selenium, here is a custom fetch function (imports included for convenience):
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import time

def selenium_fetch(url, wait_time=5, scroll_pause=2):
    # Configure the Selenium WebDriver
    options = Options()
    # For headless mode on Selenium 4.x, prefer:
    # options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        # Open the URL and allow the initial page load
        driver.get(url)
        time.sleep(wait_time)

        # Simulate scrolling to trigger lazy loading of more products
        last_height = driver.execute_script("return document.body.scrollHeight")
        while True:
            driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)
            time.sleep(scroll_pause)  # Allow time for additional products to load
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:  # Stop once no new content loads
                break
            last_height = new_height

        # Return the fully rendered page source
        return driver.page_source
    finally:
        driver.quit()
```
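The termination logic of the scroll loop above (scroll, wait, compare heights, stop once the height stabilizes) can be exercised without a browser. The helper below is a hypothetical refactoring for illustration, not part of the PR; the fake page simulates lazy loading by growing until it caps out:

```python
def scroll_until_stable(get_height, scroll_once, max_rounds=50):
    """Repeatedly scroll until the reported page height stops growing.

    get_height: callable returning the current page height.
    scroll_once: callable performing one scroll action.
    Returns the final, stable height.
    """
    last_height = get_height()
    for _ in range(max_rounds):
        scroll_once()
        new_height = get_height()
        if new_height == last_height:  # no new content was loaded
            break
        last_height = new_height
    return last_height

# Simulated page: each scroll loads 500px more content until a 3000px cap.
heights = iter([1000, 1500, 2000, 2500, 3000, 3000, 3000])
page = {"height": next(heights)}

def fake_scroll():
    page["height"] = next(heights)

final = scroll_until_stable(lambda: page["height"], fake_scroll)
print(final)  # 3000
```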
Please update to the new version.