SmartScraperGraph only extracts a small part of items requested

Open sillasgonzaga opened this issue 1 year ago • 4 comments

Describe the bug

It's not quite an error, but I am trying to scrape this AliExpress search page, which lists 60 products on the first page. However, the scraper only returns data for 10 products. This is probably due to how the web page is loaded. Is there any parameter I could use to increase the wait time before the source code of the requested page is extracted?

To Reproduce

from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "api_key": "MY_KEY",
        "model": "openai/gpt-4o-mini",
    },
    "library": "selenium",
    "verbose": False,
    "headless": True
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Return the data about the products listed, including product id and product name",
    source="https://pt.aliexpress.com/w/wholesale-TECIDO-PAET%C3%8A-ROSA.html",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

sillasgonzaga avatar Sep 29 '24 15:09 sillasgonzaga

try with this please https://scrapegraph-doc.onrender.com/docs/Examples/extras/slow_mo

VinciGit00 avatar Sep 29 '24 17:09 VinciGit00

@VinciGit00 thanks but sadly it did not work, it kept returning just 10 results.

sillasgonzaga avatar Sep 29 '24 18:09 sillasgonzaga

Have you tried adding a config like this? Note that headless should be False.

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
}

VinciGit00 avatar Sep 29 '24 18:09 VinciGit00

tried with headless false too. Same behaviour

djds4rce avatar Oct 01 '24 10:10 djds4rce

@VinciGit00 I would like to work on this issue. If you have more information about it, that would help me, and if the issue is unassigned, please assign it to me.

SwapnilSonker avatar Dec 20 '24 02:12 SwapnilSonker

Please tell me what you want to do

VinciGit00 avatar Dec 20 '24 07:12 VinciGit00

@VinciGit00 Sure. I will increase the page-loading wait parameters so that all the products can be extracted once the page has finished loading, and if I find any other bug along the way I will try to handle it as well.

SwapnilSonker avatar Dec 20 '24 07:12 SwapnilSonker

@VinciGit00 PR - https://github.com/ScrapeGraphAI/Scrapegraph-ai/pull/849

I also have a second way to solve this. Since @sillasgonzaga is using Selenium, a custom fetch function can scroll the page until no more products load:

# Imports are included for convenience
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import time

def selenium_fetch(url, wait_time=5, scroll_pause=2):
    # Configure Selenium WebDriver
    options = Options()
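    # Runs with a visible browser window by default. For headless mode,
    # uncomment the next line (recent Selenium 4 releases no longer honor
    # the `options.headless` attribute).
    # options.add_argument("--headless=new")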
    driver = webdriver.Chrome(options=options)

    try:
        # Open the URL
        driver.get(url)
        time.sleep(wait_time)  # Allow initial page load

        # Simulate scrolling to load more products
        last_height = driver.execute_script("return document.body.scrollHeight")
        while True:
            driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)
            time.sleep(scroll_pause)  # Allow time for additional products to load
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:  # Break if no new content is loaded
                break
            last_height = new_height

        # Return the full page source
        return driver.page_source

    finally:
        driver.quit()
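
One caveat with the loop above: it only stops when the page height stops changing, so on a feed that keeps growing it would never return. Below is a minimal sketch of a bounded variant. The helper name `scroll_until_stable` and the `max_rounds` parameter are illustrative assumptions, not part of Selenium or ScrapeGraphAI. The two callables stand in for the `driver.execute_script(...)` calls, so the loop logic can be checked without a browser.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_once: Callable[[], None],
    pause: float = 2.0,
    max_rounds: int = 20,
) -> int:
    """Scroll until the page height stops growing or max_rounds is reached.

    Hypothetical helper: returns the number of scrolls performed.
    """
    last_height = get_height()
    for round_no in range(1, max_rounds + 1):
        scroll_once()
        time.sleep(pause)  # give lazily loaded items time to render
        new_height = get_height()
        if new_height == last_height:  # nothing new was loaded
            return round_no
        last_height = new_height
    return max_rounds  # gave up: the page was still growing


# With a Selenium driver this could be wired up as follows (not run here):
# scroll_until_stable(
#     lambda: driver.execute_script("return document.body.scrollHeight"),
#     lambda: driver.execute_script(
#         "window.scrollTo(0, document.body.scrollHeight)"
#     ),
# )
```

The cap trades completeness for a guaranteed exit. If the function returns `max_rounds`, the page was probably still loading and the limit may need to be raised.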

SwapnilSonker avatar Dec 21 '24 03:12 SwapnilSonker

Please update to the new version.

VinciGit00 avatar Mar 01 '25 20:03 VinciGit00