
FetchNode does not fetch any links from the webpage

Open mayurdb opened this issue 1 year ago • 12 comments

Describe the bug FetchNode currently fetches only the static HTML content of the page and does not extract any links. Without links, multi-level scraping won't be possible.

To Reproduce

from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

# Load the page and convert its HTML to plain text;
# no links survive this pipeline
urls = ["https://www.google.com/about/careers/applications/jobs/results/"]
loader = AsyncHtmlLoader(urls)
docs = loader.load()
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)
print(docs_transformed)

Expected behavior Ideally, FetchNode should also provide a way to retrieve the links present on the webpage.
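For reference, the kind of link extraction being asked for can be sketched with just the Python standard library (the helper name `extract_links` is hypothetical here, not part of scrapegraph-ai or langchain):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_content: str) -> list:
    """Return all anchor hrefs found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html_content)
    return parser.links


html = '<p><a href="/jobs/1">Job 1</a> <a href="/jobs/2">Job 2</a></p>'
print(extract_links(html))  # ['/jobs/1', '/jobs/2']
```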

mayurdb avatar May 09 '24 10:05 mayurdb

you should use search_link_node instead

VinciGit00 avatar May 09 '24 10:05 VinciGit00

True, but SearchLinkNode already expects a parsed webpage. Either we should add a way for SearchLinkNode to fetch the content from the URL and then get the links, or better, modify FetchNode to fetch more detailed information.

mayurdb avatar May 09 '24 11:05 mayurdb

For reference, created: https://github.com/langchain-ai/langchain/discussions/21480

mayurdb avatar May 09 '24 11:05 mayurdb

Ok and what if you connect a fetch with a search?

VinciGit00 avatar May 09 '24 13:05 VinciGit00

Assume a webpage structure like this:

  • Page A -> B => Page A has a link to Page B
  • Page A -> C => Page A has a link to Page C
  • Page B -> D => Page B has a link to Page D

Now, any of these combinations of pages may hold the answer to the question/task in the prompt:

  1. Only Page A
  2. Page A, C and D
  3. Page D
  4. Page B
  5. and so on......

So given a webpage,

  1. We should retrieve the answer if possible from it
  2. Continue the search in its children

We should modify the behavior of the nodes:

  • FetchNode: returns the content of the webpage, including the web links.
  • SearchLinkNode: takes an already-parsed page as input and returns the web links relevant to the prompt.

So, for an n-depth search task, starting with WebPage A, we could

  1. FetchNode (A) -> GenerateAnswer (A)
  2. FetchNode (A) -> SearchLinkNode(A) // assume B and C are A's children
  3. FetchNode (B) -> GenerateAnswer (B)
  4. FetchNode (B) -> SearchLinkNode(B) // assume D is the child of B
  5. FetchNode (D) -> GenerateAnswer (D)
  6. FetchNode (D) -> SearchLinkNode(D) // assume D has no children
  7. FetchNode (C) -> GenerateAnswer (C)
  8. FetchNode (C) -> SearchLinkNode(C) // assume C has no children

Let me know your thoughts on this

mayurdb avatar May 09 '24 14:05 mayurdb

Yes, pls modify it

VinciGit00 avatar May 09 '24 14:05 VinciGit00

The bug appears to be in the FetchNode functionality of the AsyncHtmlLoader class from the langchain_community library.

Current Behavior:

  • FetchNode only retrieves the static HTML content of a webpage.
  • Links embedded within the HTML are not captured.

Impact:

  • This restricts the ability to perform multi-level scraping, where you would follow links from one page to another and extract data.

Expected Behavior:

  • FetchNode should ideally extract both the static HTML content and the links present on the webpage.

Possible Fix:

There could be two approaches to achieving this:

  1. Modify FetchNode: The code for FetchNode likely parses the HTML content using libraries like Beautiful Soup or lxml. Modifications to this parsing logic would be required to identify and extract links along with the HTML text.

  2. Introduce a new functionality: A separate function or method could be implemented within the AsyncHtmlLoader class specifically for fetching links. This function would parse the HTML content retrieved by FetchNode and extract the links.

To Reproduce:

The provided code snippet demonstrates the issue:

from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

urls = ["https://www.google.com/about/careers/applications/jobs/results/"]
loader = AsyncHtmlLoader(urls)
docs = loader.load()
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)
print(docs_transformed)

This code attempts to load the webpage at the given URL and then convert the HTML content to text. However, since links are not fetched, the multi-level scraping functionality is hampered.

HowlingNitin avatar May 09 '24 15:05 HowlingNitin

Can you do the second option please?

VinciGit00 avatar May 09 '24 15:05 VinciGit00

from langchain_community.document_loaders import AsyncHtmlLoader
from bs4 import BeautifulSoup  # Assuming BeautifulSoup is not already imported


class AsyncHtmlLoaderWithLinks(AsyncHtmlLoader):

    async def load(self):
        # Existing logic to fetch HTML content for URLs in self.urls
        # ...

        # Call the new function to extract links from each document
        for i, doc in enumerate(self.documents):
            links = self.extract_links(doc)
            self.documents[i]['links'] = links  # Add links as a new key in the document

        return self.documents

    def extract_links(self, html_content):
        """
        Extracts links (href attributes from anchor tags) from the provided HTML content.

        Args:
            html_content (str): The HTML content of the webpage.

        Returns:
            list: A list of URLs extracted from the HTML content.
        """
        soup = BeautifulSoup(html_content, 'html.parser')
        links = []
        for a_tag in soup.find_all('a', href=True):
            links.append(a_tag['href'])
        return links

HowlingNitin avatar May 09 '24 15:05 HowlingNitin

Modify the repo pls

VinciGit00 avatar May 09 '24 15:05 VinciGit00

@HowlingNitin Thanks for the details. Are you taking up these changes?

mayurdb avatar May 09 '24 17:05 mayurdb

What type should be sent to scrapegraphai.nodes.search_link_node as input? I'm trying to figure out what it expects and I have no idea. Also, there is no documentation on scrapegraphai.nodes.search_link_node. Any help would be appreciated, and if I'm not in the right place, please point me to where I could ask this question. I found no other forums or groups. Cheers, Patrick Miron

DragonAngel1st avatar May 11 '24 22:05 DragonAngel1st