FetchNode does not fetch any links from the webpage
Describe the bug FetchNode currently only fetches the static HTML content from the page and does not fetch any links. Without links, multi-level scraping won't be possible.
To Reproduce
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer
urls = ["https://www.google.com/about/careers/applications/jobs/results/"]
loader = AsyncHtmlLoader(urls)
docs = loader.load()
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)
print(docs_transformed)
Expected behavior Ideally, there should be a way for FetchNode to retrieve the links from the webpage as well
You should use search_link_node instead
True, but SearchLinkNode already expects a parsed webpage. Either we add a way for SearchLinkNode to fetch the content from the URL and then extract the links, or, better, we modify FetchNode to return more detailed information.
For reference, created: https://github.com/langchain-ai/langchain/discussions/21480
Ok and what if you connect a fetch with a search?
Assume a webpage structure like this:
- Page A -> B: Page A has a link to Page B
- Page A -> C: Page A has a link to Page C
- Page B -> D: Page B has a link to Page D
Now any combination of pages may hold the answer to the question/task in the prompt:
- Only Page A
- Page A, C and D
- Page D
- Page B
- and so on
So given a webpage,
- We should retrieve the answer if possible from it
- Continue the search in its children
We should modify the behavior of the nodes:
- FetchNode: returns the content of the webpage, including its web links
- SearchLinkNode: takes an already-parsed page as input and returns the links relevant to the prompt
So, for an n-depth search task, starting with webpage A, we could:
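To make the proposed contracts concrete, here is a minimal sketch of what the two nodes could exchange. The function names and the dict-based state are illustrative assumptions, not the actual scrapegraphai API; the stub page store stands in for real HTTP fetching, and the keyword filter stands in for LLM-based relevance ranking.

```python
# Hypothetical node contracts (names and shapes are assumptions, not the real API):
#   FetchNode:      url -> {"url", "content": <page text>, "links": [<all hrefs>]}
#   SearchLinkNode: fetched state + prompt -> [<links relevant to the prompt>]

def fetch_node(url, pages):
    """Sketch: return both the content and the links of a page (from a stub store)."""
    page = pages[url]
    return {"url": url, "content": page["content"], "links": page["links"]}

def search_link_node(state, keyword):
    """Sketch: filter the already-fetched links by a naive relevance test."""
    return [link for link in state["links"] if keyword in link]

pages = {"A": {"content": "careers page", "links": ["/jobs/1", "/about"]}}
state = fetch_node("A", pages)
print(search_link_node(state, "jobs"))  # → ['/jobs/1']
```

With this split, FetchNode stays responsible for everything that requires touching the network, while SearchLinkNode remains a pure function over already-fetched state.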
- FetchNode (A) -> GenerateAnswer (A)
- FetchNode (A) -> SearchLinkNode(A) // assume B and C are A's children
- FetchNode (B) -> GenerateAnswer (B)
- FetchNode (B) -> SearchLinkNode(B) // assume D is B's child
- FetchNode (D) -> GenerateAnswer (D)
- FetchNode (D) -> SearchLinkNode(D) // assume D has no children
- FetchNode (C) -> GenerateAnswer (C)
- FetchNode (C) -> SearchLinkNode(C) // assume C has no children
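The walkthrough above is a breadth-first traversal of the link graph. As a rough sketch (under the assumption that fetching, answering, and link search can be modeled as plain functions; the toy `LINKS` graph and `answers` map replace real FetchNode/GenerateAnswer/SearchLinkNode calls):

```python
from collections import deque

# Toy link graph from the example above: A links to B and C, B links to D.
LINKS = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}

def search(start, answers):
    """Breadth-first sketch of the FetchNode -> GenerateAnswer -> SearchLinkNode loop.

    `answers` maps page -> answer (or None); in a real graph these steps would
    be FetchNode, GenerateAnswer, and SearchLinkNode respectively.
    """
    found, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()          # FetchNode(page)
        if answers.get(page) is not None:
            found.append((page, answers[page]))  # GenerateAnswer(page)
        for child in LINKS.get(page, []):        # SearchLinkNode(page)
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return found

print(search("A", {"A": None, "B": None, "C": None, "D": "42"}))  # → [('D', '42')]
```

The `seen` set matters in practice: real webpages link back to each other, so without cycle detection an n-depth search would never terminate.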
Let me know your thoughts on this
Yes, pls modify it
The bug appears to be in the FetchNode functionality of the AsyncHtmlLoader class from the langchain_community library.
Current Behavior:
- FetchNode only retrieves the static HTML content of a webpage.
- Links embedded within the HTML are not captured.
Impact:
- This restricts the ability to perform multi-level scraping, where you would follow links from one page to another and extract data.
Expected Behavior:
- FetchNode should ideally extract both the static HTML content and the links present on the webpage.
Possible Fix:
There could be two approaches to achieving this:
- Modify FetchNode: The code for FetchNode likely parses the HTML content using a library like Beautiful Soup or lxml. This parsing logic would need to be extended to identify and extract links along with the HTML text.
- Introduce new functionality: A separate function or method could be implemented within the AsyncHtmlLoader class specifically for fetching links. This function would parse the HTML content retrieved by FetchNode and extract the links.
To Reproduce:
The provided code snippet demonstrates the issue:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer
urls = ["https://www.google.com/about/careers/applications/jobs/results/"]
loader = AsyncHtmlLoader(urls)
docs = loader.load()
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)
print(docs_transformed)
This code attempts to load the webpage at the given URL and then convert the HTML content to text. However, since links are not fetched, the multi-level scraping functionality is hampered.
Can you do the second option please?
from bs4 import BeautifulSoup  # Assuming BeautifulSoup is not already imported
from langchain_community.document_loaders import AsyncHtmlLoader as BaseAsyncHtmlLoader

class AsyncHtmlLoader(BaseAsyncHtmlLoader):
    async def load(self):
        # Existing logic to fetch HTML content for URLs in self.urls
        # ...
        # Extract links from each fetched document and store them in its metadata
        # (langchain Documents are not dicts, so use doc.metadata, not doc['links'])
        for doc in self.documents:
            doc.metadata["links"] = self.extract_links(doc.page_content)
        return self.documents

    def extract_links(self, html_content):
        """
        Extracts links (href attributes from anchor tags) from the provided HTML content.

        Args:
            html_content (str): The HTML content of the webpage.

        Returns:
            list: A list of URLs extracted from the HTML content.
        """
        soup = BeautifulSoup(html_content, "html.parser")
        return [a_tag["href"] for a_tag in soup.find_all("a", href=True)]
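The link-extraction logic can be sanity-checked in isolation. Here is a standalone sketch using only the standard library's `html.parser` (so it runs even without BeautifulSoup installed); it mirrors the same "collect href from every anchor tag" rule:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Stdlib-only equivalent of extract_links: collect href attributes of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed(
    '<a href="https://example.com/a">A</a>'
    '<a href="/rel">relative</a>'
    '<a name="anchor-only">skipped</a>'
)
print(parser.links)  # → ['https://example.com/a', '/rel']
```

Note that anchors without an `href` are skipped, and relative URLs come back as-is; a real implementation would likely want to resolve them against the page URL with `urllib.parse.urljoin`.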
Modify the repo pls
@HowlingNitin Thanks for the details. Are you taking up these changes?
What type should be sent to scrapegraphai.nodes.search_link_node as input? I'm trying to figure out what it expects and I have no idea. Also, there is no documentation on scrapegraphai.nodes.search_link_node. Any help would be appreciated, and if I'm not in the right place, please point me to where I could ask this question. I found no other forums or groups. Cheers, Patrick Miron