fix: Augment the information getting fetched from a webpage

Open mayurdb opened this issue 1 year ago • 0 comments

These are follow-up changes from the discussion https://github.com/VinciGit00/Scrapegraph-ai/issues/187

We are now adding a mechanism to fetch the contents of the webpage using BeautifulSoup. Apart from the header and body, we are now also fetching all the URLs on the webpage.
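A minimal sketch of what this kind of extraction could look like with BeautifulSoup. This is not the project's actual code: the function name `extract_page_info`, the returned dictionary keys, and the sample HTML are hypothetical, and "header" is assumed here to mean the page `<title>`.

```python
from bs4 import BeautifulSoup


def extract_page_info(html: str) -> dict:
    """Hypothetical helper: collect the title, body text, and raw link targets."""
    # "html.parser" is the stdlib-backed parser, so no extra parser package is needed.
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: "header" means the <title> element of the page.
    title = soup.title.get_text(strip=True) if soup.title else ""

    # Visible text of <body>, with whitespace-only fragments dropped.
    body = soup.body.get_text(" ", strip=True) if soup.body else ""

    # Every <a> tag that actually carries an href attribute, unmodified.
    links = [a["href"] for a in soup.find_all("a", href=True)]

    return {"title": title, "body": body, "links": links}


sample_html = """<html><head><title>Demo page</title></head>
<body><p>Hello</p><a href="/docs">Docs</a>
<a href="https://example.com/about">About</a><a>No href</a></body></html>"""

info = extract_page_info(sample_html)
print(info["title"])
print(info["links"])
```

Note that the links come back exactly as written in the page, so relative targets such as `/docs` are not yet usable on their own.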

We will need some work to create navigable URLs from the current ones, as sometimes they just point to sub-pages within the website (see the example below)

Generating the navigable URLs and cleaning up the relevant URLs will be taken up in a separate change
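One possible direction for that follow-up, sketched with the standard library only. The function name `to_navigable_urls`, the base URL, and the filtering rules (drop fragments, keep only http/https, de-duplicate) are assumptions for illustration, not decisions made in this issue.

```python
from urllib.parse import urljoin, urldefrag


def to_navigable_urls(base_url: str, hrefs: list[str]) -> list[str]:
    """Hypothetical helper: resolve raw href values against the page URL."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin resolves relative paths; urldefrag strips any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))

        # Assumption: only http(s) links are worth navigating (skips mailto:, etc.).
        if not absolute.startswith(("http://", "https://")):
            continue

        # Keep the first occurrence of each URL, preserving order.
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


urls = to_navigable_urls(
    "https://example.com/blog/",
    ["/docs", "post-1", "mailto:a@b.c", "https://other.org/x", "/docs"],
)
print(urls)
```

In this sketch, `/docs` resolves against the site root while `post-1` resolves against the current directory, which is the standard relative-reference behaviour of `urljoin`.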

May 10 '24 08:05 mayurdb