
How to scrape a dynamic website?

Open vChavezB opened this issue 3 years ago • 2 comments

I am trying to export a localhost website that is generated with this project:

https://github.com/HBehrens/puncover

The project serves a localhost website, and each time the user clicks a link the server receives a GET request and generates the HTML in response. In other words, the HTML is produced on the fly whenever the user accesses a link through the browser. At the moment the project cannot export the website to HTML or PDF, so I would like to know how I could recursively follow all the hyperlinks and save a static HTML version of the site. Would this be possible with autoscraper?
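Since the server renders each page on a plain GET request, a recursive mirror can be sketched with only the standard library: fetch a page, extract its same-host links, and repeat for every unseen URL. The start URL below is a hypothetical localhost address, not one taken from puncover's docs.

```python
# Minimal sketch of recursively mirroring a localhost site (stdlib only).
# The start URL is an assumption; adjust it to wherever puncover serves.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def local_links(html, base_url):
    """Return absolute URLs of links that stay on the same host as base_url."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(base_url).netloc
    result = []
    for href in parser.links:
        absolute = urljoin(base_url, href)  # resolve relative hrefs
        if urlparse(absolute).netloc == host:
            result.append(absolute)
    return result


if __name__ == "__main__":
    import urllib.request

    start = "http://localhost:5000/"  # hypothetical puncover address
    seen, queue, pages = set(), [start], {}
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        with urllib.request.urlopen(url) as resp:
            page = resp.read().decode("utf-8", errors="replace")
        pages[url] = page            # save the rendered HTML per URL
        queue.extend(local_links(page, url))
    print(f"fetched {len(pages)} pages")
```

Each entry in `pages` could then be written to disk to produce the static export; rewriting the saved links to relative paths is left out of this sketch.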

vChavezB avatar Feb 04 '22 11:02 vChavezB

It seems no one has answered this yet, and I don't know whether the developers have seen it, but let me try to help. Judging from the scraper file they created, autoscraper uses static scraping libraries such as requests and BeautifulSoup. A dynamic website needs a browser engine to execute the JavaScript parts of the page. Python has libraries such as Selenium and Playwright that drive a browser engine to render the JavaScript of dynamic sites and extract the resulting HTML, but autoscraper does not appear to use them. Maybe it will someday, maybe not; as of November 23rd, 2022, I don't see any dynamic scraping library used in the core file of this project.
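To illustrate the browser-engine approach described above, here is a hedged sketch using Selenium with headless Chrome: the browser loads the page, executes its JavaScript, and the final DOM is read back as HTML. It assumes selenium and a matching Chrome driver are installed; the URL is hypothetical.

```python
# Sketch: fetching a JavaScript-rendered page with Selenium.
# Assumes `pip install selenium` and a Chrome/chromedriver install;
# the import is guarded so the sketch can be read without selenium present.
try:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
except ImportError:  # selenium not installed; treat this as a sketch only
    webdriver = None


def fetch_rendered_html(url):
    """Load `url` in headless Chrome, let its JavaScript run,
    and return the fully rendered DOM as an HTML string."""
    opts = Options()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        return driver.page_source  # DOM *after* JS execution
    finally:
        driver.quit()


if __name__ == "__main__" and webdriver is not None:
    print(fetch_rendered_html("http://localhost:5000/")[:200])
```

By contrast, a plain `requests.get(url).text` returns only the HTML the server sent, with no JavaScript executed.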

P.S: Correct me if I'm wrong.

yafethtb avatar Nov 23 '22 11:11 yafethtb

You can supply an html argument to scraper.build() to use the output of your preferred HTML fetcher, so it should be compatible with Selenium with a bit of manual programming.
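That combination can be sketched as follows: fetch the rendered page with whatever tool you like (Selenium, Playwright, plain requests), then pass the HTML string to `build()` instead of a URL. The `wanted_list` sample value is a placeholder, not something taken from puncover.

```python
# Sketch: training autoscraper on already-fetched HTML via the html= argument.
# Assumes `pip install autoscraper`; the import is guarded so the sketch
# stays readable without the package installed.
try:
    from autoscraper import AutoScraper
except ImportError:  # autoscraper not installed; sketch only
    AutoScraper = None


def build_from_html(rendered_html, wanted_list):
    """Train an AutoScraper model on an HTML string instead of a URL."""
    scraper = AutoScraper()
    # html= tells autoscraper to parse this string rather than fetch a URL
    return scraper.build(html=rendered_html, wanted_list=wanted_list)


if __name__ == "__main__" and AutoScraper is not None:
    page = "<html><body><span>example value</span></body></html>"
    print(build_from_html(page, ["example value"]))
```

The same `html=` argument works on `get_result_similar()` and `get_result_exact()`, so the fetching and the scraping stay fully decoupled.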

lrq3000 avatar Nov 24 '22 22:11 lrq3000