NICAR 2023: Web scraping with Python
🔗 bit.ly/nicar23-scraping
This repo contains materials for a half-day workshop at the NICAR 2023 data journalism conference in Nashville on using Python to scrape data from websites.
The session is scheduled for Sunday, March 5, from 9 a.m. - 12:30 p.m. in room Midtown 3 on Meeting Space Level 2.
First step
Open the Terminal application. Copy and paste this text into the Terminal and hit enter:
cd Desktop/hands_on_classes/20230305-sunday-web-scraping-with-python--preregistered-attendees-only && .\env\Scripts\activate
Course outline
- Do you really need to scrape this?
- Process overview:
- Fetch, parse, write data to file
- Some best practices
- Make sure you feel OK about whether your scraping project is (legally, ethically, etc.) allowable
- Don't DDOS your target server
- When feasible, save copies of pages locally, then scrape from those files
- Rotate user-agent strings and other headers if necessary to avoid bot detection
- Using your favorite browser's inspection tools to deconstruct the target page(s)
- See if the data is delivered to the page in a ready-to-use format, such as JSON (example)
- Is the HTML part of the actual page structure, or is it built on the fly when the page loads? (example)
- Can you open the URL directly in an incognito window and get to the same content, or does the page require a specific state to deliver the content (via search navigation, etc.)? (example)
- Are there URL query parameters that you can tweak to get different results? (example)
- Choose tools that make the most sense for your target page(s) -- a few popular options:
  - `requests` and `BeautifulSoup`
  - `playwright` (optionally using `BeautifulSoup` for the HTML parsing)
  - `scrapy` for larger spidering/crawling tasks
- Overview of our Python setup today
- Activating the virtual environment
- Jupyter notebooks
- Running `.py` files from the command line
- Our projects today:
- Maryland WARN notices
- U.S. Senate press gallery
- IRE board members
- South Dakota lobbyist registration data
- Texas Railroad Commission complaints
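The fetch → parse → write workflow outlined above, along with a couple of the best practices (saving a local copy of the page, pausing between requests), might be sketched like this. This is an illustrative example using `requests` and `BeautifulSoup`, not one of the class projects -- the URL and filenames are placeholders:

```python
import csv
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def fetch(url, cache_path):
    """Download the page once, saving a local copy; reuse the copy on later runs."""
    cache = Path(cache_path)
    if not cache.exists():
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        cache.write_text(response.text, encoding="utf-8")
        time.sleep(1)  # be polite -- pause between requests
    return cache.read_text(encoding="utf-8")


def parse_links(html):
    """Pull the text and href of every link on the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a.get("href")}
        for a in soup.find_all("a")
    ]


def write_csv(rows, out_path):
    """Write the parsed rows to a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "href"])
        writer.writeheader()
        writer.writerows(rows)


# Usage (placeholder URL -- swap in your actual target):
#   html = fetch("https://example.com/notices", "notices.html")
#   write_csv(parse_links(html), "notices.csv")
```

Splitting the fetch from the parse means that while you're iterating on your parsing logic, you re-read the saved file instead of re-hitting the server.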
Additional resources
- Need to scrape on a timer? Try GitHub Actions (other options: using your computer's scheduler tools, putting your script on a remote server with a `crontab` configuration, switching to Google Apps Script and setting up time-based triggers, etc.)
- A neat technique for copying data to your clipboard while scraping a Flourish visualization
- Walkthrough: Class-based scraping
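The class-based approach from that walkthrough can be boiled down to a base class that handles fetching, with each target site subclassing it and supplying its own parsing logic. The class and method names below are illustrative, not taken from the walkthrough:

```python
import requests
from bs4 import BeautifulSoup


class Scraper:
    """Minimal base scraper: handles the fetch; subclasses handle the parse."""

    url = None  # each subclass sets its own target URL

    def fetch(self):
        response = requests.get(self.url, timeout=30)
        response.raise_for_status()
        return response.text

    def parse(self, html):
        raise NotImplementedError("subclasses implement page-specific parsing")

    def scrape(self):
        return self.parse(self.fetch())


class HeadlineScraper(Scraper):
    """Hypothetical subclass: grabs every <h2> heading from a page."""

    url = "https://example.com"  # placeholder

    def parse(self, html):
        soup = BeautifulSoup(html, "html.parser")
        return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
```

The payoff comes when you're scraping several similar sites: shared plumbing (requests, retries, caching) lives in the base class, and each new target only needs a short `parse()` override.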
Running this code at home
- Install Python, if you haven't already (here's our guide)
- Clone or download this repo
- `cd` into the repo directory and install the requirements, preferably into a virtual environment using your tooling of choice: `pip install -r requirements.txt`
- `playwright install` to download the browser binaries
- `jupyter notebook` to launch the notebook server