NICAR 2023: Web scraping with Python
🔗 bit.ly/nicar23-scraping
This repo contains materials for a half-day workshop at the NICAR 2023 data journalism conference in Nashville on using Python to scrape data from websites.
The session is scheduled for Sunday, March 5, from 9 a.m. - 12:30 p.m. in room Midtown 3 on Meeting Space Level 2.
First step
Open the Terminal application. Copy and paste this text into the Terminal and hit enter:
cd Desktop/hands_on_classes/20230305-sunday-web-scraping-with-python--preregistered-attendees-only && .\env\Scripts\activate
Course outline
- Do you really need to scrape this?
- Process overview:
- Fetch, parse, write data to file
- Some best practices
- Make sure you feel OK about whether your scraping project is (legally, ethically, etc.) allowable
- Don't DDOS your target server
- When feasible, save copies of pages locally, then scrape from those files
- Rotate user-agent strings and other headers if necessary to avoid bot detection
- Using your favorite browser's inspection tools to deconstruct the target page(s)
- See if the data is delivered to the page in a ready-to-use format, such as JSON (example)
- Is the HTML part of the actual page structure, or is it built on the fly when the page loads? (example)
- Can you open the URL directly in an incognito window and get to the same content, or does the page require a specific state to deliver the content (via search navigation, etc.)? (example)
- Are there URL query parameters that you can tweak to get different results? (example)
- Choose tools that make the most sense for your target page(s) -- a few popular options:
  - `requests` and `BeautifulSoup`
  - `playwright` (optionally using `BeautifulSoup` for the HTML parsing)
  - `scrapy` for larger spidering/crawling tasks
- Overview of our Python setup today
- Activating the virtual environment
- Jupyter notebooks
- Running `.py` files from the command line
- Our projects today:
- Maryland WARN notices
- U.S. Senate press gallery
- IRE board members
- South Dakota lobbyist registration data
- Texas Railroad Commission complaints
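The fetch → parse → write workflow outlined above, along with a couple of the best practices (saving a local copy of the page, pausing between requests), might be sketched like this. This is an illustrative example using `requests` and `BeautifulSoup`, not one of the class projects -- the URL and filenames are placeholders:

```python
import csv
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def fetch(url, cache_path):
    """Download the page once, saving a local copy; reuse the copy on later runs."""
    cache = Path(cache_path)
    if not cache.exists():
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        cache.write_text(response.text, encoding="utf-8")
        time.sleep(1)  # be polite -- pause between requests
    return cache.read_text(encoding="utf-8")


def parse_links(html):
    """Pull the text and href of every link on the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a.get("href")}
        for a in soup.find_all("a")
    ]


def write_csv(rows, out_path):
    """Write the parsed rows to a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "href"])
        writer.writeheader()
        writer.writerows(rows)


# Usage (placeholder URL -- swap in your actual target):
#   html = fetch("https://example.com/notices", "notices.html")
#   write_csv(parse_links(html), "notices.csv")
```

Splitting the fetch from the parse means that while you're iterating on your parsing logic, you re-read the saved file instead of re-hitting the server.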
Additional resources
- Need to scrape on a timer? Try GitHub Actions (other options: using your computer's scheduler tools, putting your script on a remote server with a `crontab` configuration, switching to Google Apps Script and setting up time-based triggers, etc.)
- A neat technique for copying data to your clipboard while scraping a Flourish visualization
- Walkthrough: Class-based scraping
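The class-based approach from that walkthrough can be boiled down to a base class that handles fetching, with each target site subclassing it and supplying its own parsing logic. The class and method names below are illustrative, not taken from the walkthrough:

```python
import requests
from bs4 import BeautifulSoup


class Scraper:
    """Minimal base scraper: handles the fetch; subclasses handle the parse."""

    url = None  # each subclass sets its own target URL

    def fetch(self):
        response = requests.get(self.url, timeout=30)
        response.raise_for_status()
        return response.text

    def parse(self, html):
        raise NotImplementedError("subclasses implement page-specific parsing")

    def scrape(self):
        return self.parse(self.fetch())


class HeadlineScraper(Scraper):
    """Hypothetical subclass: grabs every <h2> heading from a page."""

    url = "https://example.com"  # placeholder

    def parse(self, html):
        soup = BeautifulSoup(html, "html.parser")
        return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
```

The payoff comes when you're scraping several similar sites: shared plumbing (requests, retries, caching) lives in the base class, and each new target only needs a short `parse()` override.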
Running this code at home
- Install Python, if you haven't already (here's our guide)
- Clone or download this repo
- `cd` into the repo directory and install the requirements, preferably into a virtual environment using your tooling of choice: `pip install -r requirements.txt`
- `playwright install` to download the browser binaries
- `jupyter notebook` to launch the notebook server