nicar23-python-scraping

Materials for a half-day class at NICAR23 on using Python to scrape data from websites.

NICAR 2023: Web scraping with Python

🔗 bit.ly/nicar23-scraping

This repo contains materials for a half-day workshop at the NICAR 2023 data journalism conference in Nashville on using Python to scrape data from websites.

The session is scheduled for Sunday, March 5, from 9 a.m. to 12:30 p.m. in room Midtown 3 on Meeting Space Level 2.

First step

Open the Terminal application, then copy and paste this command into it and press Enter. (The command uses Windows path syntax.)

cd Desktop/hands_on_classes/20230305-sunday-web-scraping-with-python--preregistered-attendees-only & .\env\Scripts\activate

Course outline

  • Do you really need to scrape this?
  • Process overview:
    • Fetch, parse, write data to file
    • Some best practices
      • Make sure you feel OK about whether your scraping project is (legally, ethically, etc.) allowable
      • Don't DDoS your target server: space out your requests instead of firing them all at once
      • When feasible, save copies of pages locally, then scrape from those files
      • Rotate user-agent strings and other headers if necessary to avoid bot detection
  • Using your favorite browser's inspection tools to deconstruct the target page(s)
    • See if the data is delivered to the page in a ready-to-use format, such as JSON (example)
    • Is the HTML part of the actual page structure, or is it built on the fly when the page loads? (example)
    • Can you open the URL directly in an incognito window and get to the same content, or does the page require a specific state to deliver the content (via search navigation, etc.)? (example)
    • Are there URL query parameters that you can tweak to get different results? (example)
  • Choose tools that make the most sense for your target page(s) -- a few popular options:
  • Overview of our Python setup today
    • Activating the virtual environment
    • Jupyter notebooks
    • Running .py files from the command line
  • Our projects today:
    • Maryland WARN notices
    • U.S. Senate press gallery
    • IRE board members
    • South Dakota lobbyist registration data
    • Texas Railroad Commission complaints
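
The "fetch, parse, write data to file" process from the outline, along with the advice to pause between requests and to save local copies of pages, can be sketched using only the Python standard library. This is a minimal illustration, not the workshop's notebook code: the function names, the cache file, the User-Agent value, and the sample table are all made up here, and the class projects may well use other libraries.

```python
import csv
import time
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import Request, urlopen


def fetch(url, cache_path, delay=2.0):
    """Return the page HTML, downloading it only if no local copy exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Scrape from the saved copy instead of hitting the server again.
        return cache_path.read_text(encoding="utf-8")
    # The User-Agent value is a placeholder; use something that identifies you.
    req = Request(url, headers={"User-Agent": "scraping-example (you@example.com)"})
    with urlopen(req, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    cache_path.write_text(html, encoding="utf-8")
    time.sleep(delay)  # pause after a real request so we don't hammer the server
    return html


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table rows found in html as a list of lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def write_csv(rows, out_path):
    """Write the rows to a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


# Made-up sample page so the parsing step runs without a network connection.
SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Employees</th></tr>
  <tr><td>Acme Corp.</td><td>120</td></tr>
  <tr><td>Widgets  Inc.</td><td>45</td></tr>
</table>
"""

rows = parse_table(SAMPLE_HTML)
```

Against a live page, the three steps would chain together as `write_csv(parse_table(fetch(url, "page.html")), "data.csv")`, with `url` and both file names chosen by you.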
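
Two of the inspection questions in the outline, whether the data already arrives as JSON and whether URL query parameters can be tweaked, often lead to a scraper with no HTML parsing at all. The sketch below assumes a hypothetical search URL on example.com and a made-up JSON response; `with_params` is a helper name invented for this example.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_params(url, **changes):
    """Return url with the given query parameters added or replaced."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params.update({key: str(value) for key, value in changes.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical search URL; example.com stands in for a real target site.
base = "https://example.com/search?year=2022&page=1"

# Build the URLs for pages 1-3 of the 2023 results by changing two parameters.
page_urls = [with_params(base, year=2023, page=n) for n in range(1, 4)]

# Made-up response body. If the endpoint returns JSON like this,
# json.loads() gives you ready-to-use Python data.
sample_response = '{"results": [{"name": "Acme Corp.", "employees": 120}]}'
records = json.loads(sample_response)["results"]
```

The browser's network tab is usually the quickest way to find out which parameters a real page accepts and what its responses look like.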

Additional resources

Running this code at home

  • Install Python, if you haven't already (here's our guide)
  • Clone or download this repo
  • cd into the repo directory and install the requirements, preferably into a virtual environment using your tooling of choice: pip install -r requirements.txt
  • Run playwright install to download the browsers that Playwright drives
  • Run jupyter notebook to launch the notebook server
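
After the install steps, a quick way to confirm that the environment is ready is to check that the key packages can be found. This is only a sketch: the import names `playwright` and `notebook` are assumed from the commands above, not read from requirements.txt, and `missing_packages` is a helper name invented for this example.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that Python cannot find."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names, based on the playwright and jupyter notebook commands.
expected = ["playwright", "notebook"]

missing = missing_packages(expected)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All expected packages were found.")
```

If anything is reported as not installed, check that the virtual environment is activated before rerunning pip install -r requirements.txt.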