Web Scrapin’: Focus on Python

Workshop Outline

  1. Introduction: To scrape or not to scrape
    1. Sometimes it's better never to have scraped at all
      • Try asking first!
      • File->Save Page As… is your friend.
      • Consider using a GUI scraping app instead (many require subscriptions, though): Import.io, Portia, Diffbot, Extracty. Good list here.
    2. Don't pay a great deal too dear for what's given freely
      • Does the site have an API? An RSS feed?
        • JSON and XML are much easier to parse than hand-coded or auto-generated HTML.
        • Python (as well as R and other languages) has many modules that are custom-built to scrape specific web sources.
      • Look for bulk data access options (like this), or even just a big “Download” button.
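When a site does offer a JSON API, the parsing step often shrinks to a few lines. A minimal sketch using only the standard library (the endpoint URL and the response shape here are placeholders, not a real API):

```python
import json
from urllib.request import urlopen

def fetch_json(url):
    """Fetch a URL and decode its body as JSON."""
    with urlopen(url) as response:
        return json.load(response)

# In practice you would call something like:
#   data = fetch_json("https://example.com/api/articles?q=python")
# Compare how little work this is against picking apart raw HTML:
payload = '{"articles": [{"title": "Scraping 101", "url": "https://example.com/1"}]}'
data = json.loads(payload)
print(data["articles"][0]["title"])  # Scraping 101
```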
  2. The scrape’s afoot: tips and tricks
    1. Best practices
      • Find the right HTML elements to target: get used to right-clicking to “Inspect Element,” or using the “View Page Source” menu option, to find a good target
      • Consider scholarly open-data requirements: if you can’t publish your results because sharing the scraped data would violate copyright or privacy, that’s a lot of wasted effort.
      • Play nice: when looping, limit your requests to a few per minute, or the site may throttle you (and/or block you entirely)
      • stackoverflow.com usually has an answer
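The “play nice” point above can be sketched in code. This is a minimal throttling loop of my own devising (the function names and the 15-second default are assumptions, not a standard recipe); swap the stand-in `fetch` for urllib or requests in real use, and pick a delay that respects the site’s terms and robots.txt:

```python
import time

def polite_fetch_all(urls, fetch, delay_seconds=15.0):
    """Call fetch(url) for each URL, pausing between requests.

    A ~15-second delay keeps you at roughly four requests per minute.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait *between* requests, not after the last one
        results.append(fetch(url))
    return results

# Example with a stand-in fetch function:
pages = polite_fetch_all(["a", "b"], fetch=lambda u: f"<html>{u}</html>", delay_seconds=0.1)
print(pages)  # ['<html>a</html>', '<html>b</html>']
```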
    2. Scrapable vs. unscrapable sites (and what you can do with the former)
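A rough rule of thumb for scrapable vs. unscrapable: if the data shows up in “View Page Source,” it is static HTML that a parser can reach; if the source shows mostly `<script>` tags and the content only appears later via JavaScript, plain HTML parsing won’t find it. A small sketch of the scrapable case, using only the standard library’s `html.parser` (in the workshop itself you might use BeautifulSoup instead):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Static HTML like this is scrapable:
sample = '<html><body><a href="/one">One</a> <a href="/two">Two</a></body></html>'
collector = LinkCollector()
collector.feed(sample)
print(collector.links)  # ['/one', '/two']
```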
  3. Installing Python
    1. Consider all-in-one packages like Anaconda, plus lightweight development environments like Jupyter Notebook
    2. Installing pip, Python’s package installer
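A quick sanity check at the command line (assuming `python` is on your PATH, which an Anaconda install sets up for you; on some systems the command is `python3`):

```shell
# Check whether pip is already available (Anaconda bundles it):
python -m pip --version

# If it is missing, the standard-library ensurepip module can bootstrap it:
python -m ensurepip --upgrade

# Then install common scraping libraries:
python -m pip install requests beautifulsoup4
```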
  4. Not so scary up close: Python Basics
    1. Code-and-tell
      1. Twitter Scraper
      2. Article-search data and content scraper
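As a taste of what the code-and-tell might build up to, here is a stripped-down article scraper of my own construction (the page structure and class name are assumptions for illustration): it pulls the `<title>` and paragraph text out of a fetched page using only the standard library.

```python
from html.parser import HTMLParser

class ArticleText(HTMLParser):
    """Pull the <title> and <p> text out of an article page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data

# For a live page you would feed in the body from urllib/requests instead:
sample = ("<html><head><title>Demo article</title></head>"
          "<body><p>First.</p><p>Second.</p></body></html>")
parser = ArticleText()
parser.feed(sample)
print(parser.title)       # Demo article
print(parser.paragraphs)  # ['First.', 'Second.']
```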