Web Scraping for your Mom

in a nutshell: This post will be a little different in nature, as I’m still working on a couple of hacks that will require a few more weeks of work. In the interim I’ll explain the very basics of web scraping/crawling. This is a great skill to have, as it opens the door to the limitless amount of data on the web.

what you need:

  1. Python
  2. An HTML parser (PyQuery, BeautifulSoup, etc.). Note that this is optional; you can do the parsing yourself with regexes.

implementation: To start, we’ll go over the very basic idea. The goal is to have a few starter links, and either scrape more links off of them or scrape their data and store it in a database. To do this we’ll use the Python module urllib2. Here is a sample starter script:
 
#!/usr/bin/env python
import sys
import urllib2  # Python module for making web requests
import re       # Python module for using regular expressions

Easy so far? Now you’re all set up to start the fun stuff. For the sake of example, we’ll be scraping Hacker News:

 
BASE_SITE = 'http://news.ycombinator.com/'
TIMEOUT = 15

request = urllib2.Request(BASE_SITE)  # Create a request object
page = None
 
try:
    handler = urllib2.urlopen(request, timeout=TIMEOUT)  # Make the HTTP request
    page = handler.read()  # Get the huge glob of HTML
    handler.close()
except IOError:
    print 'Failure to open site'  # Handle failure
    sys.exit(1)  # Exit with a non-zero status to signal failure
 

To start, the Python Request object can specify a ton more stuff, and it is also how you would spoof your headers. You can read about it in the urllib2 documentation. Then we make the HTTP request, which will time out after TIMEOUT seconds. If we’re successful in opening the page, we can read all the HTML by calling read(). The handler also supplies other functions that let you get the website’s status code and other fun things.
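Here’s a minimal sketch of both ideas, reusing BASE_SITE and TIMEOUT from above (the User-Agent string is just an example value, not something the site requires):

headers = {'User-Agent': 'Mozilla/5.0'}  # Example value: pretend to be a browser
request = urllib2.Request(BASE_SITE, headers=headers)  # Request with spoofed headers
handler = urllib2.urlopen(request, timeout=TIMEOUT)
print handler.getcode()  # The HTTP status code, e.g. 200
print handler.geturl()   # The final URL after any redirects

Now that we have this blob (this is where it is nice to have an HTML parser), we’ll use a regex to extract the data we want.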

 
linkPattern = r'href="(.*?)"'  # Regex for grabbing all the links on the page
linkRe = re.compile(linkPattern)
matchLinks = linkRe.findall(page)  # Find all the links on the page
print matchLinks

Bam. There you go: you now have a whole list of links, and the possibilities are endless. You can repeat the same process on each of these links to scrape their data, and so on. This is a super basic example of web scraping, but it can help you get off the ground.
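As a rough sketch of that second hop (again reusing BASE_SITE and TIMEOUT, and assuming you only want a handful of pages):

import urlparse  # Python 2 module for resolving relative URLs

for link in matchLinks[:5]:  # Be polite: only follow a few links
    url = urlparse.urljoin(BASE_SITE, link)  # Resolve relative hrefs like 'item?id=...'
    try:
        subpage = urllib2.urlopen(url, timeout=TIMEOUT).read()
    except IOError:
        continue  # Skip links that fail to open
    # ...run another regex over subpage, or store it in a database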

areas to explore: Now that you know the basics, you can do so much more:

  • Set up a cron job to run your scraper monthly, daily, or even every second (easily save every article on Hacker News)
  • Try using an HTML parser to pull more information from a page (see the sketch after this list)
  • Gather data to use a machine learning algorithm on
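As a starting point for the parser idea, here’s a minimal sketch using BeautifulSoup 4 (assuming it’s installed, e.g. via pip install beautifulsoup4); it pulls every link’s URL and visible text instead of using the regex above:

from bs4 import BeautifulSoup  # Third-party HTML parser

soup = BeautifulSoup(page)  # Parse the blob of HTML from earlier
for a in soup.find_all('a'):  # Iterate over every anchor tag
    print a.get('href'), a.get_text()  # The link URL and its visible text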

Get the script here
