in a nutshell: This post will be a little different in nature, as I'm still working on a couple of hacks that will require a few more weeks of work. In the interim, I'll explain the very basics of web scraping/crawling. This is a great skill to have, as it opens the door to the nearly limitless amount of data on the web.
what you need:
- An HTML parser (PyQuery, BeautifulSoup, etc.). Note that this is optional; you can do the parsing yourself with regexes
- urllib2 (part of the Python 2 standard library)

Here's a sample starter script:
#!/usr/bin/env python
import sys
import os
import urllib2  # Python module for making web requests
import re       # Python module for using regular expressions
Easy so far? Now you're all set up to start the fun stuff. For the sake of example, we'll be scraping Hacker News:
BASE_SITE = 'http://news.ycombinator.com/'
TIMEOUT = 15

request = urllib2.Request(BASE_SITE)  # Create a request object
page = None
try:
    handler = urllib2.urlopen(request, timeout=TIMEOUT)  # Make the HTTP request
    page = handler.read()  # Get the huge glob of HTML
    handler.close()
except IOError:
    print 'Failure to open site'  # Handle failure
    sys.exit(1)  # Exit with a non-zero status to signal failure
To start, the urllib2.Request object can specify a ton more than just a URL; it is also how you would spoof your headers (the User-Agent, for example). You can read about it in the urllib2 documentation. We then make the HTTP request, which will time out after TIMEOUT seconds. If we successfully open the page, we can read all the HTML by calling read(). The handler also supplies other methods, such as getcode() for the site's HTTP status code. Now that we have this blob (this is where it's nice to have an HTML parser), we'll use a regex to extract the data we want.
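As a quick sketch of header spoofing (the User-Agent string below is a made-up example, not a recommendation):

#!/usr/bin/env python
import urllib2

# Hypothetical User-Agent; by default urllib2 announces itself as 'Python-urllib/x.y'
headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper)'}
request = urllib2.Request('http://news.ycombinator.com/', headers=headers)
handler = urllib2.urlopen(request, timeout=15)
print handler.getcode()  # HTTP status code, e.g. 200
handler.close()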
linkPattern = r'href="(.*?)"'  # Regex for grabbing all the links on the page
linkRe = re.compile(linkPattern)
matchLinks = linkRe.findall(page)  # Finding all the links on the page
print matchLinks
Bam. There you go: you now have a whole list of links, and the possibilities are endless. You can repeat the same process on each of these links to scrape their data, and so on. This is a super basic example of web scraping, but it can help you get off the ground.
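To make that concrete, here's a rough one-level crawl; the fetch() helper is my own sketch, not part of the script above:

import urllib2
import re

linkRe = re.compile(r'href="(.*?)"')

def fetch(url, timeout=15):
    # Fetch a URL and return its HTML, or None on failure
    try:
        handler = urllib2.urlopen(urllib2.Request(url), timeout=timeout)
        html = handler.read()
        handler.close()
        return html
    except IOError:
        return None

page = fetch('http://news.ycombinator.com/')
if page is not None:
    for link in linkRe.findall(page):
        if not link.startswith('http'):
            continue  # skip relative links for simplicity
        subpage = fetch(link)
        if subpage is not None:
            print link, len(linkRe.findall(subpage))  # count links on each subpage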
areas to explore: Now that you know the basics, you can do so much more:
- Set up a cron job to run your scraper monthly, daily, or even every minute, cron's finest granularity (easily save every article on Hacker News); see the crontab sketch after this list
- Try using an HTML parser to pull more information from a page; see the BeautifulSoup sketch after this list
- Gather data to use a machine learning algorithm on
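For the cron idea, a daily crontab entry might look like this (the script path is hypothetical):

# m h dom mon dow  command
0 3 * * * /usr/bin/python /home/you/hn_scraper.py >> /home/you/hn_scraper.log 2>&1

This runs the scraper every day at 3:00 AM and appends its output to a log file.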
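For the parser idea, here's a minimal sketch using BeautifulSoup 3 (the Python 2-era import style; BeautifulSoup 4 imports from bs4 instead). Unlike the bare regex above, a parser gives you each link's anchor text too:

from BeautifulSoup import BeautifulSoup
import urllib2

page = urllib2.urlopen('http://news.ycombinator.com/', timeout=15).read()
soup = BeautifulSoup(page)

# Grab every anchor that has an href, along with its visible text
for anchor in soup.findAll('a', href=True):
    text = ''.join(anchor.findAll(text=True))
    print anchor['href'], '->', text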
Get the script here