Friday, December 25, 2009

How to screen scrape Craigslist using Python and BeautifulSoup

Code
from BeautifulSoup import BeautifulSoup   #1
from urllib2 import urlopen               #2

site = "http://sfbay.craigslist.org/rea/" #3
html = urlopen(site)                      #4
soup = BeautifulSoup(html)                #5
postings = soup('p')                      #6

for post in postings:                     #7
    print post('a')[1].contents[0]        #8
    print post('a')[1]['href']            #9

How this works - line by line:
from BeautifulSoup import BeautifulSoup   #1

This imports the BeautifulSoup class from the BeautifulSoup module.
BeautifulSoup is a third-party module that has to be installed separately; you can get it from here

from urllib2 import urlopen               #2

This imports the urlopen function from the urllib2 module.
urllib2 is part of the Python 2 standard library, so there is nothing extra to install.

site = "http://sfbay.craigslist.org/rea/" #3

This defines the variable site as the URL of a Craigslist listing page,
in this case San Francisco Bay Area real estate.
You can change the value of site to match the page you want to scrape.
For example, San Francisco Bay Area cars+trucks would be
"http://sfbay.craigslist.org/cta/"

html = urlopen(site)                      #4

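This defines the variable html as the response returned by opening the URL stored in site. It is a file-like object rather than a string, and BeautifulSoup reads the HTML source code from it.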
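
If you want the HTML source as a string (to print it or save it, say), you can read it out of the response yourself. Here is a minimal sketch of that. The helper name fetch_html and the utf-8 default are my own assumptions, not part of the script above, and the try/except import is only there so the same code runs on Python 3, where urlopen lives in urllib.request.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2, as used in this post


def fetch_html(url, encoding="utf-8"):
    """Open url and return its body as text.

    fetch_html is a hypothetical helper; the utf-8 default is an
    assumption and may not match every page.
    """
    response = urlopen(url)              # file-like response object
    try:
        return response.read().decode(encoding)
    finally:
        response.close()                 # always release the connection
```

Calling fetch_html(site) would then give you the page source as text, which can be passed to BeautifulSoup in the same way as the response object.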

soup = BeautifulSoup(html)                #5

This defines the variable soup as the HTML after it has been parsed by BeautifulSoup.

postings = soup('p')                      #6

This defines the variable postings as all of the 'p' tags in the variable soup. Calling soup('p') is shorthand for soup.findAll('p'). postings is a ResultSet, i.e. a list of tag objects.

for post in postings:                     #7

Since postings is a ResultSet, we need to loop through it to get each individual post's header and link.

    print post('a')[1].contents[0]        #8

This returns the header of the post. It works as follows: the header is the contents of the second 'a' tag within each 'p' tag (i.e. each posting).

Thus, post('a') returns a ResultSet of the 'a' tags inside the 'p' tag, and post('a')[1] returns the second 'a' tag in that set, because indexing starts at 0. Finally, .contents[0] is the first child of that tag, which here is its text.

To understand this more clearly, I've provided the following example. If the HTML looks like this:
<p>
    <a>hello</a>
    <a>world</a>
</p>
<p>
    <a>my</a>
    <a>name</a>
    <a>is</a>
</p>


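then after this has been passed through soup = BeautifulSoup(html), the text of each 'a' tag can be read as follows:
soup('p')[0]('a')[0].contents[0] = hello
soup('p')[0]('a')[1].contents[0] = world
soup('p')[1]('a')[0].contents[0] = my
soup('p')[1]('a')[1].contents[0] = name
soup('p')[1]('a')[2].contents[0] = is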
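
One thing to watch for with this kind of indexing: if a 'p' tag has fewer than two 'a' tags, post('a')[1] raises "IndexError: list index out of range". A small guard can avoid that. This is only a sketch. The helper name second_item is made up for illustration, and plain lists stand in here for what post('a') would return (a ResultSet behaves like a list).

```python
def second_item(items):
    """Return items[1], or None when there are fewer than two items.

    second_item is a hypothetical helper, not part of BeautifulSoup.
    """
    return items[1] if len(items) > 1 else None


# Plain lists stand in for the ResultSet that post('a') would return
print(second_item(["hello", "world"]))  # world
print(second_item(["only one"]))        # None
```

In the loop you could then write link = second_item(post('a')) and skip the post when link is None, instead of indexing with [1] directly.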

    print post('a')[1]['href']            #9


To get the link, you need to read the ['href'] attribute of the same 'a' tag in each post. Note that the link returned is relative, so it needs to be appended to the appropriate prefix (the site's base URL) to form a complete URL.
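
Rather than gluing the prefix on by hand, one option is to let the standard library join the two pieces. The sketch below uses urljoin, which is in urlparse on Python 2 and urllib.parse on Python 3. The helper name absolute_link and the sample href values are made up for illustration and are not real listings.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

site = "http://sfbay.craigslist.org/rea/"


def absolute_link(href, base=site):
    """Resolve a possibly relative href against the page it came from.

    absolute_link is a hypothetical helper name.
    """
    return urljoin(base, href)


# Hypothetical href values, made up for illustration
print(absolute_link("/eby/reb/123.html"))
print(absolute_link("http://sfbay.craigslist.org/sfc/reb/456.html"))
```

urljoin leaves an href that is already a full URL unchanged, so the same call is safe whether the page uses relative or absolute links.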

2 comments:

  1. Thanks for the tutorial. Why do I get this?
    IndexError: list index out of range

    1. It looks like Craigslist has changed their HTML. I'll update the post
