Import Python: How to get BeautifulSoup to filter Craigslist HTML

Code


from BeautifulSoup import BeautifulSoup                       #1
from urllib2 import urlopen                                   #2

site = "http://sfbay.craigslist.org/rea/"                     #3 
table = BeautifulSoup(urlopen(site))                          #4 
items = table('p')                                            #5
for item in items[:-1]:                                       #6 
    itempostlink = item('a')[0]['href']                       #7
    itemlink = site[:-5] + itempostlink                       #8
    print itemlink                                            #9
    soup = BeautifulSoup(urlopen(itemlink))                   #10
    body = soup('div',{'id':"userbody"})[0].contents[0]       #11
    print body                                                #12

How it works - Line by line

from BeautifulSoup import BeautifulSoup                       #1
from urllib2 import urlopen                                   #2

site = "http://sfbay.craigslist.org/rea/"                     #3 
table = BeautifulSoup(urlopen(site))                          #4 
items = table('p')                                            #5

Please see my earlier posts on how to screen scrape Craigslist for details on lines #1-#5.

for item in items[:-1]:                                       #6

The slice items[:-1] skips the last item of this result set, because it is not a posting.
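
As a small illustration of the slice, here is a sketch that uses a plain list in place of the BeautifulSoup result set. The list contents are made-up stand-ins, not real Craigslist data, and the snippet is written for Python 3 rather than the Python 2 used in this post.

```python
# Hypothetical stand-ins for the <p> elements returned by table('p').
items = ["posting A", "posting B", "posting C", "not a posting"]

# [:-1] keeps everything except the final element.
postings = items[:-1]

for item in postings:
    print(item)
```

Slicing an empty list also returns an empty list, so the loop simply does nothing if the page has no paragraphs.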

    itempostlink = item('a')[0]['href']                       #7

This takes the first a tag in each item and reads its href attribute, which holds the link to the posting.
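
The lookup can be tried without hitting the network. The sketch below assumes Python 3 and the newer bs4 package instead of the BeautifulSoup 3 import used in this post. The HTML fragment and the helper name posting_links are invented for illustration, and the check for a missing link is an addition that the original loop does not have.

```python
from bs4 import BeautifulSoup

# Made-up listing fragment; real Craigslist markup may differ.
listing_html = """
<p><a href="/rea/111.html">Bright studio</a></p>
<p><a href="/rea/222.html">Two-bedroom flat</a></p>
<p>no link in this paragraph</p>
"""

def posting_links(html):
    """Return the href of the first <a> in every <p> that has one."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup("p"):          # same call style as table('p')
        anchors = item("a")         # all <a> tags inside this <p>
        if anchors and anchors[0].has_attr("href"):
            links.append(anchors[0]["href"])
    return links

print(posting_links(listing_html))
```

Skipping paragraphs without a link avoids the IndexError that item('a')[0] would raise on them.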

    itemlink = site[:-5] + itempostlink                       #8

This appends the link from #7 to the base site link with its last five characters ("/rea/") removed, which gives the full URL of the item.
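
The slice only works if the href is a path starting with "/rea/". That is an assumption about the page markup; the example href below is made up. A sketch of a less fragile alternative is the standard library's urljoin, shown here for Python 3 (in Python 2 the same function lives in the urlparse module).

```python
from urllib.parse import urljoin

site = "http://sfbay.craigslist.org/rea/"
itempostlink = "/rea/1234567890.html"  # hypothetical href value

# The approach from line #8: drop the trailing "/rea/" and append the path.
sliced = site[:-5] + itempostlink

# urljoin resolves the href against the base URL instead of counting characters.
joined = urljoin(site, itempostlink)

print(sliced)
print(joined)
```

Both give the same URL for this href, but urljoin also copes with an href that is already a full URL, where the slice would produce a broken link.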

    print itemlink                                            #9

This displays the full link to the Craigslist item.

    soup = BeautifulSoup(urlopen(itemlink))                   #10

This downloads the Craigslist item page and loads its HTML into BeautifulSoup.

    body = soup('div',{'id':"userbody"})[0].contents[0]       #11

This finds the div with id "userbody" in the item page's HTML and takes its first child node, which is the body contents.
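
Here is a sketch of the same lookup on an inline HTML string, assuming Python 3 and the newer bs4 package. The sample markup and the helper name first_body_node are invented for illustration, and the check for a missing div is an addition that line #11 does not have.

```python
from bs4 import BeautifulSoup

# Made-up item page; real Craigslist markup may differ.
sample = (
    '<html><body><div id="userbody">'
    "Sunny 2BR near the park<br/>Call for details"
    "</div></body></html>"
)

def first_body_node(html):
    """Return the first child of the div with id "userbody", or None."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup("div", {"id": "userbody"})   # same call as line #11
    if not matches or not matches[0].contents:
        return None
    return matches[0].contents[0]

print(first_body_node(sample))
```

Note that contents[0] is only the first child, here the text before the br tag, so longer postings with several text nodes would need the whole contents list.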

    print body                                                #12

This prints the body contents.

Monday, December 28, 2009
