
Monday, December 28, 2009

How to get BeautifulSoup to filter Craigslist HTML

Code

from BeautifulSoup import BeautifulSoup #1
from urllib2 import urlopen #2

site = "http://sfbay.craigslist.org/rea/" #3
table = BeautifulSoup(urlopen(site)) #4
items = table('p') #5
for item in items[:-1]: #6
    itempostlink = item('a')[0]['href'] #7
    itemlink = site[:-5] + itempostlink #8
    print itemlink #9
    soup = BeautifulSoup(urlopen(itemlink)) #10
    body = soup('div',{'id':"userbody"})[0].contents[0] #11
    print body #12


How it works - Line by line
from BeautifulSoup import BeautifulSoup                       #1
from urllib2 import urlopen #2

site = "http://sfbay.craigslist.org/rea/" #3
table = BeautifulSoup(urlopen(site)) #4
items = table('p') #5

Please see my earlier posts on how to screen scrape Craigslist for details on lines #1-#5.


for item in items[:-1]:                                       #6

The slice items[:-1] skips the last item of this result set, as it is not a posting.
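To see what the [:-1] slice does on its own, here is a minimal sketch that uses a plain list as a stand-in for the result set. The list contents are made up for illustration and are not real Craigslist data.

```python
# Hypothetical stand-in for the <p> elements returned by table('p').
items = ["posting 1", "posting 2", "posting 3", "footer (not a posting)"]

# items[:-1] is every element except the last one.
postings = items[:-1]

for posting in postings:
    print(posting)
```

The original list is left unchanged; the slice builds a new, shorter list.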

    itempostlink = item('a')[0]['href']                       #7

This gets the link for each item from the href attribute of its first <a> tag.

    itemlink = site[:-5] + itempostlink                       #8

This appends the link from #7 to the base site link with its trailing "/rea/" (the last five characters) removed.
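As a rough illustration of the link building, the sketch below compares the slicing approach with urljoin from the standard library, which may be less brittle if the listing path ever changes length. The href value is a made-up example, and the sketch assumes the href starts with "/" (root-relative), which is what the slicing approach relies on.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, alongside urllib2

site = "http://sfbay.craigslist.org/rea/"
itempostlink = "/sfc/rea/1234567890.html"  # hypothetical href value

# Approach from the post: drop the trailing "/rea/" (5 characters).
by_slicing = site[:-5] + itempostlink

# Alternative: let urljoin resolve the href against the listing URL.
by_urljoin = urljoin(site, itempostlink)

print(by_slicing)
print(by_urljoin)
```

For a root-relative href like this one, both approaches produce the same URL.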

    print itemlink                                            #9

This displays the full link to the Craigslist item.

    soup = BeautifulSoup(urlopen(itemlink))                   #10

This fetches the Craigslist item page and parses it with BeautifulSoup.

    body = soup('div',{'id':"userbody"})[0].contents[0]       #11

This finds the div with id "userbody" in the item page's HTML and takes the first child of the first match, which holds the body contents.
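To try the same kind of lookup without fetching a live page, here is a rough sketch that uses only the standard library's HTMLParser instead of BeautifulSoup. It is a substitute technique, not what the post's code does. The UserBodyParser class and the sample HTML are made up for illustration, and the sketch assumes the div's first child is plain text.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class UserBodyParser(HTMLParser):
    """Collect the first text node inside <div id="userbody">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.inside = False  # True while inside the target div
        self.body = None     # first text node seen inside the div

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "userbody":
            self.inside = True

    def handle_data(self, data):
        # Keep only the first text node, roughly like .contents[0].
        if self.inside and self.body is None:
            self.body = data

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False


# Hypothetical item page, not real Craigslist markup.
html = ('<html><body><h2>Title</h2>'
        '<div id="userbody">Sunny 2BR near the park.<br>Call after 5pm.</div>'
        '</body></html>')

parser = UserBodyParser()
parser.feed(html)
parser.close()
print(parser.body)
```

If the page has no div with that id, body stays None here, whereas the [0] index in line #11 would raise an IndexError on the empty result.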

    print body                                                #12

This prints the body contents.