Andreas Weigend, Social Data Revolution | MS&E 237, Stanford University, Spring 2011 | Course Wiki

Class_04: HW1 Help Session

Date: April 7, 2011
Initial author: Jason Wei, jwei512@stanford.edu

Key Points

  • HW1 Twitter-Python Help Session

Outline


Today was the Twitter-Python Help Session for HW1 while Prof. Weigend was in Barcelona.

At the beginning of class we briefly went over how to set up Python on a PC, as well as how to install setuptools so that the easy_install script is available.

Then we walked line by line through an example of how to write a scraper for E! News. The code used during the help session is posted below. At the end of the session I answered general questions.



##########################################
## ASSIGNMENT 1 - IN CLASS HELP SESSION ##
##########################################

# see: http://pypi.python.org/pypi/setuptools#downloads for easy_install
# add to PATH for windows
# open command line
# > easy_install twitter
# > easy_install BeautifulSoup
# see: http://www.crummy.com/software/BeautifulSoup/documentation.html for documentation
# see: http://docs.python.org/library/stdtypes.html#dict for Python-Dictionary documentation

import twitter
import sys
import urllib2
import HTMLParser
import operator
from BeautifulSoup import BeautifulSoup

######################
## E! NEWS SCRAPERS ##
######################

URL = "http://feeds.eonline.com/eonline/topstories"

try:
    page = urllib2.urlopen(URL)
except urllib2.URLError:
    print 'Failed to fetch ' + URL
    sys.exit(1)

try:
    soup = BeautifulSoup(page)
except HTMLParser.HTMLParseError:
    print 'Failed to parse ' + URL
    sys.exit(1)

# at this point, would recommend printing out soup to see the RSS feed structure
tags = soup.findAll('title')

# drop the first 2 spurious tags (they are not story titles)
tags = tags[2:]

# print out each utf-8 title
for tag in tags:
    title = tag.string.encode("utf-8")
    print title

# instantiate dictionary
words = {}

for tag in tags:
    # convert the string from Unicode to UTF-8
    title = tag.string.encode("utf-8")
    # split the title on whitespace and iterate through each word
    for w in title.split():
        # filter the word to remove non-alphanumeric symbols
        key = filter(str.isalnum, w)
        # standardize to lower case
        key = key.lower()
        # print the key (lets you inspect the pre-processing result)
        print key
        # check whether the map already has the current word
        if key in words:
            # retrieve the current count from the map
            count = words[key]
            # increment the count
            words[key] = count + 1
        else:
            # if the map has no instance of this word, start the count at 1
            words[key] = 1

# sort the dictionary by their values (i.e. which words had the highest/lowest counts?)
sorted_key_value_tuples = sorted(words.iteritems(), key=operator.itemgetter(1))

# print out sorted list of (word, count) tuples
print sorted_key_value_tuples
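The session code above targets Python 2 and BeautifulSoup 3. For reference, a rough Python 3 equivalent might look like the sketch below, using urllib.request, BeautifulSoup 4 (the bs4 package), and collections.Counter in place of the hand-rolled dictionary counting. The feed URL and the two spurious leading title tags are assumed to be unchanged from the session code; this is a sketch, not a drop-in replacement.

```python
import re
from collections import Counter


def count_words(titles):
    """Count cleaned, lower-cased words across a list of headline strings."""
    words = Counter()
    for title in titles:
        for w in title.split():
            # drop non-alphanumeric characters, like filter(str.isalnum, w) above
            key = re.sub(r"[^0-9a-zA-Z]", "", w).lower()
            if key:
                words[key] += 1
    return words


if __name__ == "__main__":
    # fetching and parsing require BeautifulSoup 4 ("pip install beautifulsoup4")
    import urllib.request
    from bs4 import BeautifulSoup

    URL = "http://feeds.eonline.com/eonline/topstories"
    soup = BeautifulSoup(urllib.request.urlopen(URL), "html.parser")
    # skip the first two non-story <title> tags, as in the session code
    titles = [tag.get_text() for tag in soup.find_all("title")][2:]
    # most_common() returns (word, count) pairs sorted by count, descending
    print(count_words(titles).most_common())
```

Note that Counter handles the "is the key already in the dict?" branching for you, and its most_common() method replaces the sorted(..., key=operator.itemgetter(1)) step, though it sorts in descending rather than ascending order.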