Andreas Weigend, Social Data Revolution | MS&E 237, Stanford University, Spring 2011 | Course Wiki Home


MS&E 237 Assignment 1 – Identifying Social Trends

Key Insights:


A Breakdown by News Category

  • Strong Correlations
    • Entertainment / Pop Culture (e.g. Law & Order, Dream Concert, celebrities)
    • Sports (e.g. Manny Ramirez, Tiger Woods)
    • Breaking News
  • Weak Correlations
    • World News
    • Politics
    • Business

Notable Findings / Ideas

  • “I was surprised that the major names of the Masters tournament didn’t show up as much. “Tiger Woods” showed up far more often in the scraper than in what was actually trending on Twitter. One hypothesis is that there is a disconnect between what the writers think the audience is interested in (Tiger Woods) and what is actually being talked about (Schwartzel and McIlroy). Maybe the authors used famous people to ‘hook’ the audience into clicking the article to read and then mention the relatively obscure winners. That being said, maybe Tiger Woods was the real winner in the Masters because of all the publicity he received.”

  • “In order to get better correlation between the Twitter results and the web scraping results, I set about running a new Twitter search based on a query of "gadget" and pulling 500 "recent" tweets to try and home in on what people are currently talking about regarding technology. Using these tweets and the NLTK module, I found the top 40 words from these tweets. Comparing this new data to the top 50 words found through the web scrapers, similar terms started to appear, including some of the most popular terms found by the web scrapers (e.g. Apple, iPad, Tablet). These results show that although the most popular Twitter trends may not really allow for good comparison with any specific segment of society, a quick tweak in terms of the search method (using a general term like gadget) can provide good results in terms of matching what people are talking about with the current news for that segment.”

  • "To do a fair comparison between Twitter trends and world news trends, one would have to normalize Twitter trends to equalize tweets from all parts of the world (i.e. weight each word by the inverse frequency of the population of tweeters in that region... this is very similar to the concept of Term-Frequency / Inverse-Document-Frequency, which is the most fundamental concept in Information Retrieval and Search)."


  • Nice tag cloud from Asha Gupta who scraped India's top news and made this graphic using Wordle.
asha_gupta_tagcloud.png
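
The regional normalization suggested in the third quote above could be sketched roughly as follows (Python 3). This is an illustration only, not the student's code: the region names, tweeter populations, and word counts are made-up values, and `normalize_by_region` is a hypothetical helper. Each region's raw word count is divided by that region's number of tweeters, so a region with many tweeters does not automatically dominate the combined ranking.

```python
from collections import Counter

def normalize_by_region(counts_by_region, tweeters_by_region):
    """Weight each word count by the inverse of its region's tweeter population."""
    weighted = Counter()
    for region, counts in counts_by_region.items():
        population = tweeters_by_region[region]
        for word, count in counts.items():
            weighted[word] += count / population
    return weighted

# Made-up numbers for illustration only.
counts_by_region = {
    "US": Counter({"masters": 900, "libya": 100}),
    "India": Counter({"cricket": 95, "libya": 5}),
}
tweeters_by_region = {"US": 1000, "India": 100}

weighted = normalize_by_region(counts_by_region, tweeters_by_region)

# Compare the top word before and after normalization.
raw_total = counts_by_region["US"] + counts_by_region["India"]
top_raw = raw_total.most_common(1)[0][0]
top_weighted = weighted.most_common(1)[0][0]
print(top_raw, top_weighted)
```

With these invented numbers, the word that leads the raw counts comes from the larger region, while the normalized ranking lets the smaller region's most discussed word come through.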

PDF Version: http://weigend.com/files/teaching/stanford/2011/homeworks/mse237_2011_hw1.pdf
Due Date: April 12th, 2011 (By Noon)
Help Session: April 7th, 2011 (In Class)

Learning Goals:
· Learn to install and program in Python
· Learn how to use the Twitter-Python API
· Learn how to write simple Web Scrapers using BeautifulSoup
· Learn how to do basic frequency analysis of text

Resources:
· Mining the Social Web – Chapters 1 and 2
· BeautifulSoup Documentation - http://www.crummy.com/software/BeautifulSoup/documentation.html
· Python Documentation - http://docs.python.org/library/stdtypes.html#dict

Submission Details:
In a DOC/PDF, please include the following:
  1. The top 10 Twitter trends that you found using the Twitter-Python API
  2. What category of news you chose, and what RSS feeds you used in that category (include the URL of the RSS feeds)
  3. The top 50 most frequent words that you found using your aggregated scrapers
  4. A brief overview of your findings – did you get good results? Did a lot of the most frequent words relate to the trending topics you found on Twitter? If you didn’t get good results, hypothesize on what might have happened and what you think would work better.
  5. Your Python source code (as an Appendix)
  6. Email to mse237@gmail.com with subject line “HW1 – [YOUR SUNet ID]”

Assignment Details:
One interesting question regarding social data is: what are people talking about right now? Phrased differently, what do people currently care about? What’s on people’s minds? The goal of this assignment is to teach you two methodologies for detecting these social trends.
One way is to use the well-known social service Twitter and leverage Twitter’s wealth of data to help us solve this problem. Thankfully, our friends at Twitter have already made public an API (see http://en.wikipedia.org/wiki/Application_programming_interface) that we can use to programmatically query for Twitter’s data (e.g. current trends, your tweets, your friends’ tweets, your followers).
Step 1 – Setting Up Your Coding Environment
The first step is to follow the instructions in Chapter 1 of the course textbook and install Python if your machine doesn’t already have it. The course textbook suggests installing a version of Python called ActivePython, which comes pre-installed with Easy Install, a tool that will later help us install necessary libraries. However, if you already have a version of Python installed, you can manually download Easy Install by going to http://pypi.python.org/pypi/setuptools#downloads and following the instructions.
Step 2 – The Twitter-Python API
Once you have your coding environment set up, go to page 4 of the text, the section titled “Tinkering with Twitter’s API”, and learn how to interact with Twitter’s API from your Python terminal.
Your first job is to write code that gets the current trending topics from Twitter, and then to include these trending topics in your write-up. (Hint: see Example 1-3. Note that there is a bug in the course text: the domain should be “api.twitter.com” instead of “search.twitter.com” in the instantiation. Everything else stays the same.)
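
As a rough sketch of what comes after the API call, the snippet below pulls topic names out of a trends response. It assumes, following the text's Example 1-3, that the response is a dictionary whose `trends` key holds a list of entries with a `name` field. `sample_response` is a hand-written stand-in for a live API result, and `extract_trend_names` is a hypothetical helper, so treat the shape as an assumption to check against what your own call returns.

```python
def extract_trend_names(response, limit=10):
    """Return up to `limit` topic names from a trends response dictionary."""
    return [trend["name"] for trend in response.get("trends", [])][:limit]

# Hand-written stand-in for the API response; the shape is an assumption.
sample_response = {
    "trends": [
        {"name": "Moussa Koussa"},
        {"name": "Masters"},
        {"name": "iPad"},
    ]
}

trend_names = extract_trend_names(sample_response)
print(trend_names)
```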
Once you’ve successfully used Twitter’s Python API and queried for the current trending topics, it’s time to move on to the more exciting and creative part of the assignment!
Step 3 – An Intro to Web Scraping & Choosing Your News Category
Now, sometimes we get lucky and we can easily retrieve data by accessing a company’s API. However, sometimes we aren’t so lucky and we need to figure out a way to programmatically retrieve data. This is often done through web scraping (see Chapter 2 in the text, specifically page 24, section “A Breadth-First Crawl of XFN Data”), which acts as the basis for our second methodology on how to detect social trends.
Think back to a time when Twitter and Facebook did not exist (shocking, I know). If you were given the task of guessing what today’s trending social topics might be, where online would you go look? My guess is at various online news sites, and that’s exactly what we are going to try to do. By looking at various online news sites and using frequency analysis (see page 7 of the text for more details) on the news article titles, we attempt to gauge which words appear most frequently and use those frequent words as our guess for what topics are trending.
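
A minimal sketch of frequency analysis on article titles, assuming Python 3 and the standard-library `collections.Counter`. The titles are made up for illustration, and `count_title_words` is a hypothetical helper name, not something from the course text.

```python
import re
from collections import Counter

def count_title_words(titles):
    """Count lowercase word occurrences across a list of article titles."""
    counts = Counter()
    for title in titles:
        # Lowercase first so "Libya" and "libya" are counted together.
        counts.update(re.findall(r"[a-z']+", title.lower()))
    return counts

# Made-up titles for illustration only.
titles = [
    "Libya minister defects to UK",
    "UK questions Libya minister",
    "Masters winner crowned",
]
word_counts = count_title_words(titles)
print(word_counts.most_common(3))
```

The most frequent words then serve as the guess for what is trending.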
Now, how we go about picking our news sites also requires some thought. For instance, if we only scrape one news site, we might miss some important words, because a single news site might only have one or two articles covering the same event. Then consider what might happen if we aggregated the frequency analysis of multiple news sites that covered different categories of news. Say we wrote two scrapers that aggregated the word counts of BBC News and of Entertainment News. Now we might run into a situation where meaningless words like prepositions and articles rise to the top, while important words like Libya and Bieber get lost.
Hence, it makes sense that when we choose our news sites, we first choose a single category of news to focus on, and then we choose multiple news sites which all belong to that category.
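
The effect described above can be seen in a toy example. The sketch below (Python 3) uses made-up titles and a hypothetical helper, `top_words`, to compare the top word when mixing two categories with the top word when combining two sources from the same category.

```python
from collections import Counter

def top_words(title_lists, n=3):
    """Aggregate word counts over several lists of titles and return the top n words."""
    counts = Counter()
    for titles in title_lists:
        for title in titles:
            counts.update(title.lower().split())
    return [word for word, _ in counts.most_common(n)]

# Made-up titles for illustration only.
world_a = ["Libya rebels advance in the east", "Libya minister defects"]
world_b = ["Libya talks stall", "The minister from Libya arrives in London"]
entertainment = ["Bieber tour dates in the news", "The Bieber film opens"]

mixed = top_words([world_a, entertainment], n=1)    # two different categories
focused = top_words([world_a, world_b], n=1)        # one category, two sources
print(mixed, focused)
```

In the mixed case the only words the two sources share are function words, so one of them wins. In the focused case the shared topic word accumulates counts from both sources.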
This brings us to your next job, which is to pick a category of news (e.g. politics, world news, entertainment news, something super creative and cool) and then choose a minimum of 3 news sites in that category with RSS feeds that you can scrape.
Step 4 – Scraping Your RSS Feeds
And this brings us to the last step, which is actually scraping the RSS feeds that you chose. Linked below is an example of a web scraper that I wrote to scrape BBC News’ RSS feed, which also shows you how to keep track of word counts and perform frequency analysis in Python:
http://dl.dropbox.com/u/5223068/bbc_example.py

For those who are new to web scraping and/or Python, I would carefully study this example (along with the ones in Chapter 2 of the text). In my example I provide numerous comments throughout my code to help walk you guys through the example, so hopefully there aren’t too many questions there.
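
As a rough illustration of the title-extraction step, and not a copy of the linked example, the sketch below parses a hand-written RSS 2.0 fragment. It uses Python 3's standard-library `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra installs, and it reads from a string instead of downloading a feed. `SAMPLE_RSS` and `extract_item_titles` are hypothetical names, and real feeds may nest or name their elements differently.

```python
import xml.etree.ElementTree as ET

# Hand-written RSS 2.0 fragment standing in for a downloaded feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>Libya minister defects to UK</title></item>
    <item><title>Masters winner crowned</title></item>
  </channel>
</rss>"""

def extract_item_titles(rss_text):
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    # Only <item> titles are collected; the channel's own <title> is skipped.
    return [item.findtext("title", default="") for item in root.iter("item")]

item_titles = extract_item_titles(SAMPLE_RSS)
print(item_titles)
```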
Once you are done writing all of your scrapers, one for each RSS feed (typically the structure of each RSS feed will vary a little, and so you will have to modify each of your scrapers to account for this), keep an aggregate count of which words appear most frequently, sort them (as shown in the example), and provide me with the top 50 words in your write-up. Finally, summarize your findings and include your source code in the appendix. Really stretch yourself in the summary! What do you think went well with your scrapers? What do you think went poorly? What improvements could you suggest and try?
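
The aggregation and sorting step might look roughly like this in Python 3. The per-feed counts are made up, and `aggregate_top_words` is a hypothetical helper. The linked example may sort differently, so treat this as one possible approach.

```python
from collections import Counter

def aggregate_top_words(per_feed_counts, n=50):
    """Sum per-feed word counts and return the n most frequent (word, count) pairs."""
    total = Counter()
    for counts in per_feed_counts:
        total.update(counts)  # adds counts rather than replacing them
    # Sort by count (descending), then alphabetically so ties are deterministic.
    return sorted(total.items(), key=lambda pair: (-pair[1], pair[0]))[:n]

# Made-up per-feed counts for illustration only.
feed_counts = [
    Counter({"libya": 3, "koussa": 2}),
    Counter({"libya": 2, "nato": 2}),
    Counter({"koussa": 1, "rebels": 1}),
]
top = aggregate_top_words(feed_counts, n=3)
print(top)
```

For the write-up, calling the helper with the default `n=50` on your real per-feed counts would give the top 50 list.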
Below is a quick screenshot of what my BBC News scraper output, along with the trending topics that I received from Twitter. Note that “Moussa Koussa” was one of the trending topics at the time that I finalized this assignment; the highlighted frequent words relate to Moussa Koussa, the Libyan foreign minister who recently defected to the UK.

sample_trend_scraper.png
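
One rough way to automate part of this comparison is to check which frequent words literally appear in a trend's name. The sketch below (Python 3) uses a hypothetical helper, `words_matching_trends`, and illustrative inputs. The word list is made up and is not taken from the screenshot. Note that this only catches exact word matches: a related word such as "libya" would not be linked to "Moussa Koussa" without further work.

```python
def words_matching_trends(trends, top_words):
    """Map each trend to the frequent words that appear in its name."""
    top = {word.lower() for word in top_words}
    matches = {}
    for trend in trends:
        # Drop a leading hashtag and split multi-word trends into tokens.
        tokens = trend.lower().lstrip("#").split()
        matches[trend] = sorted(token for token in tokens if token in top)
    return matches

# Illustrative inputs only.
trends = ["Moussa Koussa", "#masters"]
top_words = ["libya", "koussa", "minister", "moussa"]
matches = words_matching_trends(trends, top_words)
print(matches)
```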