Andreas Weigend, Social Data Revolution | MS&E 237, Stanford University, Spring 2011 | Course Wiki


MS&E 237 Homework 3 – Facebook User Analysis (Part A)
PDF Version: http://dl.dropbox.com/u/5223068/mse237_hw3a.pdf
Due Date: May 3rd, 2011 (By Noon)
Learning Goals:

· Learn about what clustering is and think about how you would cluster groups of Facebook users
· Think about the notion of people discovery and how you would recommend users to each other

Submission Details: Respond to the Google Form located at http://bit.ly/SDR2011_HW3A.
Assignment Details:
In the previous assignment, we focused our attention primarily on edge relationships in a user's Twitter social network. Namely, by looking at the nature of a user's relationship with his/her followers, and potentially even the relationships of those followers with their followers, we tried to home in on a specific set of quantifiable variables and ultimately derive a model for user influence.
In this assignment, we want to change things around and shift our focus to node analysis. More specifically, whereas in Twitter a single user's node is rather "empty" (i.e. outside of their 140-character tweets there isn't much actual content to analyze), in Facebook a person's node contains a great wealth of information, so we have not only much more to analyze but also much more to learn from.
Step 1
First, complete the following steps:
  1. Go to facebook.socialdatarevolution.com.
  2. Register for an account.
  3. After registration, you should be logged into the website.
  4. Click on Connect Your Facebook Account and accept the permissions.
  5. After linking your Facebook account, click on the "My Connected Webservices" tab.
  6. Click on Synch Facebook Data.
  7. Once it says FILE WRITTEN, you're done.
  8. Screenshots for these steps are provided in the PDF.

We need everyone to complete this as quickly as possible! We've created a Facebook Platform app that allows us to collect each student's data as he/she registers. From there, we will completely anonymize each student's data (i.e. stripping away names, UIDs, etc.) so that you can actually play with this data in the next part of the assignment!
Just to be very clear, the data we are collecting from each of you is:
· Your profile information (i.e. birthday, hometown, sex, religious views, political views, etc)
· Your work/education information
· Your friends
· Your profile feed and its contents
· All links you shared
· Your “likes” (i.e. any web pages, pictures, comments, etc that you “liked”)
· Your groups & upcoming events
· Your photos, videos, notes, and messages
· Your check-ins
And again, to protect everyone, we are committed to anonymizing each student's data: we strip away all names and replace user IDs/group IDs/event IDs/etc. with randomized IDs (this preserves any potential relationships), and the data will be promptly deleted on our side after the assignment is completed. More specifically (for those who are really interested), these are the Facebook permissions we are requesting:

offline_access, status_update, email, user_status, user_about_me, user_activities, user_birthday, user_education_history, user_website, user_hometown, user_location, user_interests, user_relationships, user_relationship_details, user_religion_politics, user_work_history, user_likes, user_photos, user_videos, user_events, user_notes, user_groups, user_checkins, friends_checkins, friends_birthday, friends_education_history, friends_hometown, friends_location, read_stream, publish_stream

And this is the corresponding page explaining what each permission means and exactly what fields are returned for it:
https://developers.facebook.com/docs/authentication/permissions/

Step 2
Once you've registered for the platform, we now want you to think about a couple of questions, and have these questions in mind before going into the second part of this assignment. First, for those who are unfamiliar with clustering, we recommend that you take some time to learn about hierarchical clustering and k-means clustering (Wikipedia is fine) - mainly what they do and how they differ from each other. In terms of questions, we'd like you to think about and answer the following:
1. Consider the following scenario: I belong to a marketing firm that distributes product catalogues, and I approach you with a problem. From past research I know that it is sub-optimal to distribute just one catalogue to everyone, so given our class of 50-some students, I want you to give me 3-4 clusters (i.e. groups) of students so that I can send out 3-4 different product catalogues accordingly. This marketing strategy is known as market segmentation (http://en.wikipedia.org/wiki/Market_segmentation). Now, given this task as well as the Facebook data described above, how would you go about producing such clusters? More specifically, what inputs/variables would you want to include in your model? Finally, how would you evaluate the strength of your clusters?
To help your thinking, consider the following. When including variables in our model, we want to compare the conditional distribution of each variable within our clusters with the unconditional distribution of that variable. What we mean is this: suppose we have 40 guys and 20 girls in the class (a 2:1 ratio), and we want to know whether the variable sex is an important one in our clustering algorithm. Say one of our clusters has 15 students, of whom 10 are guys and 5 are girls (again a 2:1 ratio). In this case, the conditional and unconditional distributions are the same, so the input variable sex was not very useful. However, if one cluster had 14 girls and 1 guy, and a different cluster had 24 guys and 1 girl, then we could posit that the input variable sex is actually extremely significant. Furthermore, in this situation, not only do we get a sense of the variable's importance to our model, but it also helps us interpret our clusters (i.e. one is a cluster of girls; the other is a cluster of guys). Again, in clustering, it is not enough to just produce clusters - it is equally important to be able to characterize/interpret them! Keep all of this in mind as you think about your clustering methodology.
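To make this conditional-versus-unconditional comparison concrete, here is a minimal Python sketch (with made-up labels, not actual class data) that computes both distributions for each cluster:

    from collections import Counter

    # Hypothetical data: one sex label and one cluster label per student.
    sex = ['M'] * 40 + ['F'] * 20                # 2:1 ratio overall (unconditional)
    clusters = [0] * 15 + [1] * 25 + [2] * 20    # made-up cluster assignments

    def distribution(labels):
        # Relative frequency of each value in `labels`.
        counts = Counter(labels)
        total = float(len(labels))
        return dict((k, v / total) for k, v in counts.items())

    unconditional = distribution(sex)
    for c in sorted(set(clusters)):
        members = [s for s, cl in zip(sex, clusters) if cl == c]
        print('cluster %d: conditional=%s unconditional=%s'
              % (c, distribution(members), unconditional))

If a cluster's conditional distribution matches the unconditional one, the variable did little to separate that cluster; the further apart the two distributions are, the more that variable characterizes the cluster.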
2. Another interesting thing to look at is people discovery - namely, given a user A and a set of users B, which users in B would you recommend to A? (Facebook does this already when it recommends potential friends to you.) To continue our marketing analogy, let's say you do a super job with the last task, and this time I come to you with the following problem: I now have very limited resources and can only send my product catalogue to a limited number of students (say 5-6 students in the class). How do I figure out which students to send it to (in marketing, this is known as targeting: http://en.wikipedia.org/wiki/Target_market)? Again, what inputs/variables would you want to include in your model to compute a recommendation score for each student in the class (the higher the score, the stronger the recommendation)? What potential uncertainties might you encounter when planning your target recommendation algorithm, and how do you plan to resolve them? Also, think about how you would evaluate your rankings. In other words, if I gave you two black-box algorithms for computing target recommendation scores, how would you determine that box A gave better target recommendations than box B?

These questions are also repeated for you on the Google Form (which is what you need to submit).


MS&E 237 Homework 3 – Facebook User Analysis (Part B)
PDF Version: http://dl.dropbox.com/u/5223068/mse237_hw3b.pdf
Starter Code: http://dl.dropbox.com/u/5223068/hw3_starter_code.py
Due Date: May 12th, 2011 (By Noon)
Learning Goals:
· Learn about what clustering is and think about how you would cluster groups of Facebook users
· Think about the notion of people discovery and how you would recommend users to each other
· Practice implementing these metrics/algorithms in Python

Data:
http://dl.dropbox.com/u/5223068/newdata.zip

Submission Details:
Please submit a PDF/DOC by email to mse237@gmail.com with subject line "HW3 – [YOUR SUNET ID]". In your PDF/DOC, please include the following:

· A summary of your clustering model for Step 1 – make sure to include what inputs/variables you used and why you chose those inputs/variables.
· A plot of your clusters, color coded by cluster, and with each point representing a student in the class.
· A summary of your methodology for Step 2 – make sure to include what inputs/variables you chose when computing your recommendation score.
· All code (regardless of language) that you used to compute your clusters and/or scores.

Also, in that email, attach a 51x51 matrix in CSV form containing the pairwise student recommendation scores.

Assignment Details:
In the first part of the assignment, we asked you to think about (from the point of view of a marketing firm) how you would use Facebook inputs/variables in a cluster model to form 3-4 different clusters of users for the purpose of market segmentation. We then changed up the scenario a little and asked you to think about target marketing – namely, how you would serve user recommendations for a given product.
Now we’re going to make things a little more concrete, and give you specific scenarios (no more ambiguities) and ask you to actually go ahead and implement some of your ideas using real Facebook data (i.e. your own).
Step 1 (4 Points)
I’m a marketing rep from Microsoft’s Bing search engine division. The Bing group is partnering with Facebook to send out 3 different emails to 3 different groups of users. The first email will ask people to try out Bing’s new social search; the second email will ask people to try out Bing’s new iPad optimized app; and the third email will simply ask people to try out Bing as an alternative search engine to Google.
Previous marketing research has shown that the best way to differentiate these 3 groups of users is by looking at Facebook engagement. Here is how we define engagement: someone who posts to his/her Facebook wall three times a day, comments on five people's photos, and likes ten links is considered highly engaged; someone who spends five minutes on Facebook every week just looking at a couple of friends' photos is considered not very engaged.
Furthermore, analysts at Bing believe that the most effective way of performing market segmentation is by sending the first email (i.e. social search) to the cluster of users that are most engaged with Facebook; then sending the second email (i.e. iPad app) to the cluster of users that are regularly engaged; and finally sending the third email (i.e. simple invite to use Bing) to the cluster of users that are least engaged. Your job is now to come up with a list of inputs/variables for your clustering model that at the end will produce 3 such clusters (i.e. highly engaged, regularly engaged, not very engaged). If, however, your model does better with 4 clusters, that’s fine too – as long as there are more than 2 clusters and fewer than 5, that’s okay for Bing’s purposes.
Finally, you will use the k-means clustering algorithm (http://en.wikipedia.org/wiki/K-means_clustering), and you will need to submit your code for the clustering (any programming/statistical language is fine), a list of the inputs/variables you used in your model, and a visualization of the results of your clustering model (i.e. either a 2D or 3D plot showing your 3-4 clusters of users, preferably color-coded by cluster).
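If you happen to work in Python, here is a minimal sketch of what the k-means step might look like using SciPy and matplotlib (the feature matrix and feature names below are hypothetical placeholders - substitute whatever engagement variables you actually extract):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.vq import whiten, kmeans2

    # Hypothetical features, one row per student:
    # [wall posts/day, comments/day, likes/day]
    features = np.array([[3.0, 5.0, 10.0],
                         [0.1, 0.0, 0.2],
                         [1.0, 2.0, 3.0],
                         [2.5, 4.0, 8.0],
                         [0.2, 0.1, 0.0],
                         [1.2, 1.5, 2.0]])

    # Whitening rescales each feature to unit variance so that no single
    # variable dominates the Euclidean distance.
    obs = whiten(features)

    # k=3 clusters: highly / regularly / not very engaged.
    np.random.seed(0)                        # reproducible initialization
    centroids, labels = kmeans2(obs, 3, minit='points')

    # 2D plot of the first two features, color-coded by cluster.
    plt.scatter(obs[:, 0], obs[:, 1], c=labels)
    plt.xlabel('wall posts/day (whitened)')
    plt.ylabel('comments/day (whitened)')
    plt.savefig('clusters.png')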
Step 2 (6 Points)
Now, let's say I'm Blekko (http://blekko.com/), a hip, up-and-coming search engine, and after months of hard work we are finally ready to launch our beta product! We want to keep the number of beta users to a minimum, but at the same time we want our small pool of beta users to be a representative sample of our entire future user base - namely, we want a group of users who span the spectrum of technical competency and who will approach search from many different angles. For those scratching their heads about why Blekko might want a group of users like this, think about it this way: we want to make sure our new search engine is intuitive and usable for all sorts of users - not just programmers and web geeks.
So, our question to you is: given a user A in our class of 52 students, how would you pick a handful (say 5-6) of students to recommend to us as "quality beta users"? It's important to remember that this problem is NOT exactly the same as the previous one (though some might interpret it as such). Whereas in the previous problem we were strictly creating clusters based on similarity, this problem is about people discovery - oftentimes a good recommendation is not the person most similar to you (though it can be).
Just to get everyone thinking on the right track, some things to consider might be geographic segmentation (i.e. rewarding two people from very far-apart hometowns while penalizing two people from the same hometown), socio-economic/demographic segmentation (i.e. rewarding two people of different gender, ethnicity, etc.), or, going back to Facebook, maybe even rewarding two users with very different numbers of friends or very different Facebook usage behaviors. I would highly recommend reading the Wikipedia article on Target Marketing (http://en.wikipedia.org/wiki/Target_market) to continue generating potential inputs/variables for your recommendation score algorithm.
What you need to submit at the end is this: a 51x51 matrix as a CSV file, where each row represents a user, sorted from lowest ID to highest ID (should be user 143688 through user 9786674), and similarly where each column represents a user, sorted from lowest ID to highest ID. You will then calculate a recommendation score for each pair of users so that your entire 51x51 matrix is filled out with scores. As for the scores themselves, normalize them so that each row of your matrix (i.e. each user) has scores that are normally distributed with mean 0 and variance 1, i.e. N(0,1), where a higher score represents a stronger recommendation (also see the FAQ below). Just to clarify: entry (i,j) of your matrix should be the normalized recommendation score of user j with respect to user i - in other words, how strongly you would recommend user j to user i. Finally, since the scores are normalized to be Gaussian, there is no absolute definition of what a score of 1.5 means; all we know is that a score of 1.5 represents a stronger recommendation than a score of -1.5, etc.
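As a concrete illustration of the row normalization, here is a minimal NumPy sketch (the random matrix stands in for your real scores; subtracting the row mean and dividing by the row standard deviation guarantees mean 0 and variance 1, which is what we're after - see the FAQ below):

    import numpy as np

    raw = np.random.rand(51, 51)   # placeholder for your computed scores

    # Standardize each row: subtract the row mean, divide by the row std.
    row_means = raw.mean(axis=1)[:, np.newaxis]
    row_stds = raw.std(axis=1)[:, np.newaxis]
    normalized = (raw - row_means) / row_stds

    np.savetxt('recommendations.csv', normalized, delimiter=',')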

Just for the sake of completeness, I just sketched a simple diagram of what your end matrix will look like:

[Image: sample_matrix.png]
EVALUATION
Finally, I'll briefly discuss how you will be evaluated on this second part. Basically, I'm going to create a "master" matrix, which will be a weighted average of my recommendation matrix (25%) and the rest of the class's recommendation matrices (75%). Then, for each user, I'm going to pick the highest 5 scores (i.e. the top 5 recommendations) and use those as the baseline for that user.

Then, for each of your recommendation matrices, I'm going to look at, for each user, what scores YOU gave for those five recommended users and compare them with the class-weighted average scores computed above. I will calculate a squared-error term for each user and then sum these terms over all 51 users, which gives your final total squared error.

Clearly, the farther your scores are from the class-weighted average scores, the larger your total squared error will be. At the very end, I'll rank all students from lowest total squared error to highest, curve those scores out of 8 (implying that the top performers will get a little extra credit, as this part of the assignment is only worth 6 points), and that will be your grade!
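For clarity, here is a minimal sketch of my understanding of that procedure (the names are mine, and this is not the official grading code):

    import numpy as np

    def total_squared_error(student, staff, class_matrices):
        # Master matrix: 25% staff matrix, 75% class average.
        class_avg = np.mean(class_matrices, axis=0)
        master = 0.25 * staff + 0.75 * class_avg
        error = 0.0
        for i in range(master.shape[0]):
            top5 = np.argsort(master[i])[-5:]   # indices of the 5 highest scores
            error += np.sum((student[i, top5] - master[i, top5]) ** 2)
        return error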

The People-Mixer Evaluation Algorithm
Just to preface, and to be very clear on this: the results of the people-mixer algorithm will now be used solely for extra credit, so they will not affect your grade (you can take a breather now).
Basically, the algorithm takes as input the class's pool of recommendation matrices and outputs, for each student, a list of 10 other students in the class. Here is how we build that list: we randomly select 6 of your peers, and for each peer we add to your list of 10 the name of the user with the highest recommendation score (in other words, we take the highest-ranking user from each of the 6 randomly selected peers, giving you 6 high-ranking recommendations). Then, for the remaining 4, we randomly select 4 students in the class, so that at the end you'll have a total of 10 students: 6 who were highly recommended to you by others, and 4 who were completely arbitrary.
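In rough Python, the list construction might look like the sketch below (I'm reading "the user with the highest recommendation score" as the top score in the peer's row for you; `students` and `matrices` are hypothetical names, with index i in `students` matching row/column i of every matrix):

    import random
    import numpy as np

    def mixer_list(you, students, matrices):
        me = students.index(you)
        others = [s for s in students if s != you]
        picks = []
        # 6 random peers each contribute their top-scored user for you.
        for peer in random.sample(others, 6):
            row = matrices[peer][me]             # peer's scores for `you`
            for j in np.argsort(row)[::-1]:      # highest score first
                if j != me and students[j] not in picks:
                    picks.append(students[j])
                    break
        # 4 more students drawn completely at random from the rest.
        remaining = [s for s in others if s not in picks]
        picks.extend(random.sample(remaining, 4))
        return picks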
Then, after class on Thursday during the people-mixer, you'll get a chance to grab some free drinks and actually go look for and meet those 10 people. By the end of the mixer, the goal is to have talked to as many of your peers as possible, to reflect on and gauge how well those brief meetings went, and then, on your sheet of paper, to rank them from 1 to 10, with 1 being the worst recommendation and 10 being the best.
Of course, we will keep track of who recommended whom to whom so that we can backtrack and see which students' recommendations got high marks and which got low marks. From there, we'll award the top 5 students 2 extra credit points on the assignment and the next 10 students 1 extra credit point, and give 0 to the rest of the class.
Appendix – How I Anonymized Your Data
And finally, for the sake of completeness, I'll talk briefly about how I anonymized your data, so that you know exactly what I took out and what I didn't. Basically, I ran a script that went through each of your Facebook data files and removed all names, usernames, personal web pages, all picture links, etc. - basically every field that could potentially divulge who a user is.
Then, for the IDs, in order to preserve relationships, I didn't remove the IDs but instead scrambled each person's IDs with a random number generator. As a small example of how relationships are preserved: say you are user 500, and your friend, user 1500, left a comment on your wall and also liked one of your posts. Instead of the data showing that user 1500 left user 500 a comment and that user 1500 liked one of user 500's posts, it'll show that user 9305 left user 1025 a comment and that user 9305 liked one of user 1025's posts. The fact that one user interacted with the other twice is still visible, but you have no way of recovering the true IDs of these two users.
Basically, just treat the IDs as if they were each user's real IDs - but know that if you try to use the Graph API to look up any of these IDs, it'll probably either fail or show you something random.
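If you're curious what such a scrambling step looks like, here is a minimal sketch of the idea (not my actual script): the key property is that the real-to-fake mapping is consistent, so every occurrence of the same real ID maps to the same fake ID and relationships survive:

    import random

    mapping = {}   # real ID -> fake ID
    used = set()

    def scramble(real_id):
        if real_id not in mapping:
            fake = random.randint(1, 10 ** 7)
            while fake in used:                 # keep fake IDs unique
                fake = random.randint(1, 10 ** 7)
            mapping[real_id] = fake
            used.add(fake)
        return mapping[real_id]

    # User 1500 interacts with user 500 twice; both events map consistently.
    print(scramble(1500), scramble(500))   # e.g. 9305 1025
    print(scramble(1500), scramble(500))   # same fake IDs again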
Finally, what I wasn't able to anonymize were appearances of your names in text content. For instance, your "about me" field might say "Hi my name is jwei …", and without some serious NLP there's no way for me to remove occurrences of names and nicknames from wall messages, wall comments, etc. In a perfect world, I'd be able to replace names with random IDs as well, but imagine a scenario where someone posts "Hiiii Liiiizzz" and then another person posts "Elizabeth we need to catch up" – it's pretty much impossible to catch these things programmatically.
Hope all this makes sense. At the end of the day, the data is anonymized to the point where most people will not be able to programmatically figure out the real identity of each user, and for those awesome hackers who somehow find a loophole around my script - well, I'm sure you have much better things to do than mess around with this tiny data set.



FAQ (Updated May 10)
How will I be graded on the second part of this assignment?
To evaluate your recommendation scores, the teaching staff decided to run both the original algorithm (as described in the Evaluation section) and the in-class mixer ranking algorithm. However, as mentioned above, because the in-class mixer ranks will only be used for extra credit, they will not affect your score. Instead, the 6 points you'll receive for the second part of this assignment will be based on the mean squared error as calculated by the original algorithm, as well as on your write-up.
Are we supposed to create a similarity matrix, or a difference matrix?
Neither. I think both Andreas and I were pretty consistent on this point in our explanations: people discovery is not as simple as choosing between a similarity matrix and a dissimilarity matrix. The example I gave several students who asked was this: if I just looked up "Harry Potter 7" on Amazon and all the recommendations Amazon returned were "Harry Potter 6", "Harry Potter 5", etc., I would consider that a pretty poor item discovery algorithm. On the flip side, if Amazon's recommendations for me were a Barney children's book, Tom Clancy's "Rainbow Six", and some erotic adult novel, I would also consider that a pretty poor item discovery algorithm, as none of those books are even remotely related to Harry Potter. And so people (as well as product) discovery is meant to be a real-life challenge that massages similarities and dissimilarities together. How do you do that exactly? I couldn't tell you, and no one today can definitively tell you, but we just want you to think about this problem, do your best, and hopefully have some fun at the same time.
What do you mean by normalize each row of the matrix?
Again, apologies for the earlier ambiguity - I was not aware that there were so many interpretations of "normalize"! What I mean by normalize is to take your data and convert it to data that follows a normal distribution, and in our case we want the standard normal distribution (which has mean 0 and variance 1). How do you do this? First subtract out the mean of your data (which centers your data around 0), and then divide by the standard deviation of your data (which scales the variance down to 1).
Can we use our late days on this assignment?
Yes, absolutely. Andreas was simply hoping that students would have this assignment done on time so that the people-mixer after class on Thursday could be more fun, enjoyable, and exciting for everyone, since your results and your matrix would then become part of the pool of matrices that we'll use to randomly construct the 10 recommendations for each student (see above for the explanation of the people-mixer algorithm). But again, your late days are yours to use as you wish, so if you absolutely can't finish by Thursday, then please take your time and do a great job!

In terms of what I need to submit, did anything change after today's class updates?
No. What you'll submit is exactly the same - all we did was propose an additional, fun way of testing your recommendations in the wild: you'll get a chance to actually meet the people your peers recommended to you and see how well your peers did. (And yes, we understand that this perspective on 'people discovery' isn't entirely consistent with the Blekko analogy given earlier, but hopefully, since this is extra credit, you'll bear with us.) So, for now, just focus on building your algorithm, building your matrix, and submitting that. Leave the rest to us and don't freak out about the people-mixer thing!

What kinds of software/programming languages can I use to do the k-means / data analysis?
You can use essentially whatever software or programming language you want, but the ones I would recommend for k-means / future data analysis are R or MATLAB. R is free to download and has a built-in k-means function (http://stat.ethz.ch/R-manual/R-patched/library/stats/html/kmeans.html) that I'm very familiar with, along with built-in cluster plotting. MATLAB also has a built-in k-means function and plotting tools. Alternatively, some students have suggested SciPy, a Python module for data analysis tasks such as k-means clustering.

Are there timestamps in the Facebook data that I can use? Would you recommend using them for the engagement analysis?
Yes, and yes. The timestamps are located in the "created_at" field, and every wall post, like, comment, etc. should have a timestamp as a field.
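As a small illustration, here is a minimal Python sketch of turning "created_at" timestamps into a posts-per-day engagement feature (the timestamp strings are hypothetical; adjust the format string to match whatever your data files actually contain):

    from datetime import datetime

    timestamps = ['2011-04-01T12:30:00+0000',   # hypothetical created_at values
                  '2011-04-02T08:15:00+0000',
                  '2011-04-28T19:05:00+0000']

    parsed = [datetime.strptime(t, '%Y-%m-%dT%H:%M:%S+0000') for t in timestamps]
    span_days = (max(parsed) - min(parsed)).days or 1   # avoid dividing by zero
    posts_per_day = len(parsed) / float(span_days)
    print('posts per day: %.2f' % posts_per_day)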


Notable Solutions to Part A


Input Variables

Passive Presence
  • F = total number of friends. A user that has a high number of friends is well-connected on Facebook.
  • I = information quotient. This value indicates the amount of personal information a person has contributed to his or her profile.
  • G = total number of groups to which a person belongs
Active Presence
  • P = average number of posts made per day. This number includes comments and wall posts.
  • L = average number of links shared per day. When a user shares a link on Facebook, the user is contributing to the forum.
  • PL = average number of page likes per day. The act of liking a page must be preceded by the user browsing pages on Facebook and then selecting the page the user most approves of, so a single page like suggests that the user has browsed many more pages.

[Image: ex_cluster1.png]
Input Variables
  • Number of page_likes per month, recently
  • Number of profile_feed posts per month, recently
  • Number of self-posted profile_feed posts per month, recently
  • Number of links shared per month, recently
  • Number of friends

[Image: ex_cluster2.png]

Top-Ranking Students by MSE


[Image: hw3_top_results.png]

Here the errors are the total squared errors divided by the total number of students (i.e. the average MSE per student to whom you recommended 5 people).