Andreas Weigend, Social Data Revolution | MS&E 237, Stanford University, Spring 2011 | Course Wiki

Class_05: Web 1.0, Search and Recommendations

Date: April 12, 2011
Audio: weigend_stanford2011.05_2011.04.12.mp3
Initial authors: Mike Reilly, Elliot Babchick, Sean Rose


  • Homework #2 on influencers will be given out next class. It will consist of two parts: 1) defining an influencer and 2) implementing an algorithm to measure influence
  • The author of the textbook, Matt Russell, will be here next week to field any feedback or questions you may have as we move into implementing the metrics you come up with over the next week
  • The class account only has around 300 followers, which isn’t great. The more followers, the better the homework will be. Think about how to get people to follow us.


Homework #2 - Influencers

The next assignment will consist of two parts:
1. What is an influencer? Come up with a definition and some metric to measure
  • Just a person w/ many followers? A person with many friends?
  • A person with lots of retweets? Maybe this is a better measure, because retweeters are voting with their own reputation when they rebroadcast what someone has produced
2. Implement an algorithm to measure

Some sites already try to measure your overall influence score, based upon:
  1. True Reach
    • Factors measured: Followers, Mutual Follows, Friends, Total Retweets, Unique Commenters, Unique Likers, Follower/Follow Ratio, Followed Back %, @ Mention Count, List Count, List Followers Count.
  2. Amplification Probability
    • Factors measured: Unique Retweeters, Unique Messages Retweeted, Likes Per Post, Comments Per Post, Follower Retweet %, Unique @ Senders, Follower Mention %, Inbound Messages Per Outbound Message, Update Count.
  3. Network Influence
    • Factors measured: List inclusions, Follower/Follow Ratio, Followed Back %, Unique Senders, Unique Retweeters, Unique Commenters, Unique Likers, Influence of Followers, Influence of Retweeters and Mentioners, Influence of Friends, Influence of Likers and Commenters.
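As a starting point for the homework, here is a minimal sketch of a hand-rolled influence score. The weights and factor choices are assumptions for illustration; this is not any site's actual formula.

```python
# Hedged sketch: one possible "influence" metric combining reach and
# engagement. All weights below are made-up assumptions, not a real formula.
def influence_score(followers, following, retweets, mentions):
    """Combine reach (follower/follow ratio) with engagement signals."""
    reach = followers / (1 + following)       # penalize follow-everyone accounts
    engagement = retweets + 0.5 * mentions    # retweets weighted above mentions
    return reach * engagement

# Made-up users for illustration
users = {
    "alice": dict(followers=1000, following=100, retweets=50, mentions=20),
    "bob":   dict(followers=5000, following=5000, retweets=5, mentions=2),
}
ranked = sorted(users, key=lambda u: influence_score(**users[u]), reverse=True)
```

Ranking with a score like this immediately surfaces the definitional question above: alice, with fewer followers but much more engagement per follow, outranks bob.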
Good papers and readings on influencers
  • Duncan Watts - basically started the field of study on influencers
    • Can find all of his articles on his website at Yahoo Research here
    • Interesting article on whether marketers should target influencers or the general populace
    • Is it homophily (similarity) or influence? Reframing of correlation and causation
    • Are people influencing or just happen to be doing the same stuff?
    • In science, they get around this with controlled experiments/groups
    • In testing an e-business site, you don’t know whether the effect is that it's Saturday or a design change, but if the change is given to half the users, you can isolate the effect and know that there is some causality
    • Higher conversions after April 15th, holidays, etc. may be due to a time factor rather than a change in the program
    • Selling red and black umbrellas in New York, and if it rains on the day you’re selling black umbrellas, you might incorrectly attribute sales to color.

  • Sinan Aral - currently at MIT
    • Train your models on a certain subset of the data, “training data”. Test against existing data.
    • At the essence: making predictions. If one person tweets, what’s their probability of being retweeted?
    • Research Includes
      • How information flows impact information worker productivity
      • How information diffusion in massive online social networks influences demand patterns, consumer e-commerce behaviors and word of mouth marketing
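The train-on-a-subset, test-on-held-out-data idea can be sketched in a few lines. The data and the base-rate "model" below are invented for illustration; this is not Aral's actual method.

```python
import random

# Sketch of the train/test split idea: fit on "training data", evaluate on
# held-out data. The "model" here is just a base-rate estimate of the
# probability that a tweet gets retweeted (a made-up baseline).
random.seed(0)
tweets = [{"retweeted": random.random() < 0.3} for _ in range(1000)]

random.shuffle(tweets)
split = int(0.8 * len(tweets))            # 80% train, 20% test
train, test = tweets[:split], tweets[split:]

# "Fit": estimate retweet probability from training data only
p_retweet = sum(t["retweeted"] for t in train) / len(train)

# "Test": how often does predicting the majority class match held-out data?
accuracy = sum((p_retweet > 0.5) == t["retweeted"] for t in test) / len(test)
```

Any richer model (per-user features, time of day, etc.) slots into the same harness; the point is that it must only ever see `train`.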

What Were The Technologies and Mindsets of Web 1.0?

  • Web 1.0 is the first of several "phases" in the history of the internet, running roughly from 1993 to 2001. It predates Web 2.0, which is defined by the social web; Web 1.0 instead describes a web dominated by linking structures and content aggregators
    • A good graph depicting the phases and relevant concepts from The Paisano (image: webtimeline.jpg, "Evolution of Web 1.0, 2.0 and 3.0")
  • Infrastructure - TJ Watson predicted that the whole world would need only 3 or 4 computers. This was later updated to maybe one per household, so people could keep their recipes.
    • "I think there is a world market for maybe five computers." - Thomas Watson, chairman of IBM, 1943
    • "There is no reason anyone would want a computer in their home." - Ken Olson, president, chairman and founder of DEC
  • Search - if searching for recipes on your home computer
    • Could go file by file and look for the keyword. This is extremely costly in terms of time and computing power because every time you search you have to scan the entire database
    • Or could build an index - basically a big file with effective page numbers, URLs, or filenames where each keyword is located. Then the question becomes how do you build this index?
    • Crawl - keep following links across the web
    • Now the problem shifts: instead of looking for one recipe, you have all the websites on the web, so the question moves from search to relevance
    • How to measure relevance? Is someone clicking on it? But is it a short click (they leave immediately) or a long click, where the user actually stays on the site for a while?
    • You can measure relevance by the number of links to a given site
      • This might be a good measure because while you can control the links out of your site, you can't really influence the links in, so it's harder to game
      • This eventually became Google's PageRank algorithm, where a site's relevance ranking is determined by the sites that link to it
      • Image: Google PageRank illustration from PRLog.org
      • Initially Google's PageRank algorithm depended on the concept of inbound and outbound links, but also on the then-known concept of TF.IDF: term frequency times inverse document frequency. This helps determine which words are important on a web page. TF.IDF was then used to create doc profiles, essentially the set of words with the highest TF.IDF scores, together with their scores
      • Google's idea of using inbound links as a way to determine relevance also came under attack when link farms began to be created. The key to overcoming this issue is to perhaps look for second order effects - wherein the inbound links also have several relevant inbound links as well. Here's an interesting article about link farms, and how companies like JC Penney were able to exploit this weakness of Page Rank.
      • Google attempted to counter this attack by crowdsourcing spam removal.
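The index-plus-TF.IDF idea above can be sketched with a toy corpus. The documents are made up and the scoring is the plain unsmoothed formula, so treat this as an illustration rather than what any real engine ships.

```python
import math
from collections import Counter, defaultdict

# Toy corpus: the "recipes on your home computer" example from class.
docs = {
    "pancakes.txt": "flour eggs milk sugar flour",
    "omelette.txt": "eggs butter salt",
    "cake.txt":     "flour sugar butter eggs sugar",
}

# Build the inverted index once: keyword -> set of files containing it.
index = defaultdict(set)
for name, text in docs.items():
    for word in text.split():
        index[word].add(name)

def tf_idf(word, name):
    """Term frequency in one doc times inverse document frequency."""
    words = docs[name].split()
    tf = Counter(words)[word] / len(words)
    idf = math.log(len(docs) / len(index[word]))
    return tf * idf

# Lookup is now a dictionary access instead of a scan of every file:
hits = index["flour"]
```

A word appearing everywhere (like "eggs" here) gets IDF 0, which is exactly the "which words matter on this page" filtering the doc profiles rely on.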
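The link-based relevance idea can likewise be illustrated with a toy PageRank power iteration. The three-page graph is invented, and 0.85 is the commonly cited damping factor rather than a number from the lecture.

```python
# Toy PageRank: a page's score comes from the scores of the pages linking
# to it, split across their outbound links. Graph is made up.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}

d = 0.85  # damping factor (commonly cited value)
for _ in range(50):  # power iteration until approximate convergence
    new = {p: (1 - d) / len(pages) for p in pages}
    for src, outs in links.items():
        for dst in outs:
            new[dst] += d * rank[src] / len(outs)
    rank = new
# C ends up highest: it receives links from both A and B.
```

This also shows why second-order effects defeat naive link farms: a link only transfers as much score as its source page itself has earned.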


  • Eventually the problem shifted again: someone needed to know what to search for, so the web moved from search to discovery and serendipity (a "Web 1.1" of sorts)
  • If you're searching, you know when the process is done - if looking for Domino's phone number, you know you've got it once you've called Domino's
  • If you're discovering, however, you don't really know when you're done or necessarily what you're looking for. For example, on a dating site you may not be done until you're married, since you can keep discovering
  • There is no one great metric for discovery
  • The guy at Facebook who does "People You Might Know" faces an issue of measuring his effectiveness. The algorithm could put up someone everyone knows and everyone will want to friend but that person probably derives no value from having a new friend so they have to balance that tradeoff.
  • The class decided that we would prefer to hear from someone who does the algorithm for relevancy for NewsFeed over People You Might Know

Time vs. Relevance

  • In doing analysis, you could take data that is very fresh and very immediate, but there probably won't be much of it; or you can take data over a much longer period, giving a richer data set that may hide recent influences or developments
  • For example, searches over the past month on nuclear safety would be incredibly high relative to other topics, whereas the past ten years would look very different
  • Google Trends analysis for nuclear safety - highlights that your time data window is very important
  • This tradeoff is effectively unsolvable so needs to be reflected back to the user to decide
  • Topsy - really good at real-time search
    • Continuing the nuclear safety example: a Google search for "nuclear safety" yields some news articles but also Wikipedia, government agencies, etc., whereas a Topsy search is essentially all news articles from that day or very recent days. You can change the time period you want to search over on the left side
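One common way to reflect the time-vs-relevance tradeoff back to the user is an explicit recency weight with a tunable half-life; the half-life is the knob the user turns. The formula and numbers below are illustrative assumptions, not how Topsy or Google actually score.

```python
# Sketch: exponentially down-weight older matches. The half_life_days
# parameter is the user-facing "time window" control; values are made up.
def recency_weight(age_days, half_life_days):
    """Weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

def score(base_relevance, age_days, half_life_days=7):
    return base_relevance * recency_weight(age_days, half_life_days)

# A strong 30-day-old article vs. a mediocre fresh one: who wins depends
# entirely on the half-life the user picked.
old = score(0.9, age_days=30, half_life_days=365)   # long window: old wins
fresh = score(0.5, age_days=0, half_life_days=365)
```

With a one-year half-life the older, stronger article outranks the fresh one; with the one-week default it loses badly, which is exactly why the tradeoff has to be surfaced rather than solved.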

Bruce D'Ambrosio

  • Was working in e-commerce and recommendations while crashing on Andreas’s floor. Has since sold his company Cleverset – “best recommendation engine ever written” in Web 1.0
  • Bruce was a professor at the University of Oregon, consulted to …
  • Has now moved from product recommendations to content recommendations
  • He's asking the question: What’s the relevant data around doing a search?
  • No real distinction between search and recommendations or between recommendations and page construction. The question in all of them is how to choose the piece of content to display
  • Now not just internal site content but content across the whole web – how do you determine the relevance and appropriateness of a piece of content that comes from so many diverse sources and contains immense variation in the metadata associated with it.
  • Can trace developments of the web based upon what data is allowed for recommendations
    • Web 1.0 – Framing – What? Recommendations are a personalized presentation
      • Why – recommendations for the visitor to engage with the site/understand product space, find/discover stuff as well as for business
      • How - recommendations are a dynamic function of preferences, etc…
    • Visit Modeling to get beyond simple keyword lookup
      • Relational Modeling - This is the example Andreas talked about at the beginning of the course: maintaining a matrix of the ith user's rating for the jth movie/product, and using this to recommend items based on users who have a similar "vector" of interests. (See chart below)

Image: Netflix-style user-item rating matrix (netflix.png)
  • Collaborative Filtering and Page Rank in recommendation and search were first attempts at this relational modeling
  • Most recommendation engines today are a mishmash of several techniques – particularly collaborative filtering and another technique known as content-based recommendation, which recommends items to customer C that are similar to items C has previously rated highly. Netflix's algorithm incorporates both methods. Content-based recommendation also lets users see an explanation for recommended items, by listing the content features that appeal to them.
  • Key difference w/ traditional modeling was that relational was data in pairs (this page to that page, a visitor bought this and that)
  • Social networks not particularly used at first
  • General relational modeling allows for increasing relations and complexities between products and persons, but this is a problem that nobody has gotten perfect just yet.
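The ratings-matrix idea can be sketched as user-based collaborative filtering with cosine similarity between rating "vectors". The ratings below are made up, and real systems blend this with the content-based method mentioned above.

```python
import math

# Sketch of user-based collaborative filtering: find the neighbor with the
# most similar rating vector. Ratings are invented for illustration.
ratings = {  # user -> {item: rating}
    "u1": {"Alien": 5, "Heat": 4, "Up": 1},
    "u2": {"Alien": 4, "Heat": 5, "Up": 2},
    "u3": {"Alien": 1, "Heat": 1, "Up": 5},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    common = set(a) & set(b)
    dot = sum(a[m] * b[m] for m in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(user):
    others = [u for u in ratings if u != user]
    return max(others, key=lambda u: cosine(ratings[user], ratings[u]))
```

Here u1 and u2 share tastes and u3 is the outlier, so items u2 rated highly that u1 hasn't seen would be the recommendations for u1.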
  • Dynamic Tracking
    • First, didn’t do any tracking
    • Then do “behavioral” targeting – take static demographics about the visitor, map them into offline categories, and make recommendations for that category. Rapleaf is a fantastic example of a company that has collected an immense amount of information about people (see diagram below) to build static demographic profiles that are associated with your cookie as you browse the web.
      • Here the learning occurs offline

Image: Rapleaf data diagram (image1.jpg)

  • Dynamic tracking – take all the known attributes at a point in time and infer some internal state estimate
    • Example: someone came to Google at 3am and typed in some keywords, then didn’t click on any links but did another search. That’s all useful data for updating the results shown for the second search. Given that modified estimate, what new results should you return?
    • Some offline learning about shoppers but also online learning about the shopper
    • Interesting article about DUI apps which use time, geography and social networks to identify DUI checkpoints, which has drawn the scorn of some US Senators: Senators ask Apple, Google, RIM to pull DUI Checkpoint Apps
  • Three components of the data, ignoring social
    • Visitor, piece of content and context to see content
    • Visitor – demographic information about the time of visit, browser, IP address, etc…
      • Persistent information about that viewers’ attributes
      • Some transient information about the viewer
      • Is the data attributive or relational with another entity
      • There was a good discussion on the "philosophy" of what exactly constitutes an instance vs. a set, since you could consider any set as a "meta"-instance of some larger abstraction, and we might also argue that an instance is always related to some "set" of objects.

  • A behavioral targeting model identifies visitor segments/sets in terms of general properties
  • A set of visitor segments will have several instances. Sets might include 25-year-old males, 30-35-year-old mothers, etc…
  • Fuzzy problems can arise: Accessories might not be strictly defined, but, they will generally be cheaper things often purchased with more expensive things!
  • Form abstractions to make conclusions about items you might not know a lot about to compare and draw inferences
  • Products in a product category abstracted into brands
  • Catalog
    • Attributive of set: the average slipper costs $30
    • Relational of set: the average slipper costs less than the average bathrobe
    • Collaborative Filtering – the product equivalent of PageRank, based on co-purchasing or co-clicking
  • Recommendations are about as actionable as Web 1.0 gets
  • Levels of analysis and actionability – a pyramid from bottom to top
    • Page: to model the content and action is to show ads
    • Visit: model the intent and the situation and action is session-based marketing
      • Given search terms, can tell if you’re an expert on digital cameras so can model the context that you’re in
      • Planned vs. impulse visit, hurry vs. time to kill, ready for decision vs. info gathering, personal vs. job-related visit
    • Customer: visits strung together make a customer
      • Model demographics, psychographics, behavior in the past
    • Action: personalization, customization
      • Influencability, navigational style, early adopter, leader vs. follower, attitude to complexity and technology, price/time sensitivity
      • Browser vs. searcher is a very persistent property of a person, as is level of curiosity
    • Network: customers get connected to a network
      • Model: apply social network research
      • Action: discounts, better service if friends are price/time sensitive
  • Exploration vs. Exploitation – trying new things versus doing what works
    • No best number but depends upon situation/context
    • Understand when it's worthwhile to ask the user about their situation
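The exploration-vs-exploitation tradeoff is often sketched as an epsilon-greedy bandit: with some probability try something new, otherwise do what has worked best so far. The layouts and payoff estimates below are invented for illustration, and epsilon being context-dependent is exactly the "no best number" point above.

```python
import random

# Epsilon-greedy sketch: explore with probability eps, exploit otherwise.
def epsilon_greedy(estimates, eps):
    """Pick a random option with probability eps, else the current best."""
    if random.random() < eps:
        return random.choice(list(estimates))   # explore: try something new
    return max(estimates, key=estimates.get)    # exploit: do what works

# Made-up conversion-rate estimates for two page layouts
estimates = {"layout_a": 0.12, "layout_b": 0.07}
choice = epsilon_greedy(estimates, eps=0.0)  # eps=0: pure exploitation
```

Raising `eps` trades short-term conversions for learning; the right value depends on how confident the estimates already are.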