Andreas Weigend, Social Data Revolution | MS&E 237, Stanford University, Spring 2011 | Course Wiki

Class_18: [Analyzing Social Data]

Date: May 26, 2011
Audio: weigend_stanford2011.18_2011.05.26.mp3
Initial authors: Karthik Venkateswaran, Michelle Dadourian

RECAP of 24 May Class : Liberating Data

  • Opening up public data
    • Allows people to access data via APIs and build applications that are useful to consumers, through both structured business logic and trial and error.
    • unleashes the creative power of ideas
  • liberating data → letting people (developers) play with the data and come up with interesting solutions.
  • There is a digital divide between smartphone users and non-smartphone users, but that gap appears to be closing.
  • Data Ownership Analogy: who owns the data? → who owns the park?
    • nobody owns the park, but we all have interests in the park.
    • in some way, we all own the park (and the data)
  • In the parking-app example, the price of parking could change in real time based on demand and supply: if no parking spots are available and demand is high, prices rise, and vice versa. Such a transparent pricing mechanism becomes possible by opening up real-time parking data for public access, so that applications can be built that tap into it. This is similar to congestion pricing.
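The demand-based pricing idea above can be sketched in a few lines. This is a hypothetical illustration, not an actual city API: the function name, target occupancy, and price bounds are all invented for the example.

```python
# Hypothetical sketch of demand-based parking pricing. The function
# name, target occupancy, and price bounds are illustrative.

def adjust_price(current_price, occupancy_rate,
                 target=0.85, step=0.25,
                 min_price=0.50, max_price=6.00):
    """Raise the hourly rate when occupancy exceeds the target,
    lower it when spots sit empty, and clamp to a legal range."""
    if occupancy_rate > target:
        current_price += step
    elif occupancy_rate < target:
        current_price -= step
    return max(min_price, min(max_price, current_price))

# Nearly full block: price goes up.
print(adjust_price(2.00, 0.95))  # 2.25
# Mostly empty block: price goes down.
print(adjust_price(2.00, 0.40))  # 1.75
```

A real system (like SFpark, which this example resembles) would adjust prices on a slower cycle and per block, but the feedback loop is the same.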

Learning Goals

The landscape of Social Data Analysis is changing from
  • Static analysis → interactive exploration
  • Accounting mindset → social mindset
    • accounting mindset is the dominant mindset in many companies
      • accounting mindset: “better get the couple of #s right” ← not what social data is about
      • replaced by the social mindset, where data is analyzed to track performance metrics in a continuous, dynamic fashion.
    • why the shift in mindset?
      • in accounting, prescriptive rules exist and those rules don’t change
      • in cyberfraud, those rules shift all the time.
  • analysis tools are now powerful and interactive
  • need to be curious, creative
  • APIs: “Data having sex”

Social Data Analysis from Public Data Sources

  • Google Flu trends.

    • According to the Google Flu Trends site, “Google Flu Trends provides near real-time estimates of flu activity for a number of countries and regions around the world based on aggregated search queries”. This is a great example of analyzing social data to understand trends and patterns, and to visualize the results.
  • Google Correlate.

    • Google Correlate is an experimental new tool on Google Labs which enables you to find queries with a similar pattern to a target data series. The target can either be a real-world trend that you provide (e.g., a data set of event counts over time) or a query that you enter. This tool helps find google searches that correlate with real-world data.
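At its core, Google Correlate ranks candidate query series by how strongly they correlate with the target series. A minimal sketch of that idea, using Pearson correlation on made-up weekly counts (the query names and numbers here are invented):

```python
# Minimal sketch of the Google Correlate idea: rank candidate query
# series by Pearson correlation with a target series. All data here
# is fabricated for illustration.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

flu_cases = [10, 40, 90, 60, 20]                  # target: weekly event counts
queries = {
    "fever remedies": [12, 38, 95, 55, 22],       # tracks the target
    "beach volleyball": [80, 60, 20, 40, 90],     # moves opposite
}
ranked = sorted(queries, key=lambda q: pearson(flu_cases, queries[q]),
                reverse=True)
print(ranked[0])  # fever remedies
```

Correlation, of course, is not causation: the tool surfaces co-moving queries, and it is up to the analyst to decide which are meaningful.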

[Image: Google Correlate comic; more examples on the Google Correlate comics page]

HCI (Human-Computer Interaction)

  • How do computers interact with humans?
    • what are computers good at vs. what are humans good at?
  • Examples:

    • Mechanical Turk:
      • Amazon Mechanical Turk is a marketplace for work that requires human intelligence. The Mechanical Turk service gives businesses access to a diverse, on-demand, scalable workforce and gives Workers a selection of thousands of tasks to complete whenever it's convenient. Amazon Mechanical Turk is based on the idea that there are still many things that human beings can do much more effectively than computers, such as identifying objects in a photo or video, performing data de-duplication, transcribing audio recordings, or researching data details. Traditionally, tasks like this have been accomplished by hiring a large temporary workforce (which is time consuming, expensive, and difficult to scale) or have gone undone.
      • reputation
      • identity
      • payment (real/virtual)
    • Crowdflower:
      • CrowdFlower is the industry leader within a specific segment of crowdsourcing, often referred to as Labor-on-Demand. Labor-on-Demand is especially useful for large-scale tasks that computers have difficulty handling, but people do well. For example:
        • verifying information
        • categorizing images and text
        • assessing relevance
        • enhancing data records
      • CrowdFlower takes large, data-heavy projects and breaks them into small tasks that we distribute to an on-demand workforce around the world. Our technology then aggregates the results and controls for quality.
    • Pornfarming
      • to create a Gmail account, you must prove you're human by transcribing the text in a CAPTCHA; spammers reportedly outsource this step by showing the CAPTCHAs to other humans (e.g., visitors to porn sites), who solve them unknowingly
    • TaskRabbit: helping people outsource their errands to people around them
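A core quality-control technique in these human-computation marketplaces is redundancy: give the same task to several workers and aggregate their answers. A toy sketch of majority-vote aggregation (the task and answers are invented for illustration):

```python
# Toy sketch of crowdsourcing quality control: distribute the same
# task to several workers and aggregate answers by majority vote.
# The example task and worker answers are invented.
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Three workers label the same image.
worker_answers = ["cat", "cat", "dog"]
label, agreement = majority_vote(worker_answers)
print(label, agreement)  # majority label and its vote share
```

Production systems like CrowdFlower go further, weighting votes by each worker's track record on "gold" questions with known answers, but simple redundancy is the starting point.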

Online Privacy and Behavior Change Over Time

  • What has changed your behavior over time (in terms of what you put up publicly)? Increasingly, friends and family members are connecting through Facebook. With close family members present in one's social network, one can't easily deny a friend request or access to profile information, so one tends to look for ways to censor oneself.
    • why not put your 12 year old family member on a limited profile?
      • keep the policy of: if my family shouldn’t see it, nobody should see it
    • is it too much work to create different images of yourself? (customize each view of yourself by setting different limited profiles)
    • you can now see your profile through the eyes of somebody else on Facebook
  • Now change the question to the government: do people change their behavior of creating social data for the gov't?
    • Example: a search warrant is needed before investigating without someone's consent (Patriot Act). Police can't search a car without probable cause. But today, cameras can read all of the license plates on the streets.
      • How do these things change our behavior?
    • If we allow the government to decrease our privacy to help keep us safe today (data being used for good), what happens if a future administration changes that policy and starts using our saved data in a bad way?
    • EVEN IF you’re not tagged → the picture still exists and can be tracked back to you
    • Social norm is shifting
      • Now we have to justify our reasoning to not want to participate in sharing our data
        • The opposition might ask “What’s the big deal?”
  • An interesting story about the fall of a Wall Street trader, unearthed by the data that never dies:

Palantir: Data Analysis and Visualization.

The goal of data mining and machine learning is to develop automated techniques of inferring meaning, and finding hidden insights from data. However, a lot of complex analysis of data needs human beings in the loop. The most powerful systems combine automated data-mining and human analysis. They provide human analysts, usually domain experts in the area being investigated, with tools that let them easily "play" and "visualize" the data available. However, we are talking about vast quantities of data, from many disparate sources, in differing formats. Hence the tools that enable analysis on this data have to be sophisticated. Large scale data visualization is an emerging field which is projected to become a multi-billion dollar industry. One of the pioneers of this field is Palantir.

Palantir offers a Java-based platform for analyzing, integrating, and visualizing data of all kinds, including structured, unstructured, relational, temporal, and geospatial.

Genesis of Palantir:
In the early days of PayPal, about 3-5% of transactions on PayPal were fraudulent. Since the company's profit margin was only about 2%, it was imperative to identify and cancel the fraudulent transactions. PayPal started by building automated, machine-learning-based classifiers to identify fraud. However, classifiers learn from past data; if future data deviates from past data, the classifiers fail. The "fraudsters" started gaming the system by figuring out which features the classifiers relied on most, which can be done by a simple process of trial-and-error elimination.
In the end, PayPal devised the following system:
1) Use a classifier with high recall, at the expense of precision, i.e. a high false-positive rate and a very low false-negative rate.
2) Use human classifiers to look at all transactions in the above filtered set to finally classify the fraudulent transactions.
To enable the human classifiers, PayPal built a sophisticated system that let them draw from multiple sources of data, visualize the data, and perform analysis on it. The system worked very well. Some PayPal executives realized that such a software system could be built and sold to customers who want to analyze vast quantities of data. Thus Palantir was born.
The team leveraged the fundamental insight that computers alone (Artificial Intelligence) could not defeat an adaptive adversary. Palantir allows human analysts to quickly explore data from many sources in conceptual ways (Intelligence Augmentation).
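Step 1 of the two-stage system above amounts to choosing a low score threshold so that nearly every fraudulent transaction gets flagged for human review, accepting many false positives. A sketch with fabricated scores and labels (the function and data are illustrative, not PayPal's actual system):

```python
# Sketch of step 1: pick a classification threshold that favors
# recall over precision, so human reviewers see nearly all fraud.
# Scores and labels are invented for illustration.

def flag_for_review(scores, labels, threshold):
    """Flag transactions scoring above the threshold; report
    recall and precision of the flagged set."""
    flagged = [i for i, s in enumerate(scores) if s >= threshold]
    fraud = [i for i, y in enumerate(labels) if y == 1]
    caught = [i for i in flagged if labels[i] == 1]
    recall = len(caught) / len(fraud)
    precision = len(caught) / len(flagged) if flagged else 0.0
    return flagged, recall, precision

scores = [0.95, 0.80, 0.40, 0.30, 0.20, 0.10]   # model's fraud scores
labels = [1,    0,    1,    0,    0,    0]      # 1 = actually fraud

# A low threshold catches all fraud (recall 1.0) at poor precision;
# the flagged set then goes to human analysts for the final call.
flagged, recall, precision = flag_for_review(scores, labels, 0.25)
print(recall, precision)  # 1.0 0.5
```

The human analysts in step 2 absorb the false positives that the deliberately loose threshold lets through.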

Palantir Technologies
  • the company has about 400 employees
  • its products are among the most expensive "boxed" software products in the world.
  • its primary customers are the intelligence arm of the government and financial institutions
  • what does half a million dollars get the customer?
    • ability to take multiple data sources and fuse them together through a conceptual viewpoint
    • take relevant data sources and analyze/create visualizations for them

Project Horizon

Project Horizon, developed as a Palantir Hack Day project on top of the Palantir platform, empowers analysts to start with their entire ecosystem of data (literally billions of rows of data), and iteratively pare the data down to discover the proverbial needle in the haystack at the speed of thought (discover the unknown unknowns). Project Horizon is part of Palantir’s approach for big data: rapid, interactive analysis of datasets that contain billions of records.

Beyond the Cloud: Project Horizon

In our in-class example of using Project Horizon, we saw how a single data set of mortgage information covering over 350 million people can be analyzed in less than 10 seconds.
  • we looked for patterns of "predatory lending" by creating a heatmap of the dataset
    • first we must clean the data → get rid of "dirty data"
      • 90% of time in a data analysis project is spent on cleaning the data
      • 9% of the time is spent on making the model
      • 1% of the time is spent on making money
    • hotspots are found in a number of large cities
      • we can target a subset of the data (the poor, who are generally targeted by predatory lenders) and recreate a new heatmap, which shows Detroit standing out
        • we can do a radius search around Detroit
          • most lenders are south of 8 mile road (the road that separates poor and wealthy areas)
          • this suggests that we should further investigate certain areas of Detroit
Here is a link to the above demo:
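The heatmap step above boils down to binning records by location and counting per grid cell. A minimal sketch, with fabricated coordinates (a real run would cover millions of rows; the grid size and points here are invented):

```python
# Hedged sketch of the heatmap step: bin loan records into a
# lat/lon grid and count per cell. Coordinates are fabricated.
from collections import Counter

def heatmap(points, cell=1.0):
    """Count points falling into each (lat, lon) grid cell."""
    grid = Counter()
    for lat, lon in points:
        grid[(int(lat // cell), int(lon // cell))] += 1
    return grid

loans = [(42.30, -83.10), (42.40, -83.20), (42.35, -83.05),  # Detroit cluster
         (40.70, -74.50)]                                    # lone outlier
hot = heatmap(loans)
hotspot = max(hot, key=hot.get)
print(hotspot, hot[hotspot])  # (42, -84) 3
```

The analyst's workflow is then to re-run this on a filtered subset (e.g. only high-interest loans) and compare which cells stay hot, which is the iterative pare-down Project Horizon makes interactive.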

The Pearl Project

Another cool project completed by Palantir was a three and a half year project identifying the intricate network of co-conspirators in the kidnapping and beheading of Wall Street Journal reporter Daniel Pearl. Palantir provided its software platform to the investigators to support analysis of the data and help identify the links between the key actors involved in that heinous crime. Here's the link

Key Insights that we can learn from the success of companies like Palantir:
  • role of programmer vs statisticians in the company
    • programmers are incredibly valuable. They can leverage statistical software to build valuable products
  • take-home point: data is no longer static
  • creativity is critical.
    • The ability to mash together different, seemingly uncorrelated pieces of data and look for latent correlations will be a very valuable skill. It requires more creativity than conventional statistics, which involves detailed analysis of homogeneous data.

This service allows anyone to use powerful tools from Palantir Government to analyze data from

Cool Visualization Links

NameVoyager [arievans]
Love this NYTimes interactive visualization on U.S. President Speeches . Tons of data embedded in here. [aegupta]
A non-interactive, but pretty, visual on social networks [aegupta]

World Bank Data Visualizer: This is one of the coolest and most comprehensive visualizations I have ever seen. You can select different variables for the axes and view the behaviour of those variables over time. Here is a video for a visualization of the history of poverty.

Hans Rosling's 200 Countries, 200 Years, 4 Minutes in visualization- The Joy of Stats - BBC [kancao]