Data Science & Psychology Data Science applied to Values, Morals, Politics, & things that matter.

5Apr/12

Big Data Should Measure Value Fit

I gave a presentation at South by Southwest earlier this month. I appreciate the many people who voted for my idea, who attended my talk, and who gave me feedback via Twitter or face to face afterwards. It was a great experience.

It was a great experience, not so much for the people I met or for the thrill of speaking, both of which were nice, but because it forced me to think deeply about what I wanted to say. A famous writer once asked, “How do I know what I think until I see what I say?” My thoughts are still evolving (one person, who was positive about the talk, told me afterwards that she could see my thoughts evolve on stage), and if I did the presentation over, I would frame it differently, but what I believe I arrived at is this: Big data should measure value fit. Or perhaps more generally, the proliferation of data should be used to measure the intangible things that we say are important to us.

Here is more or less what I ended up saying, in narrated PowerPoint:

The Moral Psychology and Big Data Singularity - Ravi Iyer SXSWi

I was happy with my talk, but I will try to simplify things a bit the next time I do it. Rather than present more cool findings from psychology, which are endless but ultimately forgotten, I would focus more clearly on the point I started with: that we need to bridge the gap between the things we say we care about and the things that we measure.

Just as countries are starting to question whether gross domestic product is a good measurement of that which is worthwhile, companies should start to question whether profits/monthly unique visitors/return on investment/Facebook likes/valuation measure that which is worthwhile. A recurring theme at South by Southwest was the importance of values and happiness, as evidenced by talks with names like "Go Forth and make Awesomeness:  Core Values & Action" or "Why Happiness is the new Currency?". But while companies talk about values and happiness as outcomes, they don’t measure them, perhaps because they feel like they can’t measure the intangible. Moral psychology and positive psychology, which deal with the quantification of values and happiness-related constructs, can provide this methodology so that big data can eventually be used to measure the right things.

Once you start to think in this way, you can see this need everywhere. On cue, a friend recently sent me this article from the New York Times, which illustrates the points I make. It’s by a courageous Goldman Sachs employee who quit because he felt, in the terms of this post, that Goldman was measuring success the wrong way.

How did we get here? The firm changed the way it thought about leadership. Leadership used to be about ideas, setting an example and doing the right thing. Today, if you make enough money for the firm (and are not currently an ax murderer) you will be promoted into a position of influence.

What are three quick ways to become a leader? a) Execute on the firm’s “axes,” which is Goldman-speak for persuading your clients to invest in the stocks or other products that we are trying to get rid of because they are not seen as having a lot of potential profit. b) “Hunt Elephants.” In English: get your clients — some of whom are sophisticated, and some of whom aren’t — to trade whatever will bring the biggest profit to Goldman. Call me old-fashioned, but I don’t like selling my clients a product that is wrong for them. c) Find yourself sitting in a seat where your job is to trade any illiquid, opaque product with a three-letter acronym.

Today, many of these leaders display a Goldman Sachs culture quotient of exactly zero percent. I attend derivatives sales meetings where not one single minute is spent asking questions about how we can help clients. It’s purely about how we can make the most possible money off of them. If you were an alien from Mars and sat in on one of these meetings, you would believe that a client’s success or progress was not part of the thought process at all.

I am sure that Goldman Sachs has sophisticated algorithms to use their giant data sets to predict financial markets and make as much money as possible.  I doubt they’ve ever considered measuring the values of their employees. Sometimes what you measure is a reflection of your values.

- Ravi Iyer

ps. I am not short on projects, but if you would like help taking the data you have and using it to measure intangible/psychological things, feel free to email me.

6Mar/12

Five ways that technology will democratize social science

I currently work as both a researcher at USC and as the Director of Data Science at Ranker.com. Some people would consider these two roles to be somewhat tangential, but I'm finding that there is a lot of overlap. Technological methods are increasingly of use in social science at the same time as social science methods are being imported into technology companies, many of which are trying to create statistical models to predict behavior. As more and more data on human behavior and thought is collected by technology companies, as opposed to university researchers, it seems inevitable that social science itself will be changed.

Technology has not just changed, but disrupted, every other dominant form of information distribution that previously existed, be it the distribution of music (iTunes), news (Huffington Post), books (Amazon), TV (Hulu), gossip (Twitter), jokes (Cheezburger), language (c u l8r), family news (Facebook), or education (TED talks or the Khan Academy). While academia is called the ivory tower for a reason, it seems unlikely that it will escape this wave of change, especially given that the biggest technology companies collect far more data on human thought and behavior in a day than all of academia collects in a year.

Here are five specific ways that I believe technology will change social science:

1. Bigger, ecologically valid data sets - The only thing that separates social science from opinion is the use of data, and with more data come more confident findings. There is currently some debate in social psychology about methodology that can sometimes lead to false positive results by taking advantage of chance. For example, statistical significance is conventionally defined, in many sciences, as a result that would occur less than 5% of the time by chance alone if there were no real effect. That sounds impressive, but if 200 researchers each test something that is not true, about 10 of them can expect to "prove" it by sheer chance. As data sets get bigger and bigger, the chance of error will become lower and lower, with standards for "significance" getting more and more stringent. In addition, most of this new data will be collected in real world environments, meaning that there will be less of a logical leap than when inferring a real world phenomenon from the results of a lab study.
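
The arithmetic behind the 200-researchers example can be checked in a few lines. This is a minimal sketch, assuming each researcher runs one independent test of a hypothesis that is actually false, at the conventional 5% threshold. The function names are my own, chosen for illustration.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results when no real effect exists."""
    return n_tests * alpha


def chance_of_at_least_one(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent null tests looks 'significant'."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # 200 researchers, each testing the same false hypothesis once.
    print(expected_false_positives(200))  # roughly 10 spurious "proofs"
    print(chance_of_at_least_one(200))    # very close to 1
```

Tightening the threshold (a smaller `alpha`) shrinks both numbers, which is the sense in which larger data sets allow more stringent standards for "significance".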

2.  Cross-sample Validation - With more data comes the possibility of dividing a dataset into many parts (e.g. by referral URL) and replicating research in many datasets. To do this efficiently will require a technology we use a lot at Ranker, the semantic web. The semantic web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. Right now, researchers cross-validate findings through a painstaking process called meta-analysis, whereby interested parties attempt to reconcile various datasets into a standard format. Of course, because there are no standards, each reconciliation is a one-time, throwaway process, and the data one researcher extracts is unusable by subsequent researchers. Google Scholar is starting the process of taking "dumb text" in papers and creating some metadata: my Scholar page contains extractions of dates, authors, and citations from papers I've written. If there were a standard format for describing the data within the papers, there is no reason why that data couldn't be extracted as well, allowing us to answer questions like "what is the correlation between openness to experience and ideology in data collected before and after 2005?" without having to read all those papers. It would also let people share simple findings like "the correlation between openness to experience and ideology at place X and time Y is Z," which are completely lost now.
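
To make the idea of a standard format concrete, here is a rough sketch of what a machine-readable finding and a "before and after 2005" query might look like. Everything here is an assumption for illustration: the `Finding` fields are hypothetical rather than an existing semantic web vocabulary, the numbers are made-up placeholders rather than real results, and the pooling step uses a sample-size-weighted Fisher z average, which is one common meta-analytic choice rather than anything specific to this proposal.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Finding:
    """One shareable result; field names are hypothetical, not a real standard."""
    var_a: str
    var_b: str
    r: float    # reported correlation
    n: int      # sample size (must exceed 3 for the weighting below)
    year: int   # year the data were collected
    place: str


def pooled_correlation(findings):
    """Combine correlations with a sample-size-weighted Fisher z average."""
    weights = [f.n - 3 for f in findings]
    z_mean = sum(w * math.atanh(f.r) for w, f in zip(weights, findings)) / sum(weights)
    return math.tanh(z_mean)


def query(findings, var_a, var_b, keep):
    """Pool every finding on the variable pair that passes the `keep` filter."""
    pair = {var_a, var_b}
    hits = [f for f in findings if {f.var_a, f.var_b} == pair and keep(f)]
    return pooled_correlation(hits) if hits else None


# Made-up placeholder values, for illustration only.
FINDINGS = [
    Finding("openness", "ideology", 0.20, 103, 2001, "X"),
    Finding("ideology", "openness", 0.30, 203, 2003, "Y"),
    Finding("openness", "ideology", 0.25, 503, 2008, "Z"),
]

before = query(FINDINGS, "openness", "ideology", lambda f: f.year < 2005)
after = query(FINDINGS, "openness", "ideology", lambda f: f.year >= 2005)
```

The point of the sketch is that once findings share a structure, the question becomes a filter plus an average rather than a reading list, and the same records remain reusable by the next researcher.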

3. Adding inter-disciplinary analysis and variables - Right now, social science is balkanized. Every discipline has its own methodologies and opinions about what is or is not the right way to do things. Personality psychologists care a lot about measurement while political scientists care about sampling. Social psychologists create brilliant, artificial, controlled lab experiments designed to isolate variables, while technology companies mine free-form, uncontrolled data seeking exploratory patterns. Qualitative methods have a richness and depth that are scoffed at by more quantitative researchers. All of these methodologies have error, and the methodologies within any one discipline share error, such that they all would be improved by adding the techniques of other disciplines. But as long as there are no standards for data (e.g. the semantic web), reconciling this data would require immense human effort. Further, the lack of standards means that we never have the full picture of human thought and behavior. Psychologists may study risk tolerance and variable A and financial analysts may study risk tolerance and variable B, which might lead to a natural hypothesis as to the relationship between A and B. But since psychologists are not interested in B and financial analysts do not care about A, nobody reconciles this data. Real world human behavior usually involves the variables in ALL disciplines, yet each discipline often contents itself with its own slice of a human being. Semantic technology will eventually allow us to put these slices together.

4. Systems level approaches - Of course, putting together the results of semantic datasets, which combine hundreds of variables and many bi-directional connections, all with varying degrees of confidence arrived at through various methodological and sampling techniques, is not easy using the traditional paper format.  The end result of such an approach is often a system or a model, of the type that computer scientists build, rather than a paper.  Some psychologists are putting together connectionist models, but the expertise to actually do such things lies in technology circles more than in the social science community.

5. An open knowledge base - The internet hates middlemen, and right now, academic publishers are middlemen who control the flow of information under the outdated idea that people read printed editions of journals devoted to specific limited topics with limited pages. The goal of the editorial process is to separate truth from untruth through peer review, which is laudable but completely impractical, as evidence exists along a continuum instead of being categorically true or untrue. There are so many peer-reviewed journals that anything can get the stamp of "truth". Unlike in physics or chemistry, a single paper's worth of evidence, no matter where it appears, is never conclusive in social science. Big controversies exist in social science even on subjects covered by tons of very well-done papers, each of which is ostensibly the truth, or else it shouldn't have been published, right? The reality of social science is that the best we can do is to sum up all the evidence from all the various data collected, hopefully using various methodologies (again, something the semantic web can solve), and get a bigger picture of how robust any finding is. However, since peer review checks for importance, topicality, novelty, and a host of other subjective factors, not to mention a journal's bias against replications and null findings, the current process actually ends up hiding the true sum of all evidence for any finding. That is how prominent, blatantly false findings can exist in the literature for years undetected. Further, since journals require high subscription fees from universities (whose employees, ironically, do all the work for the journal), only people at first world universities can even see this evidence.

Whether you agree with my hypothesis or not, the current system is simply unsustainable given the mountain of data that is coming and the ethos of Silicon Valley, where publish then review/filter/aggregate is the dominant model. As more and more data on human behavior and thought is published by companies like Hunch, OkCupid, Ranker and the Facebook data team, the traditional social science system will necessarily adapt to these methods or become largely irrelevant next to these larger, more ecologically valid, robust, and complex datasets.

In summary, social scientists are incredibly smart about what they do, most more so than I am, and there is a lot that technologists can learn from social science methods. Indeed, on March 11, I'll be giving a talk at SXSW about how much technologists can benefit from social science methods, especially as they relate to serving the intangible needs of employees and customers.

However, there are countless ways that social scientists can benefit from technology as well. Human beings have been studying the human condition for thousands of years, and the idea that a select group of humans can use their special methodology to go off into an ivory tower, figure things out, and then inform the rest of us what the truth is, is an unlikely scenario. Or perhaps more correctly, it is a common scenario that has played out throughout history with no actual impact on our collective understanding. If we really want to make an impact on our collective understanding of ourselves, it will take a collective effort from social scientists and internet professionals, quantitative and qualitative researchers, novelists and political scientists, and even the kid who surveys a 3rd grade class, whose data contributes to our collective understanding too. It is my proposition that technology, and specifically the semantic web, may finally allow such a collaboration to occur.

- Ravi Iyer