Data Science & Psychology Data Science applied to Values, Morals, Politics, & things that matter.

25Apr/13

Big Data Stocks? Invest in Data, not in Tools.

I want to invest in "big data stocks".  After all, everyone is saying that big data is the future of health care, education, government, and business, and that it will literally change the world.  As someone who works with data both as an academic at USC and as the principal data scientist at Ranker, I am the type of person who is likely to make and believe in such hyperbolic claims.  I recently put money into my IRA and needed to invest it.  As someone who believes in investing in what I know, I naturally wanted to invest in our data-driven future.

Where should I invest?  If you look around the internet, you'll find a number of recommendations from places like Forbes or The Street.  The general consensus appears to be to take the "picks and shovels" approach to investing in big data, where you invest in the companies that make the tools that enable people to use data, rather than in the data itself.  I'm writing this post because I think this is absolutely the wrong approach.  I believe in investing in data, not in tools.  Why do I believe that?

- My experience in academia has taught me that simple statistics and tools are often the most reliable. If there is signal to be detected, any analysis and/or tool should be able to find it.  Many people turn to more complex statistics when they don't find the right relationship using simple statistics.  In psychology, people are finding that the use of  more complex models (e.g. covariates) is often an indicator that the study's results may be less likely to be reliable.  Given the size of datasets that we often have in data science, we often don't need special statistical techniques to find relationships in data as we have so much statistical power that most tools and techniques should give you convergent results.  Put simply, the tools matter less than the data.

- The most popular tools and techniques are often open source. You can do a lot with R, Python, Gephi, Mahout, etc.

- Yes, there are advantages to using particular distributions of open source tools (e.g. Hadoop distributions that come with particular features), but there are so many companies out there offering different flavors of products that do essentially the same thing, that I can't see how any particular company is going to be the next Apple or Google, in terms of stock growth.  There are no barriers to entry in the tools market.  Perhaps a company will be the next RedHat, which may be a fine business to be in, but I don't believe that that is the revolutionary wave that investors in big data stocks are looking for.
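The first point above, that large datasets give so much statistical power that even simple statistics find the signal, can be illustrated with a rough sketch. It assumes a plain Pearson correlation tested with the Fisher transformation; the correlation of 0.05 and the sample sizes are hypothetical values chosen only for illustration, not figures from any real dataset.

```python
import math

def fisher_z_statistic(r: float, n: int) -> float:
    """Approximate z statistic for testing a Pearson correlation r against zero.

    Uses the Fisher transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3).
    """
    return math.atanh(r) * math.sqrt(n - 3)

# Hypothetical weak correlation, chosen only for illustration.
weak_r = 0.05
for n in (100, 10_000, 1_000_000):
    z = fisher_z_statistic(weak_r, n)
    verdict = "detectable" if abs(z) > 1.96 else "not detectable"
    print(f"n={n:>9,}  z={z:6.2f}  {verdict} at the 5% level")
```

Under these assumptions, a relationship too weak to see in a hundred observations becomes unmistakable at the sample sizes common in data science, without any special technique.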

So what should you do if you want to invest in big data?  Buy stock in companies that have the best, biggest, most unique sets of data and/or the most defensible ways of collecting that data.  I invested my IRA money in Facebook, which has the biggest and best dataset of human behavior that has ever existed.  I invest my academic time in scalable data collection projects such as YourMorals, BeyondThePurchase, and ExploringMyReligion, confident that they will lead to the most long-term knowledge.  And I invest my professional time in Ranker, which has a scalable process for collecting an opinion graph that will be essential for the kinds of intelligent applications that big data futurists have been promising us.

Do you want to invest in big data?  Generally, you'll get better returns if you invest your money, time, and energy in data, rather than in tools.

- Ravi Iyer

6Mar/12

Five ways that technology will democratize social science

I currently work as both a researcher at USC and as the Director of Data Science at Ranker.com.  Some people would consider these two roles to be somewhat tangential, but I'm finding that there is a lot of overlap.  Technological methods are increasingly of use in social science at the same time as social science methods are being imported into technology companies, which are trying to create statistical models to predict behavior.  As more and more data on human behavior and thought is collected by technology companies, as opposed to university researchers, it seems inevitable that social science itself will be changed.

Technology has not just changed, but disrupted, every other dominant form of information distribution that previously existed, be it the distribution of music (iTunes), news (Huffington Post), books (Amazon), TV (Hulu), gossip (Twitter), jokes (Cheezburger), language (c u l8r), family news (Facebook), and education (TED talks or the Khan Academy).  While academia is called the ivory tower for a reason, it seems unlikely that it will escape this wave of change, especially given the fact that the biggest technology companies collect far more data on human thought and behavior in a day than all of academia collects in a year.

Here are five specific ways that I believe technology will change social science:

1. Bigger, ecologically valid data sets - The only thing that separates social science from opinion is the use of data, and with more data come more confident findings.  There is currently some debate in social psychology about methodologies that can sometimes lead to false positive results by taking advantage of chance.  For example, statistical significance is conventionally defined, in many sciences, as a result that would occur by chance less than 5% of the time if there were no real effect.  That sounds impressive, but if 200 researchers each test an effect that does not exist, about 10 of them can expect to find a "significant" result by sheer chance.  As data sets get bigger and bigger, the chance of error will become lower and lower, with standards for "significance" getting more and more stringent.  In addition, most of this new data will be collected in real world environments, meaning that there will be less of a logical leap when inferring some real world phenomenon that relates to the results of a lab study.

2.  Cross-sample Validation - With more data comes the possibility of dividing a dataset into many parts (e.g. by referral URL) and replicating research in many datasets.  To do this efficiently will require a technology we use a lot at Ranker, the semantic web.  The semantic web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries.  Right now, researchers cross-validate findings through a painstaking process called meta-analysis, whereby interested parties attempt to reconcile various datasets into a standard format.  Of course, because there are no standards, each reconciliation process is a one-time throwaway process, whereby all the extracted data from one researcher is unusable by subsequent researchers.  Google scholar is starting the process of taking "dumb text" in papers and creating some metadata as my scholar page contains extractions of dates, authors, and citations from papers I've written.  If there were a standard format for describing the data within the papers, there is no reason why those couldn't be extracted as well, allowing us to answer questions like "what is the correlation between openness to experience and ideology in data collected by people before and after 2005?" without having to read all those papers.  It would also let people share simple findings like "the correlation between openness to experience and ideology at place X and time Y is Z," which are completely lost now.

3. Adding inter-disciplinary analysis and variables -  Right now, social science is balkanized.  Every discipline has its own methodologies and opinions about what is or is not the right way to do things.  Personality psychologists care a lot about measurement while political scientists care about sampling.  Social psychologists create brilliant artificial controlled lab experiments designed to isolate variables, while technology companies mine free form, uncontrolled data seeking exploratory patterns.  Qualitative methods have a richness and depth that is scoffed at by more quantitative researchers.  All of these methodologies have error and the methodologies of any discipline share error, such that they all would be improved by adding the techniques of other disciplines.  But as long as there are no standards for data (e.g. the semantic web), reconciling this data would require immense human effort.  Further, the lack of standards means that we never have the full picture of human thought and behavior.  Psychologists may study risk tolerance and variable A and financial analysts may study risk tolerance and variable B, which might lead to a natural hypothesis as to the relationship between A and B.  But since psychologists are not interested in B and financial analysts do not care about A, nobody reconciles this data.  Real world human behavior usually involves the variables in ALL disciplines, yet each discipline often contents itself with its own slice of a human being.  Semantic technology will eventually allow us to put these slices together.

4. Systems level approaches - Of course, putting together the results of semantic datasets, which combine hundreds of variables and many bi-directional connections, all with varying degrees of confidence arrived at through various methodological and sampling techniques, is not easy using the traditional paper format.  The end result of such an approach is often a system or a model, of the type that computer scientists build, rather than a paper.  Some psychologists are putting together connectionist models, but the expertise to actually do such things lies in technology circles more than in the social science community.

5. An open knowledge base - The internet hates middlemen, and right now, academic publishers are middlemen who control the flow of information under the outdated idea that people read printed editions of journals devoted to specific limited topics with limited pages.  The noble goal of the editorial process is to separate truth from untruth through peer review, which is a laudable, but completely impractical goal, as evidence exists along a continuum instead of being categorically true or untrue.  There are so many peer-reviewed journals that anything can get the stamp of "truth".  Unlike physics or chemistry, a single paper's worth of evidence, no matter where it appears, is never conclusive in social science.  Big controversies exist in social science even about things where there are tons of very well-done papers about the subject, each of which is ostensibly the truth, or else it shouldn't have been published, right?  The reality of social science is that the best we can do is to sum up all the evidence from all the various data collected, hopefully using various methodologies (again, something the semantic web can solve), and get a bigger picture of how robust any finding is.  However, since peer review checks for importance, topicality, novelty, and a host of other subjective factors, not to mention a journal's bias against replications and null findings, the current process actually ends up hiding the true sum of all evidence for any finding.  That is how prominent blatantly false findings can exist in the literature for years undetected.  Further, since journals require high subscription fees from universities (whose employees, ironically, do all the work for the journal), only people at first world universities can even see this evidence.

Whether you agree with my hypothesis or not, the current system is simply unsustainable given the mountain of data that is coming and the ethos of Silicon Valley, where publish then review/filter/aggregate is the dominant model.  As more and more data on human behavior and thought is published by companies like Hunch, OkCupid, Ranker and the Facebook data team, the traditional social science system will necessarily adapt to these methods or become largely irrelevant next to these larger, more ecologically valid, robust, and complex datasets.
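The false-positive arithmetic in the first point above can be checked with a small sketch. It assumes the conventional 5% threshold, that all 200 researchers are testing an effect that does not exist, and that their studies are independent; under those assumptions a p-value is uniformly distributed, so each study is modeled here as a single random draw.

```python
import random

ALPHA = 0.05          # conventional significance threshold
N_RESEARCHERS = 200   # figure used in the text

# Expected number of "significant" results when no real effect exists.
expected_false_positives = ALPHA * N_RESEARCHERS
print(f"expected false positives: {expected_false_positives:.1f}")

def simulate_false_positives(n_researchers, alpha, rng):
    """Count studies that cross the threshold by chance alone.

    Under a true null hypothesis a p-value is uniform on [0, 1],
    so each study is modeled as one uniform draw.
    """
    return sum(rng.random() < alpha for _ in range(n_researchers))

rng = random.Random(0)  # fixed seed so runs are repeatable
counts = [simulate_false_positives(N_RESEARCHERS, ALPHA, rng) for _ in range(2_000)]
average = sum(counts) / len(counts)
print(f"average false positives per {N_RESEARCHERS} studies: {average:.2f}")
```

The simulated average should sit close to the expected value of 10, though any single batch of 200 studies may produce noticeably more or fewer.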

In summary, social scientists are incredibly smart about what they do, most more so than I, and there is a lot that technologists can learn from social science methods.  Indeed, on March 11, I'll be giving a talk at SXSW about how much technologists can benefit from social science methods, especially as it relates to serving the intangible needs of employees and customers.

However, there are countless ways that social scientists can benefit from technology as well.  Human beings have been studying the human condition for thousands of years, and the idea that a select group of humans can use their special methodology to go off into an ivory tower, figure things out, and then inform the rest of us what the truth is, is an unlikely scenario.  Or perhaps more correctly, it is a common scenario that has played out throughout history with no actual impact on our collective understanding.  If we really want to make an impact on our collective understanding of ourselves, it will take a collective effort from social scientists and internet professionals, quantitative and qualitative researchers, novelists and political scientists, and even the kid who surveys a 3rd grade class, whose data contributes to our collective understanding too.  It is my proposition that technology, and specifically the semantic web, may finally allow such a collaboration to occur.

- Ravi Iyer

15Feb/11

Psychology is generally Continuous, not Categorical

We live in a world where we often have to make categorical decisions.  We date someone or we don't.  We marry them or we don't.  We hire someone or we don't.  We pick either the Democrat or the Republican.  There is no middle ground.

Unfortunately, the world isn't necessarily organized in that fashion.  Few would believe there are such categorical distinctions.  Prospective dates have some degree of positive and negative qualities, rather than attributes being merely present or absent.  Are people either qualified or not for a job?  Most people instead belong along a continuum of professional ability, with some being very qualified (way above being merely adequately qualified) and some people being just below and just above the border of qualification.  Politicians aren't uniformly liberal or conservative and we routinely see partisans on both sides upset at those who aren't extreme enough and who fail to toe the partisan line.

This may seem obvious, but the reason I bring it up now is that while almost everyone would agree with this when they think about it carefully, many people still argue as if things were categorical.  There are two recent examples on the yourmorals blog.

First, the comment section of this post has become a debate (for many) over whether psychology is objective (science) or subjective (art).  Allow me to quote Gene, from this thread:

there is SOME objective knowledge that comes from psych research (anything that can be experimentally shown, is predictive, even if only statistically, it has value).

If you want to get really nitty gritty, even physics is not completely “objective”…it’s merely instrumental to understanding objectivity (see here: http://en.wikipedia.org/wiki/Instrumentalism)

Most things are not completely objective or completely subjective, especially where human affect, behavior, and cognition is concerned.  Yes, psychology is less objective than physics...but it's more objective than sculpture.  If I think that Paul McCartney sings better than I do, is that an objective or a subjective fact?  It's objective in so far as a survey of people would detect a very large statistically significant difference between perceptions of our singing.  But it's subjective in so far as it may not be true for a particular person (e.g. my wife and my mom).

What complicates things further is that many people who read psychology don't really care about what happens to most people, but rather how the research applies to them.  Consider this very useful overview of how changing our consumption patterns can make people happier.  One of the recommendations is something that I tell people often, that experiences lead to more happiness than material things, an opinion shared by 57% of a national sample (and shown to be true for most in experimental research).  Yet, 34% of those people disagree (and some don't benefit in experiments).  So is the statement that "buying experiences leads to more happiness than buying things" an objective or a subjective fact?  It's true for a majority of people, but not for a significant minority.  It's likely true for many groups, but certainly not all groups.  Yet many people still think we can definitively decide if psychology is objective or subjective, even though humans, unlike inanimate objects, don't react predictably to situations, except perhaps in aggregate (e.g. we have free will or at least the illusion of it).  I can find truths that apply to all rocks or all electrons, but not for all humans.  But I can find truths that apply to many humans or most humans, and that might give someone insight into themselves, which is a valuable thing.

A second instance of categorical thinking on the yourmorals blog of late is Pigliucci's critique of Haidt's recent SPSP speech.  Haidt pointed out that there is underrepresentation of conservatives in social psychology compared to the population and cites both self-selection and discrimination as issues to varying degrees.  Many people (understandably) focus on the sexier charge of discrimination, and Pigliucci answered that he "suspect(s) the obvious reason for the “imbalance” of political views in academia is that the low pay, long time before one gets to tenure (if ever), frequent rejection rates from journals and funding agencies, and the necessity to constantly engage one’s critical thinking skills naturally select against conservatives."  But what if causality were continuous and not categorical?  Pigliucci may be entirely right about his obvious reason, yet there still could be some amount of discrimination.  Indeed, if there is one student somewhere whose ideas are suppressed (and there was at least one in Haidt's talk), then there is at least some degree of both self-selection and discrimination, meaning that a debate over what statistically causes underrepresentation misses the point.  Bear in mind that these are not just data points, but actual human beings.  One human being discriminated against is one human being we could serve better, even if the vast majority of under-representation is due to self-selection.

I'm obviously biased in the above debate, but these thoughts are not a response to that debate, but rather a response to almost every debate and decision I see in psychology.  Some other things that are continuous, and not categorical:

Journal Publication - Editors have to make categorical decisions to accept or reject papers, yet many papers that are accepted never get cited, while other papers are published through sheer persistence down the chain of  journal prominence.

Statistical Significance - A 94.9% chance of being right is not that different than a 95.1% chance of being right, yet it is treated as a categorical distinction called "significance" because we need to be able to say whether something is true or not, when in reality, all we have is some evidence toward the truth, that varies to some degree.  Even the best paper does not definitively prove anything and even the worst paper is some evidence toward something.

Authorship - Many people work on papers (often undergraduate research assistants) and are not authors, while others do fairly little and receive authorship.  Sometimes the first author does 90% of the work and sometimes they do 51%.  Yet they still receive the categorical distinction of first author.

Psychological conditions - Few psychological clinical conditions are categorical.  In reality, people have some degree of anxiety, rather than having or not having an anxiety disorder.  Yet, for insurance reasons, people have to be diagnosed categorically as having a particular condition.

Psychological constructs - Is shame the same as guilt or different?  Is shame the same as sadness?  Is shame the same as happiness?  The truth is that shame is somewhat like some of these constructs and less like others of these constructs.  Categorical distinctions between such constructs are useful for publications, but don't really reflect the continuous nature of the real world.
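The point about statistical significance above can be made concrete with a short sketch. It assumes a simple two-sided test on a standard normal statistic, and loosely reads "a 95.1% chance of being right" as a p-value just under 0.05. The two z values are hypothetical, picked only because they land on either side of the threshold.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

# Two hypothetical studies with nearly identical evidence.
for z in (1.95, 1.97):
    p = two_sided_p(z)
    label = "significant" if p < 0.05 else "not significant"
    print(f"z={z:.2f}  p={p:.4f}  {label}")
```

The two results differ only slightly in strength of evidence, yet the conventional cutoff gives them opposite labels.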

I am sure that if I thought more, I could come up with many more examples of things that are continuous, but treated as categorical. In academia, perhaps we can eventually change our systems, leveraging technology, to acknowledge the continuous nature of things.  My real-world hope, as someone who believes that a world with less conflict is better than a world with more conflict, is that perhaps seeing things as continuous, rather than categorical, means that people will be less likely to make harsh judgments of others based on the idea that their beliefs are the categorical caricatures that we make them out to be.

- Ravi Iyer

10Feb/11

Can liberal academics study conservative ideology?

Recently, Jon Haidt gave a talk at the main social psychology conference about the statistically impossible lack of diversity in social psychology, meaning that the vast majority of social psychologists are liberal, with a smattering of libertarians or moderates and close to zero self-identified conservatives. This talk was covered in this New York Times article by John Tierney, and it has inspired many social psychologists I know to some degree of introspection about our discipline.  It has also led many who read the article to wonder why there are so many liberals in academia.  Is it a question of discrimination?  Self-selection?

As someone who studies political psychology, I have two main self-serving thoughts.  First, findings in political psychology would support the idea that most of this is due to self-selection.  We know that liberals score higher on measures like openness to experience, challenging the status quo, enjoying effortful thinking, having existential angst (searching for meaning) and placing a value on stimulation.  All of these findings are published and replicated in our YourMorals dataset.  These are all traits that can be framed as positive (enjoying new things, wanting to be an agent of change) and negative (disrespecting tradition, being narcissistic) in the 'real world', but are useful in academia.  Personally, I could be earning more money and likely doing something more objectively useful, but I like the stimulation of working in the world of ideas and it helps ease my existential angst.  This cluster of traits describes some part of most academics I know.

If you see the actual talk (video below), you'll notice that Haidt presumes a fair degree of self-selection and does not set representativeness (e.g. 40% conservatives in the US means we should have 40% in psychology) as a goal, perhaps for this reason.

Still, much of the talk is about discrimination (e.g. the analogy of the closeted homosexual) and so I see why many bloggers might have picked up on the discrimination angle.  I am not saying that there is not some peer pressure exacerbated by the assumption that everyone in the room is liberal...but my experience is that self-selection causes that environment more than the reverse.  That does not mean it isn't a problem.  It is and we should do something about it.

The main problem, from the perspective of someone who wants to understand political attitudes and ideology, is that it's really hard to study something you have no experience with.  Imagine what a collective of non-parents would think of parenting from a completely outside perspective.  Giving up sleep, friends, leisure, and money for an infant that cannot even smile might seem delusional, which is exactly the way that some psychologists see conservative ideology...as a product of some kind of mental fault.  It is only from the inside that sometimes things make more sense.

Those of us who study ideology often have nobody on the inside of conservative movements to help us make sense of them.  It is for that reason that I'd love to see more research conducted by conservatives.  Conservatives don't just have different perspectives on politics, but also in all sorts of other domains.  Until then, I'll have to settle for befriending them wherever I can and plying them with liquor to get their inner thoughts.  As a liberal who wants to persuade conservatives, such understanding is essential, unless I simply want to cheerlead amongst people who already agree with me.

In some ways, it's part of a larger problem in psychology where we ask relatively inexperienced (outside of academia) individuals to theorize about the nature of human experience.  Business school students are expected to have business experience to get into business school, yet social psychologists often have very limited experience with human social life before investigating it.  Given that, is it any wonder that many people feel that memoirs offer as much insight into the human condition as psychology journals?  Having a diverse set of experiences and perspectives within political psychology can only make our work that much more interesting.

- Ravi Iyer

ps. you can read Jon's official piece along with many reactions of other more prominent psychologists on Edge.

1Apr/10

Nate Silver and Veronique de Rugy demonstrate how a more modern peer review process could work.

As someone who was in the dot-com world for years before entering academia, I've always felt that the peer review process could be made far more efficient.  While I'm not 100% sure what form that would take, it might look something like a recent exchange between Nate Silver, an Obama supporter who runs fivethirtyeight.com (which I read religiously during the 2008 election and which is the first site I turn to when I seek to interpret polling data), and Veronique de Rugy, an economist with a libertarian bent.

The timeline went something like this...

I imagine that both of them are right now crunching the numbers and figuring out some far more accurate interpretation than either of them would have come up with on their own. The best part is that if I wanted to, I could download the data myself and join in on the fun, perhaps merging in another data source if I so chose. Perhaps someone else is doing that right now too.

I found the exchange so intriguing that I took a break from working on a paper I'm writing about libertarian moral psychology (getting me to take a break actually isn't that hard, unfortunately). When I finish this paper, the timeline is likely to be something like the following:

  • I submit the paper to a journal.
  • 4 months later - I receive 2-3 reviews of my paper. If they liked it (~30%), I can edit the paper to respond to reviews and move to the next step.  If not, I go back to step 1.
  • 2 months later - I resubmit the paper.
  • 4 months later - If I'm lucky I may get the paper accepted (~30%), but more likely is that I have to do another round of edits which takes another few months or in rarer cases, the paper is rejected after this stage and I go back to step 1.
  • 2 years later - maybe 50-100 people have read my paper, which now contains an outdated literature review and dated conclusions.  If someone wants to challenge my results, their paper may come out around this time. Few people outside of academia can read my paper due to the need to subscribe to the journal in question. I can't update my paper and have to have a whole new set of findings rather than being able to add a single study or clarification to a part of the existing paper.
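To get a rough sense of how long the first two steps above can take, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every submission independently has the same ~30% chance of favorable reviews, that each round of reviews takes 4 months, and that a rejected paper is resubmitted elsewhere immediately. Real journals will not behave this neatly.

```python
ACCEPT_RATE = 0.30      # the "~30%" figure from the steps above
MONTHS_PER_REVIEW = 4   # wait for the first round of reviews

# Geometric distribution: expected attempts until first success is 1 / p.
expected_submissions = 1 / ACCEPT_RATE
expected_months_to_favorable_review = expected_submissions * MONTHS_PER_REVIEW

def chance_still_waiting(attempts: int, accept_rate: float = ACCEPT_RATE) -> float:
    """Probability that every one of `attempts` submissions was turned down."""
    return (1 - accept_rate) ** attempts

print(f"expected submissions: {expected_submissions:.1f}")
print(f"expected months before favorable reviews: {expected_months_to_favorable_review:.1f}")
print(f"chance of no favorable review after 3 tries: {chance_still_waiting(3):.0%}")
```

Under these assumptions a paper needs a little over three submissions on average, which means more than a year can pass before the revise-and-resubmit stage even begins.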

Now the process that I described has its merits. It produces more carefully thought out work, reviewed in depth by experts in the field. It's probably essential in some areas, but its merits are dependent on the situation and I'm not so sure it's the best method for social science research that is supposed to be used by society in some timely fashion to have positive social benefit. Is that not the real goal of social scientists, rather than CV building?

As Nate Silver points out in his critique of de Rugy's piece, there is inherent unconscious bias that all social scientists encounter when they do any research. Peer reviewers don't reanalyze your data and they rely on your own description of methodology, so they really can't address many possible sources of bias, conscious or unconscious. All research is somewhere between a zero and one in terms of conclusiveness and it only moves close to a one after many people have replicated it, in my opinion, as research is inherently unreliable when you are dealing with people.

What if social scientists all self-published (maybe let's call it sharing rather than publication) on the internet? Overall quality would go down, no doubt. Sharing of replicated results, null findings, and perhaps most importantly, failures to replicate, would probably increase a lot though. Academia would lose a monopoly on research as anyone with a stats program could weigh in and data sharing would become the norm for controversial results. Also, separating the wheat from the chaff is a problem that computer scientists, Google, Digg, Slashdot, and countless others are continually solving. There is tons of research that gets published and then nobody ever cites it, so peer review couldn't have done that well at its gatekeeping process. What if "getting published" was no longer the standard for acceptability, but rather the number of positive votes/comments of the people who read the article, and you could continually edit and revise your article to make it better, linking to people who replicate your study and updating your literature review and conclusions to keep current? I could envision a post-sharing review system that would actually improve quality by making the review process completely open and transparent, giving extra credit to those whose data has been re-analyzed independently, replicated by others, and read by experts.

There are a million considerations I'm probably leaving out right now, both positive and negative, but given the way that social science data is being generated and the pace the world is moving, it seems unlikely that the peer review process can resist these disruptive forces. Right now, the peer review process confounds sharing research with praising the research in question and maybe there are ways to separate the two goals so that they don't have to happen simultaneously.