March 6, 2012
I currently work as both a researcher at USC and as the Director of Data Science at Ranker.com. Some people would consider these two roles to be somewhat tangential, but I'm finding that there is a lot of overlap. Technological methods are increasingly of use in social science at the same time as social science methods are being imported into technology companies, many of which are trying to create statistical models to predict behavior. As more and more data on human behavior and thought is collected by technology companies, as opposed to university researchers, it seems inevitable that social science itself will be changed.
Technology has not just changed, but disrupted, every other dominant form of information distribution that previously existed, be it the distribution of music (iTunes), news (Huffington Post), books (Amazon), TV (Hulu), gossip (Twitter), jokes (Cheezburger), language (c u l8r), family news (Facebook), or education (TED talks or the Khan Academy). While academia is called the ivory tower for a reason, it seems unlikely that it will escape this wave of change, especially given that the biggest technology companies collect far more data on human thought and behavior in a day than all of academia collects in a year.
Here are five specific ways that I believe technology will change social science:
1. Bigger, ecologically valid data sets - The only thing that separates social science from opinion is the use of data, and with more data come more confident findings. There is currently some debate in social psychology about methodology that can sometimes lead to false positive results by taking advantage of chance. For example, statistical significance is defined, in many sciences, as a result that would arise by chance less than 5% of the time if there were no real effect. That sounds impressive, but if 200 researchers each test an effect that does not exist, about 10 of them can expect to find a "significant" result by sheer chance. As data sets get bigger and bigger, the chance of error will become lower and lower, with standards for "significance" getting more and more stringent. In addition, most of this new data will be collected in real world environments, meaning that there will be less of a logical leap than when inferring some real world phenomenon from the results of a lab study.
2. Cross-sample validation - With more data comes the possibility of dividing a dataset into many parts (e.g. by referral URL) and replicating research in many datasets. To do this efficiently will require a technology we use a lot at Ranker, the semantic web. The semantic web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. Right now, researchers cross-validate findings through a painstaking process called meta-analysis, whereby interested parties attempt to reconcile various datasets into a standard format. Of course, because there are no standards, each reconciliation is a one-time throwaway process, and the data one researcher extracts is unusable by subsequent researchers. Google Scholar is starting the process of taking "dumb text" in papers and creating some metadata: my Scholar page contains extractions of dates, authors, and citations from papers I've written. If there were a standard format for describing the data within the papers, there is no reason why those couldn't be extracted as well, allowing us to answer questions like "what is the correlation between openness to experience and ideology in data collected by people before and after 2005?" without having to read all those papers. It would also let people share simple findings like "the correlation between openness to experience and ideology at place X and time Y is Z," which are completely lost now.
3. Adding inter-disciplinary analysis and variables - Right now, social science is balkanized. Every discipline has its own methodologies and opinions about what is or is not the right way to do things. Personality psychologists care a lot about measurement while political scientists care about sampling. Social psychologists create brilliant, artificial, controlled lab experiments designed to isolate variables, while technology companies mine free form, uncontrolled data seeking exploratory patterns. Qualitative methods have a richness and depth that is scoffed at by more quantitative researchers. All of these methodologies have error, and the methodologies within any one discipline share error, such that they all would be improved by adding the techniques of other disciplines. But as long as there are no standards for data (e.g. the semantic web), reconciling this data would require immense human effort. Further, the lack of standards means that we never have the full picture of human thought and behavior. Psychologists may study risk tolerance and variable A and financial analysts may study risk tolerance and variable B, which might lead to a natural hypothesis as to the relationship between A and B. But since psychologists are not interested in B and financial analysts do not care about A, nobody reconciles this data. Real world human behavior usually involves the variables in ALL disciplines, yet each discipline often contents itself with its own slice of a human being. Semantic technology will eventually allow us to put these slices together.
4. Systems level approaches - Of course, putting together the results of semantic datasets, which combine hundreds of variables and many bi-directional connections, all with varying degrees of confidence arrived at through various methodological and sampling techniques, is not easy using the traditional paper format. The end result of such an approach is often a system or a model, of the type that computer scientists build, rather than a paper. Some psychologists are putting together connectionist models, but the expertise to actually do such things lies in technology circles more than in the social science community.
5. An open knowledge base - The internet hates middlemen, and right now, academic publishers are middlemen who control the flow of information under the outdated idea that people read printed editions of journals devoted to specific limited topics with limited pages. The goal of the editorial process is to separate truth from untruth through peer review, which is a laudable but completely impractical goal, as evidence exists along a continuum instead of being categorically true or untrue. There are so many peer-reviewed journals that anything can get the stamp of "truth". Unlike in physics or chemistry, a single paper's worth of evidence, no matter where it appears, is never conclusive in social science. Big controversies exist in social science even about subjects on which there are tons of very well-done papers, each of which is ostensibly the truth, or else it shouldn't have been published, right? The reality of social science is that the best we can do is to sum up all the evidence from all the various data collected, hopefully using various methodologies (again, something the semantic web can solve), and get a bigger picture of how robust any finding is. However, since peer review checks for importance, topicality, novelty, and a host of other subjective factors, not to mention a journal's bias against replications and null findings, the current process actually ends up hiding the true sum of all evidence for any finding. That is how prominent, blatantly false findings can exist in the literature for years undetected. Further, since journals require high subscription fees from universities (whose employees, ironically, do all the work for the journal), only people at first world universities can even see this evidence. Whether you agree with my hypothesis or not, the current system is simply unsustainable given the mountain of data that is coming and the ethos of Silicon Valley, where publish then review/filter/aggregate is the dominant model.
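The arithmetic behind the first point above (200 researchers, roughly 10 chance findings) can be sketched as follows. This is only an illustration, assuming every study tests an effect that does not exist, independently, at the same significance threshold; the function names are hypothetical.

```python
# Illustrative arithmetic only. Assumes each study independently tests a
# true null hypothesis at the same significance threshold (alpha).

def expected_false_positives(n_studies: int, alpha: float = 0.05) -> float:
    """Expected number of studies reaching significance by chance alone."""
    return n_studies * alpha


def prob_at_least_one(n_studies: int, alpha: float = 0.05) -> float:
    """Probability that at least one study reaches significance by chance."""
    return 1.0 - (1.0 - alpha) ** n_studies


if __name__ == "__main__":
    print(expected_false_positives(200))  # expected chance findings at alpha = 0.05
    print(prob_at_least_one(200))         # chance that at least one such finding appears
```

Under the same assumptions, a stricter threshold such as 0.005 would bring the expected number of chance findings among 200 studies down to about one.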
As more and more data on human behavior and thought is published by companies like Hunch, OkCupid, Ranker and the Facebook data team, the traditional social science system will necessarily adapt to these methods or become largely irrelevant next to these larger, more ecologically valid, robust, and complex datasets.
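To make the idea of a standard format for findings (the second point above) more concrete, here is a rough sketch of what a machine-readable finding and a query over such findings might look like. The `Finding` record, its field names, and the `select` helper are all hypothetical, and the example records are invented placeholders rather than real results. An actual semantic web approach would use shared vocabularies rather than an ad hoc class like this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One reported correlation, in a hypothetical shared format."""
    variable_a: str
    variable_b: str
    r: float      # reported correlation coefficient
    n: int        # sample size
    year: int     # year the data were collected
    source: str   # where the finding was reported


def select(findings, variable_a, variable_b, *, before=None, since=None):
    """Return findings linking the two variables, optionally filtered by year.

    `before` keeps years strictly earlier than the given year;
    `since` keeps the given year and later.
    """
    pair = {variable_a, variable_b}
    result = []
    for f in findings:
        if {f.variable_a, f.variable_b} != pair:
            continue
        if before is not None and f.year >= before:
            continue
        if since is not None and f.year < since:
            continue
        result.append(f)
    return result


# Placeholder records: invented values for illustration, NOT real results.
EXAMPLE_FINDINGS = [
    Finding("openness", "ideology", 0.10, 150, 2003, "hypothetical study A"),
    Finding("ideology", "openness", 0.20, 4000, 2008, "hypothetical study B"),
    Finding("neuroticism", "ideology", 0.05, 900, 2009, "hypothetical study C"),
]

if __name__ == "__main__":
    for f in select(EXAMPLE_FINDINGS, "openness", "ideology", before=2005):
        print("before 2005:", f)
    for f in select(EXAMPLE_FINDINGS, "openness", "ideology", since=2005):
        print("2005 onward:", f)
```

The point of the sketch is only that once findings share a structure, a question like "before and after 2005" becomes a filter rather than a reading list.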
In summary, social scientists are incredibly smart about what they do, most more so than I, and there is a lot that technologists can learn from social science methods. Indeed, on March 11, I'll be giving a talk at SXSW about how much technologists can benefit from social science methods, especially as they relate to serving the intangible needs of employees and customers.
However, there are countless ways that social scientists can benefit from technology as well. Human beings have been studying the human condition for thousands of years, and the idea that a select group of humans can use their special methodology to go off into an ivory tower, figure things out, and then inform the rest of us what the truth is, is an unlikely scenario. Or perhaps more correctly, it is a common scenario that has played out throughout history with no actual impact on our collective understanding. If we really want to make an impact on our collective understanding of ourselves, it will take a collective effort from social scientists and internet professionals, quantitative and qualitative researchers, novelists and political scientists, and even the kid who surveys their 3rd grade class, whose data contributes to our collective understanding too. It is my proposition that technology, and specifically the semantic web, may finally allow such a collaboration to occur.
- Ravi Iyer
April 19, 2010
I have recently been following a discussion in my discipline about the peer review process, which led me to this very interesting paper about the history of and alternatives to the peer review process in psychology.
At the same time, I've been working with colleagues on a paper about experiential vs. material purchasing styles, for which we have found convergent correlations, all suggesting that experiential purchasers are dispositionally motivated towards seeking new, stimulating experiences to promote positive emotion, while material purchasers often seek to avoid negative emotions. This is supported by the fact that, in the YourMorals.org dataset, experiential purchasers report higher levels of openness to experience, lower levels of neuroticism (both measured by the Big Five Personality Inventory), and lower levels of disgust (as measured by the Disgust Scale). The disgust finding does not necessarily fit with the idea that experiential purchasing is related to seeking new experiences, unless one looks at the literature on disgust. In particular, this study theorized about such a relationship and confirmed it by reporting correlations between disgust and Big Five personality dimensions.
It occurred to me that I could contribute to the original study's findings by examining the same correlations in our dataset, using a more diverse and far larger sample, and perhaps even including some internal cross-validation. The results are summarized in the table below.

Table: Disgust Scale Correlations with Big Five Personality Traits
The main hypothesis of the original study actually dealt with the two robust relationships found in our dataset, specifically that disgust is negatively related to openness to experience and positively related to neuroticism. In all, these two relationships stand out as robust across groups and in both studies. Interestingly, the correlation between openness to experience and disgust is weaker in the two most 'rational' groups, edge.org and libertarians, which might be worth pursuing later. Given the smaller sample size and restricted diversity of the original study, I'd be inclined to say that conscientiousness and agreeableness are not robust correlates of disgust, though this could be an effect of the fact that yourmorals.org uses a different measure of Big Five personality traits from the original study.
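For readers curious how correlations from several groups or studies might be combined into one estimate, here is a minimal sketch of one standard approach: a fixed-effect pooling of correlations using Fisher's z transform, weighted by sample size. This is not the analysis behind the table above. The function name `pooled_correlation` is hypothetical, and any numbers passed to it in practice would need to come from the actual samples.

```python
import math


def pooled_correlation(samples):
    """Pool (r, n) pairs using Fisher's z transform, weighting each by n - 3.

    `samples` is an iterable of (correlation, sample_size) pairs.
    This is a fixed-effect estimate: it assumes all samples share one
    underlying correlation and differ only by sampling error.
    """
    total_weight = 0.0
    weighted_z = 0.0
    for r, n in samples:
        if n <= 3:
            raise ValueError("each sample needs n > 3")
        if not -1.0 < r < 1.0:
            raise ValueError("r must be strictly between -1 and 1")
        weight = n - 3  # inverse of the sampling variance of Fisher's z
        weighted_z += weight * math.atanh(r)  # atanh is the Fisher z transform
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no samples given")
    # Transform the weighted mean z back to the correlation scale.
    return math.tanh(weighted_z / total_weight)
```

Because the weights grow with sample size, a large and diverse sample pulls the pooled estimate toward its own value far more than a small one does, which is the intuition behind trusting replications in bigger datasets.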
Can I publish this finding? It's only correlational and says nothing about causality. It really doesn't say much that is new, but rather confirms the original study, more or less. Still, the 26 papers which cited the original study would be slightly improved if they could cite this finding as well, since it's the same basic study with a different (larger and more diverse) sample. This is where the discussion of the peer review system converges with this analysis. According to this paper, "many natural science fields operate on a norm that submissions should be accepted unless they are patently wrong." In contrast, psychology papers are often rejected, not because they are wrong, but because they are not interesting or novel enough.
The paper and the listserv discussion bring up many points related to this, but one that is relevant to this finding is that it is hard to build a cumulative science when you don't reward replication, but instead reward novelty. The end result is a series of slightly different perspectives on the same subjects, all named differently, where authors are constantly trying to come up with something new rather than building on something existing. This may help academics, but it makes it very difficult for these theories to be used in the real world. Any research on humans is likely flawed in some way. Can anybody do double-blind experiments on representative samples of people with behavioral measures? The public is wisely skeptical of any social science finding, as are academics...but the solution might lie in publishing more replications rather than in restricting the publication process toward the mythical goal of the perfect, novel study. No single study proves anything when dealing with research on people. It's the convergence of lots of studies that might potentially be convincing enough to outsiders.
- Ravi Iyer
P.S. If anyone wants to write this up and publish it traditionally, feel free to contact me.
April 1, 2010
As someone who was in the dot-com world for years before entering academia, I've always felt that the peer review process could be made far more efficient. While I'm not 100% sure what form that would take, it might look something like a recent exchange between Nate Silver, an Obama supporter who runs fivethirtyeight.com (which I read religiously during the 2008 election and which is the first site I turn to when I seek to interpret polling data), and Veronique de Rugy, an economist with a libertarian bent.
The timeline went something like this...
- March 2010 - de Rugy publishes a paper alleging that Democratic districts received more money than Republican districts from stimulus funds.
- April 1, 2010 @ 11am - Silver challenges her analysis on the grounds that she failed to take into account the fact that the districts receiving the most funds were state capitals, which ostensibly were supposed to send funds onwards.
- April 1, 2010 @ 4:42pm - de Rugy shares her data and concedes some points (including the need to check for capitals), while giving explanations for other points, maintaining her larger finding, and taking some offense at being accused of bias.
- April 1, 2010 @ 7:35pm - Silver responds to her response, praising de Rugy for her openness, tempering his accusation of bias as the sort of unconscious bias that all social scientists have, and perhaps finding a middle ground in conceding that there may be some unconscious bias effects or particular project effects which account for her initial finding, which may or may not survive the inclusion of state capital status as a control variable.
I imagine that both of them are right now crunching the numbers and figuring out some far more accurate interpretation than either of them would have come up with on their own. The best part is that if I wanted to, I could download the data myself and join in on the fun, perhaps merging in another data source if I so chose. Perhaps someone else is doing that right now too.
I found the exchange so intriguing that I took a break from working on a paper I'm writing about libertarian moral psychology (getting me to take a break actually isn't that hard, unfortunately). When I finish this paper, the timeline is likely to be something like the following:
- I submit the paper to a journal.
- 4 months later - I receive 2-3 reviews of my paper. If they liked it (~30%), I can edit the paper to respond to reviews and move to the next step. If not, I go back to step 1.
- 2 months later - I resubmit the paper.
- 4 months later - If I'm lucky I may get the paper accepted (~30%), but more likely I have to do another round of edits, which takes another few months, or, in rarer cases, the paper is rejected after this stage and I go back to step 1.
- 2 years later - maybe 50-100 people have read my paper, which now contains an outdated literature review and dated conclusions. If someone wants to challenge my results, their paper may come out around this time. Few people outside of academia can read my paper due to the need to subscribe to the journal in question. I can't update my paper and have to have a whole new set of findings rather than being able to add a single study or clarification to a part of the existing paper.
Now the process that I described has its merits. It produces more carefully thought out work, reviewed in depth by experts in the field. It's probably essential in some areas, but its merits are dependent on the situation and I'm not so sure it's the best method for social science research that is supposed to be used by society in some timely fashion to have positive social benefit. Is that not the real goal of social scientists, rather than CV building?
As Nate Silver points out in his critique of de Rugy's piece, there is inherent unconscious bias that all social scientists encounter when they do any research. Peer reviewers don't reanalyze your data and they rely on your own description of methodology, so they really can't address many possible sources of bias, conscious or unconscious. All research is somewhere between a zero and one in terms of conclusiveness and it only moves close to a one after many people have replicated it, in my opinion, as research is inherently unreliable when you are dealing with people.
What if social scientists all self-published (maybe let's call it sharing rather than publication) on the internet? Overall quality would go down, no doubt. Sharing of replicated results, null findings, and perhaps most importantly, failures to replicate, would probably increase a lot though. Academia would lose its monopoly on research, as anyone with a stats program could weigh in, and data sharing would become the norm for controversial results. Also, separating the wheat from the chaff is a problem that computer scientists, Google, Digg, Slashdot, and countless others are continually solving. There is tons of research that gets published and then nobody ever cites it, so peer review couldn't have done that well at its gatekeeping process. What if "getting published" was no longer the standard for acceptability, but rather the number of positive votes/comments from the people who read the article, and you could continually edit and revise your article to make it better, linking to people who replicate your study and updating your literature review and conclusions to keep current? I could envision a post-sharing review system that would actually improve quality by making the review process completely open and transparent, giving extra credit to those whose data has been re-analyzed independently, replicated by others, and read by experts.
There are a million considerations I'm probably leaving out right now, both positive and negative, but given the way that social science data is being generated and the pace at which the world is moving, it seems unlikely that the peer review process can resist these disruptive forces. Right now, the peer review process confounds sharing research with praising the research in question, and maybe there are ways to separate the two goals so that they don't have to happen simultaneously.