Five ways that technology will democratize social science
I currently work as both a researcher at USC and as the Director of Data Science at Ranker.com. Some people would consider these two roles to be only loosely related, but I’m finding that there is a lot of overlap. Technological methods are increasingly of use in social science at the same time as social science methods are being imported into technology companies, many of which are now trying to create statistical models to predict behavior. As more and more data on human behavior and thought is collected by technology companies, as opposed to university researchers, it seems inevitable that social science itself will be changed.
Technology has not just changed, but disrupted, every other dominant form of information distribution that previously existed, be it the distribution of music (iTunes), news (Huffington Post), books (Amazon), TV (Hulu), gossip (Twitter), jokes (Cheezburger), language (c u l8r), family news (Facebook), or education (TED talks or the Khan Academy). While academia is called the ivory tower for a reason, it seems unlikely that it will escape this wave of change, especially given the fact that the biggest technology companies collect far more data on human thought and behavior in a day than all of academia collects in a year.
Here are five specific ways that I believe technology will change social science:
1. Bigger, ecologically valid data sets - The only thing that separates social science from opinion is the use of data, and with more data come more confident findings. There is currently some debate in social psychology about methodological practices that can lead to false positive results by taking advantage of chance. For example, statistical significance is conventionally defined, in many sciences, as a result that would occur less than 5% of the time by chance alone if there were no real effect. That sounds impressive, but if 200 researchers each test an effect that does not exist, about 10 of them can expect to find a “significant” result by sheer chance. As data sets get bigger and bigger, standards for “significance” can become more and more stringent, and the chance of such errors will become lower and lower. In addition, most of this new data will be collected in real world environments, meaning that there will be less of a logical leap than there is when inferring a real world phenomenon from the results of a lab study.
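The arithmetic behind the false-positive concern in point 1 can be sketched in a few lines of Python. This is only an illustration: it assumes that every study tests an effect that does not exist, that the studies are independent, and that each uses the conventional 0.05 threshold. The function names are my own and do not come from any library.

```python
def expected_false_positives(n_studies: int, alpha: float = 0.05) -> float:
    """Average number of studies that reach significance when no real effect exists."""
    return n_studies * alpha


def chance_of_at_least_one(n_studies: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent null studies reaches significance."""
    return 1 - (1 - alpha) ** n_studies


if __name__ == "__main__":
    # Assumed scenario from the text: 200 researchers, threshold of 0.05.
    print(expected_false_positives(200))
    print(chance_of_at_least_one(200))
```

Under these assumptions, roughly ten of the 200 studies come out “significant” on average, and it is close to certain that at least one of them will. Lowering `alpha` is the code equivalent of a more stringent standard.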
2. Cross-sample Validation – With more data comes the possibility of dividing a dataset into many parts (e.g. by referral URL) and replicating research in many datasets. To do this efficiently will require a technology we use a lot at Ranker, the semantic web. The semantic web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. Right now, researchers cross-validate findings through a painstaking process called meta-analysis, whereby interested parties attempt to reconcile various datasets into a standard format. Of course, because there are no standards, each reconciliation is a one-time, throwaway process, and the data extracted by one researcher is unusable by subsequent researchers. Google Scholar is starting the process of taking “dumb text” in papers and creating some metadata; my Scholar page, for example, contains extractions of dates, authors, and citations from papers I’ve written. If there were a standard format for describing the data within the papers, there is no reason why that data couldn’t be extracted as well, allowing us to answer questions like “what is the correlation between openness to experience and ideology in data collected before and after 2005?” without having to read all those papers. It would also let people share simple findings like “the correlation between openness to experience and ideology at place X and time Y is Z,” which are completely lost now.
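To make the idea in point 2 concrete, here is a rough Python sketch of what a machine-readable finding might look like, and how the example question could then be asked without reading the papers. No such standard is described in this post, so the `Finding` fields and the `pooled_correlation` function are hypothetical, and every number in the sample records is an invented placeholder rather than a real result.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Finding:
    """One reported correlation, described in a shared (hypothetical) format."""
    variable_a: str
    variable_b: str
    r: float      # reported correlation
    n: int        # sample size
    place: str    # where the data were collected
    year: int     # when the data were collected


def pooled_correlation(findings: Iterable[Finding], a: str, b: str,
                       year_from: Optional[int] = None,
                       year_to: Optional[int] = None) -> Optional[float]:
    """Sample-size-weighted mean correlation between a and b within a year range.

    A deliberately simple pooling rule; a real meta-analysis would do more.
    Returns None when no finding matches.
    """
    total_n = 0
    weighted_sum = 0.0
    for f in findings:
        if {f.variable_a, f.variable_b} != {a, b}:
            continue  # a different pair of variables
        if year_from is not None and f.year < year_from:
            continue
        if year_to is not None and f.year > year_to:
            continue
        total_n += f.n
        weighted_sum += f.r * f.n
    return weighted_sum / total_n if total_n else None


# Placeholder records: these values are made up for illustration only.
SAMPLE = [
    Finding("openness", "ideology", 0.20, 100, "place X", 2003),
    Finding("ideology", "openness", 0.30, 300, "place Y", 2004),
    Finding("openness", "ideology", 0.10, 200, "place Z", 2008),
]

if __name__ == "__main__":
    print(pooled_correlation(SAMPLE, "openness", "ideology", year_to=2004))
    print(pooled_correlation(SAMPLE, "openness", "ideology", year_from=2005))
```

The point of the sketch is the record format, not the pooling rule: once findings are stored this way, the "before and after 2005" comparison becomes two filtered queries over the same list.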
3. Adding inter-disciplinary analysis and variables – Right now, social science is balkanized. Every discipline has its own methodologies and opinions about what is or is not the right way to do things. Personality psychologists care a lot about measurement while political scientists care about sampling. Social psychologists create brilliant, artificial, controlled lab experiments designed to isolate variables, while technology companies mine free form, uncontrolled data seeking exploratory patterns. Qualitative methods have a richness and depth that are scoffed at by more quantitative researchers. All of these methodologies have error, and the methodologies within any one discipline share the same kinds of error, such that they all would be improved by adding the techniques of other disciplines. But as long as there are no standards for data (e.g. the semantic web), reconciling this data would require immense human effort. Further, the lack of standards means that we never have the full picture of human thought and behavior. Psychologists may study risk tolerance and variable A, and financial analysts may study risk tolerance and variable B, which might lead to a natural hypothesis as to the relationship between A and B. But since psychologists are not interested in B and financial analysts do not care about A, nobody reconciles this data. Real world human behavior usually involves the variables in ALL disciplines, yet each discipline often contents itself with its own slice of a human being. Semantic technology will eventually allow us to put these slices together.
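The risk-tolerance example in point 3 amounts to a simple graph query once findings from different disciplines share variable names. The sketch below is a toy illustration in Python: the variable labels and the `suggest_hypotheses` function are assumptions of mine, and real semantic-web tooling would rely on shared identifiers rather than plain strings.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def suggest_hypotheses(
    studied_pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (x, y, shared) triples where x and y were each studied alongside
    `shared` but, in the pooled record, never alongside each other."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in studied_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    suggestions = []
    for shared in sorted(neighbors):
        # Every pair of variables linked to the same shared variable is a candidate.
        for x, y in combinations(sorted(neighbors[shared]), 2):
            if y not in neighbors[x]:
                suggestions.append((x, y, shared))
    return suggestions


# Hypothetical pooled record: one pair from psychology, one from finance.
PAIRS = [
    ("risk tolerance", "variable A"),  # studied by psychologists
    ("risk tolerance", "variable B"),  # studied by financial analysts
]

if __name__ == "__main__":
    for x, y, shared in suggest_hypotheses(PAIRS):
        print(f"Unstudied pair: {x} and {y} (both linked to {shared})")
```

With a shared vocabulary, the A-and-B hypothesis that neither discipline would look for falls out of the combined record automatically.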
4. Systems level approaches – Of course, putting together the results of semantic datasets, which combine hundreds of variables and many bi-directional connections, all with varying degrees of confidence arrived at through various methodological and sampling techniques, is not easy using the traditional paper format. The end result of such an approach is often a system or a model, of the type that computer scientists build, rather than a paper. Some psychologists are putting together connectionist models, but the expertise to actually do such things lies in technology circles more than in the social science community.
5. An open knowledge base - The internet hates middlemen, and right now, academic publishers are middlemen who control the flow of information under the outdated idea that people read printed editions of journals devoted to specific limited topics with limited pages. The goal of the editorial process is to separate truth from untruth through peer review, which is laudable but completely impractical, as evidence exists along a continuum instead of being categorically true or untrue. There are so many peer-reviewed journals that anything can get the stamp of “truth”. Unlike in physics or chemistry, a single paper’s worth of evidence, no matter where it appears, is never conclusive in social science. Big controversies exist in social science even about things where there are tons of very well-done papers on the subject, each of which is ostensibly the truth, or else it shouldn’t have been published, right? The reality of social science is that the best we can do is to sum up all the evidence from all the various data collected, hopefully using various methodologies (again, something the semantic web can help with), and get a bigger picture of how robust any finding is. However, since peer review checks for importance, topicality, novelty, and a host of other subjective factors, not to mention a journal’s bias against replications and null findings, the current process actually ends up hiding the true sum of all evidence for any finding. That is how prominent, blatantly false findings can exist in the literature for years undetected. Further, since journals require high subscription fees from universities (whose employees, ironically, do all the work for the journal), only people at first world universities can even see this evidence. Whether you agree with my hypothesis or not, the current system is simply unsustainable given the mountain of data that is coming and the ethos of Silicon Valley, where “publish, then review/filter/aggregate” is the dominant model.
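One common way to “sum up all the evidence” for a correlation is a fixed-effect meta-analysis using Fisher’s z transformation. The Python sketch below shows that technique, not a method used by any particular journal or company. It assumes independent studies that each report a Pearson correlation and a sample size above 3, and the function name `pool_correlations` is my own.

```python
import math
from typing import Iterable, Tuple


def pool_correlations(
    studies: Iterable[Tuple[float, int]],
) -> Tuple[float, float, float]:
    """Fixed-effect pooled correlation with an approximate 95% interval.

    Each study is (r, n). Correlations are converted with Fisher's z,
    weighted by n - 3 (the inverse of z's sampling variance), averaged,
    and converted back to the correlation scale.
    Returns (estimate, lower bound, upper bound).
    """
    weight_total = 0.0
    z_total = 0.0
    for r, n in studies:
        if n <= 3 or not -1 < r < 1:
            raise ValueError("each study needs n > 3 and -1 < r < 1")
        weight = n - 3
        z_total += weight * math.atanh(r)
        weight_total += weight
    if weight_total == 0:
        raise ValueError("no studies supplied")

    z_mean = z_total / weight_total
    half_width = 1.96 / math.sqrt(weight_total)  # normal approximation
    return (
        math.tanh(z_mean),
        math.tanh(z_mean - half_width),
        math.tanh(z_mean + half_width),
    )
```

The estimate is only as good as its inputs: if replications and null findings never make it into the list of studies, the pooled value inherits that bias, which is the problem with the current publication filter described above.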
As more and more data on human behavior and thought is published by companies like Hunch, OkCupid, Ranker and the Facebook data team, the traditional social science system will necessarily adapt to these methods or become largely irrelevant next to these larger, more ecologically valid, robust, and complex datasets.
In summary, social scientists are incredibly smart about what they do, most more so than I, and there is a lot that technologists can learn from social science methods. Indeed, on March 11, I’ll be giving a talk at SXSW about how much technologists can benefit from social science methods, especially as they relate to serving the intangible needs of employees and customers.
However, there are countless ways that social scientists can benefit from technology as well. Human beings have been studying the human condition for thousands of years, and the idea that a select group of humans can use their special methodology to go off into an ivory tower, figure things out, and then inform the rest of us what the truth is, is an unlikely scenario. Or perhaps more correctly, it is a common scenario that has played out throughout history with no actual impact on our collective understanding. If we really want to make an impact on our collective understanding of ourselves, it will take a collective effort from social scientists and internet professionals, quantitative and qualitative researchers, novelists and political scientists, and even the kid who surveys a 3rd grade class, whose data contributes to our collective understanding too. It is my proposition that technology, and specifically the semantic web, may finally allow such a collaboration to occur.
- Ravi Iyer