What Psychologists Could Learn From Data Science About Exploratory Research
I recently attended the main conference for social psychologists, even as I’m slowly transitioning to think of myself less as an academic and more as a data scientist. Of course, “data science” is a pretty poor term, since all science has to do with data, but it serves a purpose: it signals that there are methods for answering questions with data that work regardless of the domain where the data was collected. There is no real reason why a person well trained in understanding and analyzing data can’t apply their techniques to medical data, sports data, psychological data, and online data. In fact, research on the wisdom of crowds suggests that any discipline would benefit from having its data analyzed in different ways, as colleagues within a discipline are likely to make correlated errors. This is certainly true in social psychology, where a common error has been the under-valuing of exploratory research.
To our credit, social psychologists are beginning to understand this. Many years after Paul Rozin published a great article on the need for more diverse ways of researching questions, psychologists are starting to accept the idea that exploratory research has value alongside the experimental methods that are so popular. Below is a picture from one of several such talks.
It’s great that psychologists are willing to consider exploratory approaches. However, I don’t think we necessarily need to pretend that we are starting from scratch. It seems like many psychologists want to simply let people fiddle with data in the haphazard ways they have been doing, label it exploratory, and then get on with “real” (confirmatory) research. This is an area where data science, with its emphasis on automatically and efficiently extracting well-supported insights from large datasets, has a big head start. What can data science offer psychologists?
- More efficient exploration. Running haphazard regressions until you find a good model is inefficient for a number of reasons. It takes a lot of human effort, and when you do find something, you have no real way to reproduce the procedure that led you to that result on a subsequent dataset. To put it in more practical terms, every psychologist who wants to run exploratory regressions should at least understand GLMnet (details of which I’ll put in a future post).
- Cross-validated exploration. Data scientists have given a lot of thought to how to be more sure that a result is true when one is testing so many hypotheses that one is bound to find something by chance. Cross-validation is not a cure-all, but then again, nor are relatively artificial lab studies. Certainly a cross-validated exploratory finding is more likely to be true than a non-cross-validated one. Broadly, just as well-designed experiments provide stronger evidence than less well-designed experiments, some exploratory findings provide stronger evidence than others. Of course, this last sentence will completely confound those who insist that journals can only publish “true” findings supported by p<.05 statistics, which leads me to my last point.
- Bayesian models of findings. There was a ton of talk about the problem of false positives, but the entrenched interests of the journal system (IMHO) inhibit the paradigm shift that is needed: to think of findings and papers as evidence as opposed to truth. Good publications are not true…they are merely stronger evidence. And rejected publications are rarely worthless. Rather, they may be weaker evidence, or may not shift prior beliefs to quite the same degree. Setting a high bar for publication is great for creating a tournament for job seekers, but it’s a terrible way to find truth in an age where data and research are ubiquitous. If you want to read a more detailed argument about this, I’d read Nate Silver’s book.
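To make the first two points concrete, here is a minimal sketch of efficient, cross-validated exploration in Python, using scikit-learn’s ElasticNetCV as a stand-in for R’s glmnet. The dataset is synthetic and the specific numbers are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic "exploratory" dataset: 200 participants, 50 candidate predictors,
# only 3 of which actually relate to the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_coefs = np.zeros(50)
true_coefs[[3, 17, 42]] = [1.5, -2.0, 1.0]
y = X @ true_coefs + rng.normal(scale=1.0, size=200)

# ElasticNetCV searches over regularization strengths via cross-validation,
# shrinking most coefficients to exactly zero -- an automated, reproducible
# alternative to running haphazard regressions by hand.
model = ElasticNetCV(l1_ratio=0.9, cv=5, random_state=0)
model.fit(X, y)

selected = np.flatnonzero(model.coef_)
print("Predictors retained:", selected)
```

The whole exploration is a single reproducible procedure: rerun it on a new sample and you get an honest replication attempt, rather than trying to remember which regressions you happened to run.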
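The evidence-not-truth framing can be made concrete with a simple Bayesian update. The prior probability and likelihood ratios below are invented numbers chosen only to illustrate the point:

```python
def posterior_prob(prior_prob, likelihood_ratio):
    """Update belief in a hypothesis given the strength of a new finding.

    likelihood_ratio = P(finding | hypothesis true) / P(finding | hypothesis false).
    A strong confirmatory study might have a ratio of 10; a weak exploratory
    result, maybe 2 -- weaker evidence, but evidence nonetheless.
    """
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * likelihood_ratio   # Bayes' rule in odds form
    return post_odds / (1 + post_odds)          # convert odds back to probability

# A skeptical 20% prior, updated by a strong study vs. a weak one:
print(posterior_prob(0.2, 10))  # strong evidence moves belief substantially
print(posterior_prob(0.2, 2))   # weak evidence still moves it, just less
```

On this view, a rejected paper with a likelihood ratio of 2 is not worthless; it simply shifts beliefs less than a stronger study would.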
There are some things that social psychologists are really good at. They understand experimental methods and can critique them really well. They understand measurement much better than most disciplines. But there are some things that other disciplines do much better with data, such as exploration. The banner of data science presents the opportunity to break down these barriers: the social psychologist can help the Google engineer design the perfect study to validate the results of their latest machine learning algorithm; the political scientist can help the social psychologist with representative sampling; and the Google engineer can help the political scientist explore the latest national survey far more efficiently, then mash up that data with more ecologically valid social media behavior. The end result is that there really isn’t a huge need for disciplinarity in an age of big data (which was a theme of Jamie Pennebaker’s presidential address at SPSP). It actually gets in the way of us all being data scientists.
- Ravi Iyer