Did you know that Ranker is one of the top 100 web destinations for mobile per Quantcast, ahead of household names like The Onion and People magazine? We are ranked #520 in the non-mobile world. Why do we do better with mobile users than with desktop users? I’ve made this argument for a while, but I’m hardly an authority, so I was heartened to see Google making a similar argument.
This embrace of mobile computing impacts search behavior in a number of important ways.
First, it makes the process of refining search queries much more tiresome. …While refining queries is never a great user experience, on a mobile device (and particularly on a mobile phone) it is especially onerous. This has provided the search engines with a compelling incentive to ensure that the right search results are delivered to users on the first go, freeing them of laborious refinements.
Second, the process of navigating to web pages is a royal pain on a hand-held mobile device.
This situation provides a compelling incentive for the search engines to circumvent additional web page visits altogether, and instead present answers to queries – especially straightforward informational queries – directly in the search results. While many in the search marketing field have suggested that the search engines have increasingly introduced direct answers in the search results to rob publishers of clicks, there’s more than a trivial case to be made that this is in the best interest of mobile users. Is it really a good thing to compel an iPhone user to browse to a web page – which may or may not be optimized for mobile – and wait for it to load in order to learn the height of the Eiffel Tower?
As a result, if you ask your mobile phone for the height of a famous building (Taipei 101 in the below case), it doesn’t direct you to a web page. Instead it answers the question itself.
That’s great for a question that has a single answer, but an increasing number of searches are not for objective facts with a single answer, but rather for subjective opinions where a ranked list is the best result. Consider the below chart showing the increase in searches for the term “best”. A similar pattern can be found for most any adjective.
So if consumers are increasingly doing searches on mobile phones, requiring a concise list of potential answers to questions with more than one answer, they are naturally going to end up at sites that have ranked lists…like Ranker. As such, a lot of Ranker’s future growth is likely to parallel the growth of mobile and the growth of searches for opinion-based questions.
- Ravi Iyer
Very few days go by without a new article describing the limits of published scientific research. The headline cases are about scientists who plagiarize or completely fabricate data. Yet, in my experience, most scientists are actually quite ethical, meticulous, hard-working, and genuinely concerned with finding the truth. Still, non-scientists would likely be surprised to learn that a large number of scientific studies are actually false. An Amgen study found that 46 out of 53 studies with 'landmark' findings could not be replicated. A team at Bayer found a slightly more optimistic picture, where 43 out of 65 studies revealed inconsistencies when tested independently. Scientific journals continue to accept articles based on the novelty and projected impact of the submission, yet simulations illustrate how this bias toward publishing novel results likely leads to an environment where most published results are actually false. My home discipline of psychology is currently doing some soul-searching: it is a relatively open secret that many results are difficult to reproduce, and a systematic reproducibility project is now underway.
Crowdsourcing is, and always has been, the solution. Indeed, the phrase at the bottom of Google Scholar, "standing on the shoulders of giants", acknowledges that science has always been about crowdsourcing, as every scholar is collaborating with the scholars who came before them. Findings are not produced in a vacuum; they build upon (or challenge) previous findings. Replication by others, which effectively crowdsources verification of results, is at the heart of the scientific method. It is perhaps a sign of the narcissism of our age that scientists convince themselves they discover things largely independently, such that they feel compelled to attack when their findings are challenged. Yet a willingness to be wrong about something is essential to learning, as we can't learn to walk without falling or learn about relationships without heartbreak. When science becomes more about ego, career, and grant money, it naturally becomes less accurate. Insisting that findings be crowdsourced solves this. No single study, paper, or research group can prove anything by itself.
Crowdsourcing is not simply averaging the opinions of the masses, as those who argue against that straw man would have you believe. Mathematically, crowdsourcing is about reducing the influence of sources of error, and there is a great deal of academic research on this topic. A good crowdsourcing algorithm does not weight all inputs equally, but instead seeks to identify clustered sources of error, which explains why aggregating across people with diverse personalities, perspectives, or job functions produces better results. Inputs need to contain some signal amid the noise, and their errors need to be uncorrelated. The unfortunate assumption in most research is that error is uncorrelated statistical noise that can be dealt with using statistical tests. Yet error also occurs due to the unconscious biases of researchers, the sheer number of researchers trying to find novel findings, the degrees of freedom a researcher has in trying to prove their hypothesis, the non-randomness of sampling, and the volume of available statistical tests a researcher can choose from. Given all these other sources of error, it is no wonder that many findings are false. A good crowdsourcing algorithm would be weighted such that true results would have to be shown by multiple researchers using multiple methods, multiple samples, multiple statistical tests, and multiple paradigms. This requires crowdsourcing, as no single person can do all this, and even if they could, they would still represent a single source of error.
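To illustrate that last point, here is a minimal Python sketch (with made-up numbers, not any actual crowdsourcing algorithm) of why an aggregate that accounts for clustered error beats a naive average when several inputs share a source of error:

```python
import numpy as np

# Hypothetical estimates of the same quantity from ten sources.
# Sources 0-4 share a method (correlated error); 5-9 are independent.
estimates = np.array([6.1, 6.2, 6.0, 6.3, 6.1, 4.8, 5.3, 4.6, 5.1, 4.9])
clusters = np.array([0, 0, 0, 0, 0, 1, 2, 3, 4, 5])  # shared-method cluster ids

# A naive average weights every input equally, so the five
# correlated sources dominate the result.
naive = estimates.mean()

# A cluster-aware average gives each cluster one vote, so a shared
# source of error counts once rather than five times.
cluster_means = [estimates[clusters == c].mean() for c in np.unique(clusters)]
robust = np.mean(cluster_means)

print(round(naive, 2), round(robust, 2))  # the naive mean is pulled upward
```

Real algorithms infer the clusters from the data rather than being told them, but the principle is the same: diversity of method matters, not just the number of inputs.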
Technology enables crowdsourcing to be conducted far more efficiently, as has been proven by successful science crowdsourcing projects like GalaxyZoo, FoldIt, Seti@Home, and psychology's reproducibility project. Trends like citizen science, the quantified self, open access publishing, and interdisciplinarity improve the diversity of perspectives which mathematically improves the ability to find truth. Every meta-analysis result and Nate Silver's success in aggregating polls in the last election take advantage of the mathematical principles that underlie crowdsourcing, specifically the certainty that aggregating across sources of error produces more truth. In our daily lives, we all crowdsource knowledge that we are uncertain about, looking for confirmation from multiple independent sources when we are skeptical. This same skepticism serves scientists well and scientists should embrace being wrong, confident that the broader truth will be revealed when all data is aggregated intelligently and all perspectives are valued. Crowdsourcing is not some new technique that threatens to fundamentally change scientific research. Rather, it is an extension of the collective effort of knowledge aggregation that is the heart of science and scientists should embrace it as such.
- Ravi Iyer
Part of my job at Ranker is to talk to other companies about our data. While people often talk about how "big data" is revolutionizing everything, the reality of the data marketplace is that it still largely revolves around sales, marketing, and advertising. Huge infrastructures exist to make sure that the right ad for the right product gets to the right person, leveraging as much data as possible. For example, I recently presented at a data conference at the Westin St. Francis in San Francisco, which meant that I spent some time on their website. For the past few weeks, long after the conference, I've been getting ads specifically for the Westin St. Francis on various websites. At some level, this is an impressive use of data, but at another level, it's a failure, as I'm no longer in the market for a hotel room. The data to solve this problem is out there, as someone could have tracked my visit to the conference website, understood the date of the conference, and better understood my intent in visiting the Westin. However, this level of analysis doesn't scale well for an ad that costs pennies, and so nobody does this level of behavioral targeting.
I bring up this story because I believe it illustrates a difference in how people who think of themselves as businesspeople and people who think of themselves as technologists approach data. When talking about Ranker data, I often see this dichotomy. People who are more traditionally business-minded want a clear business reason to use data, while people who think of themselves as technologists seem more open to envisioning a world where data does all sorts of neat things that data should be used for. For example, I recently graphed opinions about beer, illustrating that Miller Lite drinkers were closer to Guinness drinkers than to Chimay drinkers. As a technologist, I'm certain that a world will soon exist where bartenders can use data about me and others like me (e.g. the beer graph) to recommend a beer. I don't worry as much about the immediate path from the conception of such data to monetization. I know that the beer graph should exist and I'm happy to help contribute to it, confident of my vision of the future.
This division between people who think like businesspeople and people who think like technologists is important for anyone who does business development or business-to-business sales, especially for those of us in the technology world where the lines are often blurry. Mark Zuckerberg is a CEO, but clearly he thinks like a technologist. My guess is that a lot of the CTOs of big companies actually think more like businesspeople than technologists. If I were trying to sell Mark Zuckerberg on something, I would try to sell him on how whatever I was offering could make a huge difference to something he cared about. I would sell the dream. But if I were selling a more traditional businessperson, I would try to sell the benefits versus the costs. I would have a detailed plan and sell the details.
I actually have a bit of data from YourMorals.org to support this assertion. We have started collecting data on visitors' professions and below I compare businesspeople to technologists on two of the Big Five personality dimensions that are said to underlie much of personality: Conscientiousness and Openness to Experience. As you can see, businesspeople are more conscientious (detail oriented, fastidious, responsible), while technologists score higher on openness which is indicative of enjoying exploring new ideas and thinking of new possibilities.
The reality is that every business needs a balance between those who are detail oriented and precise (Conscientious) and those who think about a vision for the future (Openness to Experience). Often, technologists who start a company will eventually hire professional businesspeople who provide this balance (e.g. Sheryl Sandberg or Eric Schmidt). Clearly, the best sales pitch will be both detailed and forward thinking. However, if you're talking to someone and have limited time and attention, considering whether you are speaking to someone who is more of a businessperson or more of a technologist may give you better insight into how to frame your pitch.
- Ravi Iyer
One of the strengths of Ranker’s data is that we collect such a wide variety of opinions from users that we can put opinions about a wide variety of subjects into a graph format. Graphs are useful as they let you go beyond the individual relationships between items and see overall patterns. In anticipation of Cinco de Mayo, I produced the below opinion graph of beers, based on votes on lists such as our Best World Beers list. Connections in this graph represent significant correlations between sentiment towards connected beers, which vary in terms of strength. A layout algorithm (force atlas in Gephi) placed beers that were more related closer to each other and beers that had fewer/weaker connections further apart. I also ran a classification algorithm that clustered beers according to preference and colored the graph according to these clusters. Click on the below graph to expand it.
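For readers who want to try something similar themselves, here is a rough sketch of the same pipeline using networkx in place of Gephi. The beers and correlation weights below are hypothetical stand-ins, not Ranker's data, and networkx's spring layout and modularity clustering are analogues of, not identical to, Force Atlas and the classifier used for the published graph:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy stand-in for sentiment correlations between beers (invented values).
edges = [
    ("Miller Lite", "Coors Light", 0.8), ("Coors Light", "Bud Light", 0.7),
    ("Guinness", "Newcastle", 0.6), ("Newcastle", "Heineken", 0.4),
    ("Stone IPA", "Chimay", 0.5), ("Chimay", "Guinness", 0.3),
]

G = nx.Graph()
G.add_weighted_edges_from(edges)

# Force-directed layout: strongly correlated beers land close together,
# loosely analogous to Gephi's Force Atlas.
pos = nx.spring_layout(G, weight="weight", seed=42)

# Modularity-based community detection, a stand-in for the clustering
# and coloring step.
communities = greedy_modularity_communities(G, weight="weight")
for c in communities:
    print(sorted(c))
```

With real data you would feed in pairwise vote correlations as edge weights and then hand the positions and cluster labels to a plotting library.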
One of the fun things about graphs is that different people will see different patterns. Among the things I learned from this exercise are:
- The opposite of light beer, from a taste perspective, isn’t dark beer. Rather, light beers like Miller Lite are most opposite craft beers like Stone IPA and Chimay.
- Coors Light is the light beer that is closest to the mainstream cluster. Stella Artois, Corona, and Heineken are also reasonable bridge beers between the main cluster and the light beer world.
- The classification algorithm revealed six main taste/opinion clusters, which I would label: Really Light Beers (e.g. Natural Light), Lighter Mainstream Beers (e.g. Blue Moon), Stout Beers (e.g. Guinness), Craft Beers (e.g. Stone IPA), Darker European Beers (e.g. Chimay), and Lighter European Beers (e.g. Leffe Blonde). The interesting parts about the classifications are the cases on the edge, such as how Newcastle Brown Ale appeals to both Guinness and Heineken drinkers.
- Seeing beers graphed according to opinions made me wonder if companies consciously position their beers accordingly. Is Pyramid Hefeweizen successfully appealing to the Sam Adams drinker who wants a bit of European flavor? Is Anchor Steam supposed to appeal to both the Guinness drinker and the craft beer drinker? I’m not sure if I know enough about the marketing of beers to know the answer to this, but I’d be curious if beer companies place their beers in the same space that this opinion graph does.
These are just a few observations based on my own limited beer drinking experience. I tend to be more of a whiskey drinker, and hope more of you will vote on our Best Tasting Whiskey list, so I can graph that next. I’d love to hear comments about other observations that you might make from this graph.
- Ravi Iyer
NYU, USC, UCLA, Yale, Juilliard, Columbia, and Harvard top the rankings.
Does USC or NYU have a better film school? “Big data” can provide an answer to this question by linking data about movies and the actors, directors, and producers who have worked on specific movies, to data about universities and the graduates of those universities. As such, one can use semantic data from sources like Freebase, DBPedia, and IMDB to figure out which schools have produced the most working graduates. However, what if you cared about the quality of the movies they worked on rather than just the quantity? Educating a student who went on to work on The Godfather must certainly be worth more than producing a student who received a credit on Gigli.
Leveraging opinion data from Ranker’s Best Movies of All-Time list in addition to widely available semantic data, Ranker recently produced a ranked list of the world’s 25 best film schools, based on credits on movies within the top 500 movies of all time. USC produces the most film credits by graduates overall, but when film quality is taken into account, NYU (208 credits) actually produces more credits among the top 500 movies of all time than USC (186 credits). UCLA, Yale, Juilliard, Columbia, and Harvard take places 3 through 7 on Ranker’s list. Several professional schools that focus on the arts also place in the top 25 (e.g. London’s Royal Academy of Dramatic Art), as do some well-located high schools (New York’s Fiorello H. LaGuardia High School & Beverly Hills High School).
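The underlying computation is essentially a join-and-count over linked records. Here is a toy sketch with invented people and ranks (not the actual Freebase/IMDB data, and simplified to one school per person):

```python
from collections import Counter

# Hypothetical linked records: (person, school) from an education table,
# and (person, movie_rank) from film credits.
alumni = [("alice", "NYU"), ("bob", "USC"), ("carol", "NYU"), ("dan", "UCLA")]
credits = [("alice", 12), ("alice", 480), ("bob", 35), ("carol", 77), ("dan", 900)]

TOP_N = 500  # only credits on the top 500 movies of all time count

# Join credits to schools, keeping only credits on top-ranked movies.
by_person = dict(alumni)
school_credits = Counter(
    by_person[person]
    for person, rank in credits
    if person in by_person and rank <= TOP_N
)
print(school_credits.most_common())  # NYU leads in this toy data
```

The real pipeline pulls the education and credit tables from semantic sources and the movie ranks from Ranker's opinion data, but the aggregation step is this simple in spirit.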
The World’s Top 25 Film Schools
- New York University (208 credits)
- University of Southern California (186 credits)
- University of California – Los Angeles (165 credits)
- Yale University (110 credits)
- Juilliard School (106 credits)
- Columbia University (100 credits)
- Harvard University (90 credits)
- Royal Academy of Dramatic Art (86 credits)
- Fiorello H. LaGuardia High School of Music & Art (64 credits)
- American Academy of Dramatic Arts (51 credits)
- London Academy of Music and Dramatic Art (51 credits)
- Stanford University (50 credits)
- HB Studio (49 credits)
- Northwestern University (47 credits)
- The Actors Studio (44 credits)
- Brown University (43 credits)
- University of Texas – Austin (40 credits)
- Central School of Speech and Drama (39 credits)
- Cornell University (39 credits)
- Guildhall School of Music and Drama (38 credits)
- University of California – Berkeley (38 credits)
- California Institute of the Arts (38 credits)
- University of Michigan (37 credits)
- Beverly Hills High School (36 credits)
- Boston University (35 credits)
“Clearly, there is a huge effect of geography, as prominent New York and Los Angeles based high schools appear to produce more graduates who work on quality films compared to many colleges and universities,” says Ravi Iyer, Ranker’s Principal Data Scientist, a graduate of the University of Southern California.
Ranker is able to combine factual semantic data with an opinion layer because Ranker is powered by a Virtuoso triple store with over 700 million triples of information that are processed into an entertaining list format for users on Ranker’s consumer-facing website, Ranker.com. Each month, over 7 million unique users interact with this data – ranking, listing, and voting on various objects – effectively adding a layer of opinion data on top of the factual data from Ranker’s triple store. The result is a continually growing opinion graph that connects factual and opinion data. As of January 2013, Ranker’s opinion graph included over 30,000 nodes with over 5 million edges connecting these nodes.
- Ravi Iyer
I want to invest in "big data stocks". After all, everyone is saying that big data is the future of health care, education, government, and business, and will literally change the world. As someone who works with data both as an academic at USC and as the principal data scientist at Ranker, I am exactly the type of person likely to make and believe such hyperbolic claims. I recently put money into my IRA and needed to invest it, and since I believe in investing in what I know, I naturally wanted to invest in our data-driven future.
Where should I invest? If you look around the internet, you'll find a number of recommendations from places like Forbes or The Street. The general consensus appears to be to take the "picks and shovels" approach to investing in big data, where you invest in the companies that make the tools that enable people to use data, rather than in the data itself. I'm writing this post because I think this is absolutely the wrong approach. I believe in investing in data, not in tools. Why do I believe that?
- My experience in academia has taught me that simple statistics and tools are often the most reliable. If there is signal to be detected, any analysis and/or tool should be able to find it. Many people turn to more complex statistics when they don't find the right relationship using simple statistics. In psychology, people are finding that the use of more complex models (e.g. covariates) is often an indicator that the study's results may be less likely to be reliable. Given the size of datasets that we often have in data science, we often don't need special statistical techniques to find relationships in data as we have so much statistical power that most tools and techniques should give you convergent results. Put simply, the tools matter less than the data.
- The most popular tools and techniques are often open source. You can do a lot with R, Python, Gephi, Mahout, etc.
- Yes, there are advantages to using particular distributions of open source tools (e.g. Hadoop distributions that come with particular features), but there are so many companies out there offering different flavors of products that do essentially the same thing, that I can't see how any particular company is going to be the next Apple or Google, in terms of stock growth. There are no barriers to entry in the tools market. Perhaps a company will be the next RedHat, which may be a fine business to be in, but I don't believe that that is the revolutionary wave that investors in big data stocks are looking for.
So what should you do if you want to invest in big data? Buy stock in companies that have the best, biggest, most unique sets of data and/or the most defensible ways of collecting that data. I invested my IRA money into Facebook, which has the biggest and best dataset of human behavior that has ever existed. I invest my academic time into scalable data collection projects such as YourMorals, BeyondThePurchase, and ExploringMyReligion, confident that that will lead to the most long-term knowledge. And I invest my professional time into Ranker, which has a scalable process for collecting an opinion graph that will be essential for the kinds of intelligent applications that big data futurists have been promising us.
Do you want to invest in big data? Generally, you'll get better returns if you invest your money, time, and energy in data, rather than in tools.
- Ravi Iyer
A number of data scientists have attempted to predict movie box office success from various datasets. For example, researchers at HP Labs were able to use tweets around the release date, plus the number of theaters a movie was released in, to predict 97.3% of the variance in movie box office revenue in the first weekend. The Hollywood Stock Exchange, which lets participants bet on box office revenues and infers a prediction, predicts 96.5% of the variance in opening-weekend box office revenue. Wikipedia activity predicts 77% of the variance, according to a collaboration of European researchers. Ranker runs lists of anticipated movies each year, often more than a year in advance, and so the question I wanted to analyze in our data was how predictive Ranker data is of box office success.
However, since the above researchers have already shown that online activity at the time of the opening weekend predicts box office success during that weekend, I wanted to build upon that work and see if Ranker data could predict box office receipts well in advance of opening weekend. Below is a simple scatterplot of results, showing that Ranker data from the previous year predicts 82% of variance in movie box office revenue for movies released in the next year.
The above graph uses votes cast in 2011 to predict revenues from our Most Anticipated 2012 Films list. While our data is not as predictive as Twitter data collected leading up to opening weekend, the remarkable thing about this result is that most votes (8,200 votes from 1,146 voters) were cast 7-13 months before the actual release date. I look forward to doing the same analysis on our Most Anticipated 2013 Films list at the end of this year.
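For the curious, a "percent of variance predicted" figure like the 82% above is the R² of a regression. Here is a minimal sketch with made-up vote and revenue numbers (Ranker's raw data is not published in this post):

```python
import numpy as np

# Hypothetical (votes in prior year, opening-year gross in $M) pairs.
votes = np.array([120, 340, 560, 800, 150, 410, 700, 90])
gross = np.array([55, 160, 250, 390, 80, 170, 330, 40])

# Ordinary least squares fit: gross ≈ a * votes + b
a, b = np.polyfit(votes, gross, 1)
pred = a * votes + b

# R² is the share of variance in revenue explained by the vote counts.
ss_res = np.sum((gross - pred) ** 2)
ss_tot = np.sum((gross - gross.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

The real analysis would use the full list of anticipated movies and their eventual grosses, but the R² computation is the same.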
- Ravi Iyer
A lot of the questions on Ranker are subjective, but that doesn’t mean that we cannot use data to bring some objectivity to this analysis. In the same way that Yelp crowdsources answers to subjective questions about restaurants and TripAdvisor crowdsources answers to subjective questions about hotels, Ranker crowdsources answers to a broader assortment of relatively subjective questions such as the Tastiest Pizza Toppings, the Best Cruise Destination, and the Worst Way to Die.
A few weeks ago, I did an informal talk on the Wisdom of Crowds approach that Ranker takes to crowdsource such answers at a Los Angeles bar as part of “Nerd Nite”. The gist of it is that one can crowdsource objective answers to subjective questions by asking diverse groups of people questions in diverse ways. Greater diversity, when aggregated effectively, enables the error inherent in answering any subjective question to be minimized. For example, we know intuitively that relying on only the young or only the elderly or only people in cities or only people who live in rural areas gives us biased answers to subjective questions. But when all of these diverse groups agree on a subjective question, there is reason to believe that there is an objective truth that they are responding to. Below is the video of that talk.
If you want to see a more formal version of this talk, I’ll be speaking at greater length on Ranker’s methodologies at the Big Data Innovation Summit in San Francisco this Friday.
- Ravi Iyer
I was recently asked about the Moral Foundations scores of those who are more concerned about the environment and so I analyzed the 15,522 individuals who took the Moral Foundations Scale on YourMorals.org and also answered a question on the Schwartz Values Scale concerning how much of a guiding principle of their life it was to "Protect the Environment". I limited this analysis to those who placed themselves on the liberal-conservative spectrum, so that I could also control both for ideology and extremity of ideology, to some degree. The results (beta weights controlling for other variables) of the regression analyses, predicting a desire to "Protect the Environment", are below.
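For readers unfamiliar with beta weights: they are the coefficients of a regression run on standardized variables, so each one reflects a predictor's unique contribution controlling for the others. A minimal sketch with simulated data (not the YourMorals sample, and with invented effect sizes):

```python
import numpy as np

# Simulate correlated predictors and an outcome (hypothetical effects).
rng = np.random.default_rng(0)
n = 1000
care = rng.normal(size=n)
ideology = 0.5 * care + rng.normal(size=n)   # predictors correlate
protect = 0.4 * care - 0.1 * ideology + rng.normal(size=n)

def betas(X, y):
    # Standardize everything, then fit OLS: the coefficients are beta
    # weights, i.e. each predictor's unique (partial) contribution.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return coef

b = betas(np.column_stack([care, ideology]), protect)
print(b.round(2))  # care's unique weight exceeds ideology's here
```

Because the predictors are correlated, neither beta equals the raw correlation; each is the contribution left over after controlling for the other, which is exactly the sense in which Care/Harm "predicts unique variance" in the analysis above.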
My initial intuition was that ideology would be the greatest predictor, given how political the issue has become, but it appears that the Care/Harm foundation actually predicts as much unique variance as ideological identification. From an intuitionist standpoint, this makes sense, as the specific care one feels for polar bears may drive one's values more than abstract concerns about the ocean's water level, similar to the way that charities appeal to emotions with specific cases of need rather than statistics. Still, a great deal of variance is indeed predicted by which ideological team you are on.
Also interesting to me was the significant, but small, negative relationship between ingroup loyalty and attitudes toward the environment. The item I used from the Schwartz Values Scale is part of a subscale designed to measure Universalism, which relates to Peter Singer's idea that we should expand our moral circles. While it is certainly possible to care both about one's smaller circle/family and one's larger circle/animals/trees, there is some tension there, especially in a world with limited resources where environmental choices that benefit the world at large may negatively impact one's local community.
There are certainly limitations to these results taken from a particular sample, so take them with a grain of salt. And there remains a healthy debate about which moral concerns are more central, so there certainly are moral concerns that may predict environmental attitudes that are not measured here. Still, these results converge well with what we see in the world. Environmentalists tend to be liberals who are particularly concerned about the welfare of distant others, perhaps expanding their moral circle to include animals, oceans, and trees.
- Ravi Iyer