Do Social Scientists Know What They're Talking About?

The world is lousy with experts. They are everywhere: opining in op-eds, prognosticating on television, tweeting out their predictions. These experts have currency because their opinions are, at least in theory, grounded in their expertise. Unlike the rest of us, they know what they’re talking about.

But do they really? The most famous study of political experts, led by Philip Tetlock at the University of Pennsylvania, concluded that the vast majority of pundits barely beat random chance when it came to predicting future events, such as the winner of the next presidential election. They spun out confident predictions but were never held accountable when their predictions proved wrong. The end result was a public sphere that rewarded overconfident blowhards. Cable news, Q.E.D.

While the thinking sins identified by Tetlock are universal - we’re all vulnerable to overconfidence and confirmation bias - it’s not clear that the flaws of political experts can be generalized to other forms of expertise. For one thing, predicting geopolitics is famously fraught: there are countless variables to consider, interacting in unknowable ways. It’s possible, then, that experts might perform better in a narrower setting, attempting to predict the outcomes of experiments in their own field.

A new study, by Stefano DellaVigna at UC Berkeley and Devin Pope at the University of Chicago, aims to put academic experts to this more stringent test. They assembled 208 experts from the fields of economics, behavioral economics and psychology and asked them to forecast the impact of different motivators on the performance of subjects performing an extremely tedious task. (They had to press the “a” and “b” buttons on their keyboard as quickly as possible for ten minutes.) The experimental conditions ranged from the obvious - paying for better performance - to the subtle, as DellaVigna and Pope also looked at the influence of peer comparisons, charity and loss aversion.  What makes these questions interesting is that DellaVigna and Pope already knew the answers: they’d run these motivational studies on nearly 10,000 subjects. The mystery was whether or not the experts could predict the actual results.

To make the forecasting easier, the experts were given three benchmark conditions and told the average number of presses, or “points,” in each condition. For instance, when subjects were told that their performance would not affect their payment, they only averaged 1521 points. However, when they were paid 10 cents for every 100 points, they averaged 2175 total points. The experts were asked to predict the number of points in fifteen additional experimental conditions.

The good news for experts is that these academics did far better than Tetlock’s pundits. When asked to predict the average points in each condition, they demonstrated the wisdom of crowds: their predictions were off by only 5 percent. If you’re a policy maker, trying to anticipate the impact of a motivational nudge, you’d be well served by asking a bunch of academics for their opinions. 

The bad news is that, on an individual level, these academics still weren’t very good. They might have looked prescient when their answers were pooled together, but the results were far less impressive if you looked at the accuracy of experts in isolation. Perhaps most distressing, at least for the egos of experts, is that non-scientists were much better at ranking the treatments against each other, forecasting which conditions would be most and least effective. (As DellaVigna pointed out in an email, this is less a consequence of expert failure and more a tribute to the fact that non-experts did “amazingly well” at the task.) The takeaway is straightforward: there might be predictive value in a diverse group of academics, but you’d be foolish to trust the forecast of a single one.

Furthermore, there was shockingly little relationship between the credentials of academia and overall performance. Full professors tended to underperform assistant professors, while having more Google Scholar citations was correlated with lower levels of accuracy. (PhD students were “at least as good” as their bosses.) Academic experience clearly has virtues. But making better predictions about experiments does not seem to be one of them.

Since Tetlock published his damning critique of political pundits, he has gone on to study so-called “superforecasters,” those amateurs whose predictions of world events are consistently more accurate than those of intelligence analysts with access to classified information. (In general, these superforecasters share a particular temperament: they’re willing to learn from their mistakes, quick to update their beliefs and tend to think in shades of gray.) After mining the data, DellaVigna and Pope were able to identify their own superforecasters. As a group, these non-experts significantly outperformed the academics, improving on the average error rate of the professors by more than 20 percent. These people had no background in behavioral research. They were paid $1.50 for 10 minutes of their time. And yet, they were better than the experts at predicting research outcomes.

The limitations of expertise are best revealed by the failure of the experts to foresee their own shortcomings. When the academics were surveyed by DellaVigna and Pope, they predicted that high-citation experts would be significantly more accurate. (The opposite turned out to be true.) They also expected PhD students to underperform the professors – that didn’t happen, either – and that academics with training in psychology would perform the best. (The data points in the opposite direction.)

It’s a poignant lapse. These experts have been trained in human behavior. They have studied our biases and flaws. And yet, when it comes to their own performance, they are blind to their own blindspots. The hardest thing to know is what we don’t.

DellaVigna, Stefano, and Devin Pope. Predicting Experimental Results: Who Knows What? NBER Working Paper, 2016.