According to IBM, the computers with which we have surrounded ourselves are now generating 2.5 quintillion bytes of data a day around the world. That’s about half a CD’s worth of data per person per day. “Big data” is the topic of countless breathless conference presentations and consultants’ reports. What, then, might it contribute to economics?
Not everyone means the same thing when they talk about “big data”, but here are a few common threads. First, the dataset is far too big for a human to comprehend without a lot of help from some sort of visualisation software. The time-honoured trick of plotting a scatter graph to see what patterns or anomalies it suggests is no use here. Second, the data is often available at short notice, at least to some people. Your mobile phone company knows where your phone is right now. Third, the data may be heavily interconnected – in principle Google could have your email, your Android phone location, knowledge of who your friends are on the Google Plus social network, and your online search history. Fourth, the data is messy: videos that you store on your phone are “big data” but a far cry from neat database categories – date of birth, employment status, gender, income.
This hints at problems for economists. We have been rather spoiled: in the 1930s and 1940s pioneers such as Simon Kuznets and Richard Stone built tidy, intellectually coherent systems of national accounts. Literally billions of individual transactions are summarised as “UK GDP in 2012”; billions of price movements are represented by a single index of inflation. The data comes in nice “rectangular” form – inflation for x countries over y years, for instance.
The big data approach is very different. Take, for instance, credit card data. In principle Mastercard has a wonderful dataset: it knows who is spending how much, where, and on what kind of product, and it knows it instantly. But this is what economists Liran Einav and Jonathan Levin call a “convenience sample” – not everyone has a Mastercard, and not everyone who has one will use it much.
It would be astonishing if the Mastercard dataset couldn’t tell economic researchers something useful, but it’s very poorly matched to the kind of data we normally use or even the kind of questions we normally ask. We like to find causal links, not just patterns – and for everyone, or a representative sample of people, not for an arbitrary sub-group.
Perhaps it’s no surprise that the most immediate use of big data in economics has been in forecasting (or “nowcasting”), which has always been a pragmatic and academically marginal activity in economics. Analyses of tweets, of Google searches for unemployment benefit or motor insurance, or of trackers of trucks in Germany have been used ad hoc to understand how the economy is doing, and they seem to work well enough. MIT’s “Billion Prices Project” provides daily estimates of inflation from around the world.
More traditional attempts to use big data have been influential. For instance, Raj Chetty, John Friedman and Jonah Rockoff linked administrative data on 2.5m schoolchildren from New York City to information on what they earned as adults decades later. A single year’s exposure to a poor teacher turns out to have large and long-lasting effects on career success. Amy Finkelstein and a team of colleagues evaluated Medicaid, the low-income US healthcare programme, linking data on hospital records to credit history and other variables. Without large datasets such research would be impossible.
These recent studies promise much more to come for economics. But to take full advantage of the data revolution, the profession will have to change both what it recognises as a question, and what it recognises as an answer.
Tim Harford is the presenter of Radio 4’s “More or Less”
To comment, please email firstname.lastname@example.org