July 22, 2014 6:15 pm

Google catches cold as debate over ‘big data hubris’ rages

It was once a symbol of the power of big data. Google Flu Trends was supposed to provide an early warning system for looming epidemics by analysing internet search terms for signs that people were coming down with the bug.

The concept – easy to understand and unambiguously good for society – became a favourite of commentators and policy makers evangelising about big data’s benefits.

Six years after its launch, however, Google Flu Trends (GFT) is now more often cited as an example of the limitations and dangers of over-reliance on online data.

During the 2012-13 flu season, GFT predicted that 10.6 per cent of the US population had influenza-like illness, when subsequent patient data showed the true figure was 6.1 per cent. The algorithm was improved for the 2013-14 season, but GFT still overestimated cases by 30 per cent.
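For scale, the gap between the two 2012-13 figures can be expressed as a relative error. The following is a minimal sketch in Python; the function and variable names are ours, and only the two percentages are taken from the figures quoted above.

```python
def relative_overestimate(predicted, observed):
    """Fraction by which a prediction exceeds the observed value."""
    return (predicted - observed) / observed

# Percentages quoted in the article for the 2012-13 flu season.
gft_estimate = 10.6   # GFT prediction, per cent
cdc_observed = 6.1    # figure from subsequent patient data, per cent

overshoot = relative_overestimate(gft_estimate, cdc_observed)
print(f"GFT overshoot in 2012-13: {overshoot:.0%}")
```

On these numbers the 2012-13 estimate overshot by roughly 74 per cent, which is the figure to set against the 30 per cent reported for the following season.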

Google’s shortcomings were laid bare in March, when researchers from Northeastern University in Boston, Harvard and elsewhere published a paper in Science magazine called “The Parable of Google Flu: Traps in Big Data Analysis”.

GFT, they said, was an example of “big data hubris” involving “the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis”.

Analysing flu reports provided to the US Centers for Disease Control and Prevention (CDC) by doctors remained more accurate than Google’s predictions, the researchers found, even though there is a two-week lag in the data.

“The comparative value of the [GFT] algorithm as a standalone flu monitor is questionable,” the paper concluded.

What went wrong? One problem was that people searched for information on flu symptoms when they really only had a cold, or because they were worried about catching flu, or because media coverage of flu outbreaks had prompted them to.

Moreover, when people search for information about flu – or anything else – through Google, a list of related search prompts encourages people to make further searches on similar subjects. This risks causing a snowballing in flu-related searches that distorts the data.

During its design phase, Google software engineers analysed more than 50m search terms for potential correlations with CDC data on reported flu cases in prior years.

Some of the strongest correlations involved searches such as “Oscar nominations” and the “March Madness” US college basketball tournament, both of which tend to coincide with peak flu season. Unhelpful examples such as these were filtered out, and the engineers settled on 45 search terms that appeared to be good indicators of flu activity.
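Why such terms surface at all can be shown with a toy correlation screen. The sketch below is not Google's method: the weekly series, the three search terms' volumes and the 0.8 cut-off are all invented for illustration, and the only idea borrowed from the account above is ranking candidate terms by their correlation with reported flu cases.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def seasonal_bump(peak_week, weeks=52, width=3.0):
    """Invented weekly series: one bell-shaped bump centred on peak_week."""
    return [math.exp(-((w - peak_week) ** 2) / (2 * width ** 2))
            for w in range(weeks)]

# All values below are made up; none are real CDC or Google data.
flu_cases = seasonal_bump(peak_week=6)
search_volume = {
    "flu symptoms": [2 * v + 0.5 for v in flu_cases],  # tracks flu directly
    "oscar nominations": seasonal_bump(peak_week=5),   # merely shares the season
    "garden furniture": seasonal_bump(peak_week=30),   # peaks in summer
}

THRESHOLD = 0.8  # hypothetical cut-off for the shortlist
scores = {term: pearson(series, flu_cases)
          for term, series in search_volume.items()}
shortlisted = [term for term, r in scores.items() if r >= THRESHOLD]

for term, r in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{term:20s} r = {r:+.2f}")
print("Shortlisted:", shortlisted)
```

In this toy set-up the awards-season term clears the cut-off alongside the genuine flu term, which is why a correlation screen on its own cannot separate indicators from coincidences and some further filtering is needed.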

The flaws in the algorithm have been seized on by sceptics who believe the benefits of big data have been overblown. However, Google’s own software engineers were open about its limitations when they launched GFT in 2008.

“This system is not designed to be a replacement for traditional surveillance networks or supplant the need for laboratory-based diagnosis,” they wrote in Nature magazine. “The data are most useful as a means to spur further investigation and collection of direct measures of disease activity.”

This was precisely the conclusion reached by this year’s Science paper on the limitations of GFT. Beyond the headlines on “big data hubris”, the researchers acknowledged that Google data could improve the accuracy of flu forecasts when combined with CDC data.

Other academics have since stepped in to defend the concept of using big data to improve epidemiology even if Google’s first attempt was flawed.

A report from researchers at Harvard University and elsewhere in July concluded that the problems were mainly methodological, raising the prospect that GFT could become more accurate. “A methodological problem has a methodological solution,” they wrote.

One possible way to build a more robust model emerged from a Pennsylvania State University study published in July, which claimed to have identified people with flu with 99 per cent accuracy based on their social media activity.

Whereas GFT was based on correlations between search terms and population-wide flu data, the Pennsylvania researchers based their model on 104 individuals who had been professionally diagnosed with the virus in the 2012-13 winter.

The researchers looked at the Twitter accounts of those people to see if they left clues about their illness when they were suffering from flu.

Just under half the people referred directly to their condition in their own tweets. Yet, by analysing other patterns of usage, the researchers were able to come up with a model that accurately diagnosed even those who did not mention their flu.

This was done through analysis of text searches, how the users interacted with their Twitter “followers”, and how intensively they used the site compared with when they did not have flu.

The Pennsylvania researchers believe that basing disease-tracking algorithms on the online behaviour of people known to have had the disease could be the key to more accurate predictions. But they also acknowledge the privacy concerns surrounding such methods.

While their study focused on flu, they noted that the same technique could be used to identify people with more “stigmatised diseases”, such as HIV, “where being able to determine if an individual is HIV positive without her knowledge and with only her Twitter handle could result in serious social and economic effects”.

They concluded: “It would seem that simply avoiding discussing an illness is not enough to hide one’s health in the age of big data.”

Copyright The Financial Times Limited 2015.